<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nicoda-27</title>
    <description>The latest articles on Forem by Nicoda-27 (@nda_27).</description>
    <link>https://forem.com/nda_27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2866016%2F6f804060-53c1-482e-ad43-288a0ba75a99.jpg</url>
      <title>Forem: Nicoda-27</title>
      <link>https://forem.com/nda_27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nda_27"/>
    <language>en</language>
    <item>
      <title>How to be Test Driven with Spark: Chapter 6: Improve the setup using devcontainer</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Fri, 17 Apr 2026 21:29:29 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-6-improve-the-setup-using-devcontainer-5dj8</link>
      <guid>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-6-improve-the-setup-using-devcontainer-5dj8</guid>
      <description>&lt;p&gt;The goal of this tutorial is to provide an easy way to be test driven with Spark on your local setup, without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials and the initial chapters can be found in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chapter 6 focuses on improving the developer setup for better reusability and reproducibility, and on reusing the same approach in the CI setup.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;chapter 2&lt;/a&gt;, we mentioned devcontainers as a way to make the development environment explicit.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" rel="noopener noreferrer"&gt;development container (devcontainer)&lt;/a&gt; describes the developer environment as an OCI image (often built with a &lt;code&gt;Dockerfile&lt;/code&gt;). The usual runtime is Docker, but tools such as &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;Podman&lt;/a&gt; are compatible with the same workflow. For simplicity, this chapter assumes Docker is installed on your machine.&lt;/p&gt;

&lt;p&gt;The full specification lives in the &lt;a href="https://containers.dev/implementors/spec/" rel="noopener noreferrer"&gt;Dev Container Specification&lt;/a&gt; on &lt;a href="https://containers.dev/" rel="noopener noreferrer"&gt;containers.dev&lt;/a&gt;. What follows is only a small subset of what devcontainers can express.&lt;/p&gt;

&lt;h2&gt;
  
  
  The devcontainer specification
&lt;/h2&gt;

&lt;p&gt;The repository uses a &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_6/.devcontainer" rel="noopener noreferrer"&gt;.devcontainer&lt;/a&gt; directory to hold the image definition. The &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_6/.devcontainer/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; is the main build recipe; we walk through it below.&lt;/p&gt;

&lt;p&gt;The first line selects the Dockerfile syntax version. The base image is Debian (&lt;code&gt;debian:trixie-slim&lt;/code&gt;); you can swap it for another image if you need a smaller footprint or a different distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# syntax=docker/dockerfile:1.4&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;debian:trixie-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The optional &lt;code&gt;FORCE_REBUILD&lt;/code&gt; argument is a cache-busting knob: changing its default value invalidates Docker’s layer cache for everything that follows, which is useful when you want a full rebuild without editing other lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; FORCE_REBUILD=20260417&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As in &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;chapter 1&lt;/a&gt;, &lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;mise&lt;/a&gt; drives tool versions. The &lt;code&gt;mise.toml&lt;/code&gt; file is copied from the build context into the image so &lt;code&gt;mise install&lt;/code&gt; can install &lt;code&gt;uv&lt;/code&gt; (and anything else declared there).&lt;/p&gt;

&lt;p&gt;Extra environment variables pin where &lt;code&gt;mise&lt;/code&gt; and &lt;code&gt;uv&lt;/code&gt; install binaries and Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; mise.toml /mise.toml&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; UV_TOOL_BIN_DIR=/usr/local/bin \&lt;/span&gt;
    UV_TOOL_DIR=/opt/uv/venv \
    UV_PYTHON_INSTALL_DIR=/opt/uv/python \
    MISE_DATA_DIR=/opt/mise

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="$MISE_DATA_DIR/shims:$PATH"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;System packages and tooling (for example &lt;code&gt;git&lt;/code&gt;, &lt;code&gt;zip&lt;/code&gt;, and the Docker CLI) are installed in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_6/.devcontainer/devcontainer-setup.sh" rel="noopener noreferrer"&gt;devcontainer-setup.sh&lt;/a&gt;, which is copied in and executed next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; devcontainer-setup.sh /devcontainer-setup.sh&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;/devcontainer-setup.sh

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /code&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;devcontainer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final stage &lt;code&gt;devcontainer&lt;/code&gt; matches the &lt;code&gt;target&lt;/code&gt; in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_6/.devcontainer/devcontainer.json" rel="noopener noreferrer"&gt;devcontainer.json&lt;/a&gt;, which also selects the Dockerfile, platform, and IDE extensions (here, the Python extension for VS Code).&lt;/p&gt;

&lt;p&gt;Build the image from the repository root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; .devcontainer/Dockerfile &lt;span class="nt"&gt;--target&lt;/span&gt; devcontainer .devcontainer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using the devcontainer in your IDE
&lt;/h2&gt;

&lt;p&gt;Modern editors can open the project &lt;em&gt;inside&lt;/em&gt; the container using a devcontainer extension—for VS Code, see &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" rel="noopener noreferrer"&gt;Developing inside a Container&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That gives newcomers a reproducible environment: the extension detects &lt;code&gt;.devcontainer/&lt;/code&gt;, builds (or pulls) the image using &lt;code&gt;devcontainer.json&lt;/code&gt;, and starts a shell where tools from the image are already on &lt;code&gt;PATH&lt;/code&gt;. Much of what &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;chapter 1&lt;/a&gt; described as manual setup becomes versioned files in the repo, which you can test in CI so they stay accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the devcontainer in CI
&lt;/h2&gt;

&lt;p&gt;Reusing the same image in continuous integration avoids depending on whatever happens to be preinstalled on the GitHub-hosted runner: the maintainer owns the image, so runner image updates do not silently change your pipeline. That improves &lt;strong&gt;reproducibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The workflow &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_6/.github/workflows/ci.yaml" rel="noopener noreferrer"&gt;.github/workflows/ci.yaml&lt;/a&gt; implements this pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image tag
&lt;/h3&gt;

&lt;p&gt;A tag is derived from a hash of every file under &lt;code&gt;.devcontainer/&lt;/code&gt;, so the image only changes when that folder’s content changes (see &lt;a href="https://docs.github.com/en/actions/reference/workflows-and-actions/expressions#hashfiles" rel="noopener noreferrer"&gt;&lt;code&gt;hashFiles&lt;/code&gt;&lt;/a&gt;). The tag is written to &lt;strong&gt;&lt;code&gt;$GITHUB_OUTPUT&lt;/code&gt;&lt;/strong&gt; (so later jobs can use &lt;code&gt;needs.build-and-push.outputs.tag&lt;/code&gt;) and to &lt;strong&gt;&lt;code&gt;$GITHUB_ENV&lt;/code&gt;&lt;/strong&gt; as &lt;code&gt;DEVCONTAINER_TAG&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Compute devcontainer image tag&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;devcontainer_tag&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;TAG="devcontainer-${{ hashFiles('.devcontainer/**') }}"&lt;/span&gt;
          &lt;span class="s"&gt;echo "tag=${TAG}" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"&lt;/span&gt;
          &lt;span class="s"&gt;echo "DEVCONTAINER_TAG=${TAG}" &amp;gt;&amp;gt; "$GITHUB_ENV"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;build-and-push&lt;/code&gt; job exposes that tag to other jobs with &lt;code&gt;outputs.tag: ${{ steps.devcontainer_tag.outputs.tag }}&lt;/code&gt;.&lt;/p&gt;
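
&lt;p&gt;A minimal sketch of that job-level declaration, assuming the job runs on &lt;code&gt;ubuntu-latest&lt;/code&gt; like the downstream jobs (the actual workflow file may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  build-and-push:
    # Assumption: same runner label as the downstream jobs
    runs-on: ubuntu-latest
    outputs:
      # Re-export the step output so other jobs can read
      # needs.build-and-push.outputs.tag
      tag: ${{ steps.devcontainer_tag.outputs.tag }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Step outputs are only visible inside their own job, so this job-level mapping is what lets the downstream jobs read the tag.&lt;/p&gt;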

&lt;h3&gt;
  
  
  Login, pull cache, then build if missing
&lt;/h3&gt;

&lt;p&gt;The job logs in to Docker Hub, then tries to &lt;strong&gt;pull&lt;/strong&gt; the image. If that tag already exists in the registry, the build is skipped; otherwise Buildx builds and pushes.&lt;/p&gt;

&lt;p&gt;Configure a repository &lt;strong&gt;variable&lt;/strong&gt; &lt;code&gt;DOCKERHUB_REPOSITORY&lt;/code&gt; (for example &lt;code&gt;youruser/spark-tdd-devcontainer&lt;/code&gt;) and &lt;strong&gt;secrets&lt;/strong&gt; &lt;code&gt;DOCKERHUB_USERNAME&lt;/code&gt; and &lt;code&gt;DOCKERHUB_TOKEN&lt;/code&gt;. The &lt;code&gt;container.image&lt;/code&gt; field cannot use the &lt;code&gt;secrets&lt;/code&gt; context for the image name, which is why the repository name lives in &lt;strong&gt;&lt;code&gt;vars&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log in to Docker Hub&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/login-action@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_USERNAME }}&lt;/span&gt;
          &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pull devcontainer image if already published&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pull&lt;/span&gt;
        &lt;span class="na"&gt;continue-on-error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;REPO&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.DOCKERHUB_REPOSITORY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker pull "${REPO}:${DEVCONTAINER_TAG}"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Docker Buildx&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.pull.outcome != 'success'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-buildx-action@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push devcontainer image&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.pull.outcome != 'success'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v6&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.devcontainer&lt;/span&gt;
          &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.devcontainer/Dockerfile&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;devcontainer&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.DOCKERHUB_REPOSITORY }}:${{ env.DEVCONTAINER_TAG }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Downstream jobs
&lt;/h3&gt;

&lt;p&gt;Formatting and tests run &lt;strong&gt;inside&lt;/strong&gt; that image via &lt;code&gt;jobs.&amp;lt;job_id&amp;gt;.container&lt;/code&gt;, using the tag exported by the &lt;code&gt;build-and-push&lt;/code&gt; job output (still driven by the same &lt;code&gt;devcontainer_tag&lt;/code&gt; step):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;Formatting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.DOCKERHUB_REPOSITORY }}:${{ needs.build-and-push.outputs.tag }}&lt;/span&gt;
      &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_USERNAME }}&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test job also mounts the host Docker socket so &lt;a href="https://testcontainers.com/" rel="noopener noreferrer"&gt;Testcontainers&lt;/a&gt; can start sibling containers (for example Spark) from within the job container.&lt;/p&gt;
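
&lt;p&gt;As a rough sketch, the socket mount can be declared with &lt;code&gt;container.volumes&lt;/code&gt;. The job name &lt;code&gt;Tests&lt;/code&gt; is hypothetical, and the socket path assumes the default Docker location on the runner; check the workflow file for the exact definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  Tests:  # hypothetical job name, for illustration only
    runs-on: ubuntu-latest
    needs: [build-and-push]
    container:
      image: ${{ vars.DOCKERHUB_REPOSITORY }}:${{ needs.build-and-push.outputs.tag }}
      credentials:
        username: ${{ secrets.DOCKERHUB_USERNAME }}
        password: ${{ secrets.DOCKERHUB_TOKEN }}
      volumes:
        # Assumed default socket path: exposes the host Docker daemon
        # to the job container so Testcontainers can use it
        - /var/run/docker.sock:/var/run/docker.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Containers started this way are created by the host daemon, so they run next to the job container rather than inside it.&lt;/p&gt;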

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The developer setup is now documented as code and tested in CI. It's a great step toward "code as documentation".&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. The repository has one branch per chapter, each showing the expected repository layout at the end of that chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_4" rel="noopener noreferrer"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;Chapter 5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_6" rel="noopener noreferrer"&gt;Chapter 6&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Several ideas come to mind on how to improve our very small codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rework the Spark container to prebuild the Docker image, as building it can be quite slow when extra packages such as deltalake or dremio are necessary&lt;/li&gt;
&lt;li&gt;Templatize the repository for easier reuse with the help of &lt;a href="https://github.com/ffizer/ffizer" rel="noopener noreferrer"&gt;ffizer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://github.com/ibis-project/ibis?tab=readme-ov-file" rel="noopener noreferrer"&gt;ibis&lt;/a&gt; to handle multiple transformation backends transparently&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>devcontainer</category>
      <category>ci</category>
    </item>
    <item>
      <title>How to for developers: Mastering your corporate MacBook Setup</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 17 May 2025 09:51:05 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-for-developers-mastering-your-corporate-macbook-setup-5eoe</link>
      <guid>https://forem.com/nda_27/how-to-for-developers-mastering-your-corporate-macbook-setup-5eoe</guid>
      <description>&lt;p&gt;Starting with a fresh &lt;em&gt;MacBook&lt;/em&gt; can be exciting, but navigating corporate IT requirements can feel daunting. This article demystifies the process with a step-by-step guide so you can set up your machine smoothly and in line with company policy, and stay productive from day one.&lt;/p&gt;

&lt;p&gt;This article focuses on a &lt;em&gt;Python&lt;/em&gt; developer persona, but most of it applies to any developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A corporate &lt;em&gt;MacBook&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Depending on your company, the &lt;em&gt;MacBook&lt;/em&gt; provided as a developer workstation can be harder to work with than a privately owned &lt;em&gt;MacBook&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will not have &lt;code&gt;sudo&lt;/code&gt; on your workstation.&lt;/li&gt;
&lt;li&gt;A proxy may be required by company policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not having &lt;code&gt;sudo&lt;/code&gt; is the main constraint; we will see how to work within it and still comply with policy—that means we will not bypass controls, but we will use what macOS allows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Homebrew
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://brew.sh/" rel="noopener noreferrer"&gt;&lt;em&gt;Homebrew&lt;/em&gt;&lt;/a&gt; is the go-to package manager for developers on &lt;em&gt;macOS&lt;/em&gt;. It plays a similar role to &lt;a href="https://documentation.ubuntu.com/server/how-to/software/package-management/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;apt&lt;/code&gt;&lt;/a&gt; on &lt;em&gt;Ubuntu&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The official documentation expects &lt;code&gt;sudo&lt;/code&gt; for the default install. You can also install it for your user only with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/homebrew &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/Homebrew/brew/tarball/master | &lt;span class="nb"&gt;tar &lt;/span&gt;xz &lt;span class="nt"&gt;--strip&lt;/span&gt; 1 &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/homebrew
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a &lt;code&gt;homebrew&lt;/code&gt; directory under &lt;code&gt;$HOME&lt;/code&gt; where Homebrew is installed.&lt;/p&gt;

&lt;p&gt;Add it to your &lt;code&gt;PATH&lt;/code&gt; permanently. In the rest of this article, &lt;em&gt;zsh&lt;/em&gt; is assumed as your shell (but you can use any shell you prefer).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH=$HOME/homebrew/bin:$PATH'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check that Homebrew works
&lt;/h3&gt;

&lt;p&gt;The following command applies the &lt;code&gt;PATH&lt;/code&gt; change in your current terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next command should print a version, which confirms that Homebrew is installed and usable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CLI tools
&lt;/h3&gt;

&lt;p&gt;Install a first tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can install other developer tools available as Homebrew formulae; they are installed for your user without &lt;code&gt;sudo&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  GUI applications
&lt;/h3&gt;

&lt;p&gt;GUI apps need an Applications folder. By default, Homebrew Cask targets &lt;code&gt;/Applications&lt;/code&gt; at the root of the disk, which you may not be allowed to write to.&lt;/p&gt;

&lt;p&gt;For example, try &lt;a href="https://github.com/MuhammedKalkan/OpenLens" rel="noopener noreferrer"&gt;&lt;em&gt;OpenLens&lt;/em&gt;&lt;/a&gt;—a tool that provides a UI to inspect your &lt;em&gt;Kubernetes&lt;/em&gt; cluster. It is available as a &lt;a href="https://formulae.brew.sh/cask/openlens" rel="noopener noreferrer"&gt;Homebrew cask&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; openlens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;that install can fail because Homebrew cannot use &lt;code&gt;/Applications&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead, create an Applications folder under your home directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/Applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tell Homebrew to use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; openlens &lt;span class="nt"&gt;--appdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/Applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should be able to open &lt;em&gt;OpenLens&lt;/em&gt; from Spotlight. You can install GUI apps this way without &lt;code&gt;sudo&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rosetta
&lt;/h2&gt;

&lt;p&gt;Sometimes you need a specific CPU architecture for a binary (either Apple silicon or Intel).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Apple silicon&lt;/em&gt; (&lt;em&gt;M1&lt;/em&gt;, &lt;em&gt;M2&lt;/em&gt;, &lt;em&gt;M3&lt;/em&gt;, &lt;em&gt;M4&lt;/em&gt;) Macs use &lt;em&gt;arm64&lt;/em&gt; natively. &lt;em&gt;Rosetta&lt;/em&gt; lets you run &lt;em&gt;x86_64&lt;/em&gt; binaries. That helps when libraries exist only for one architecture; &lt;em&gt;x86_64&lt;/em&gt; is older, so binaries are often available there first.&lt;/p&gt;

&lt;p&gt;You can add aliases to start a shell under Rosetta or natively. Add these to your &lt;code&gt;.zshrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;arm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"env /usr/bin/arch -arm64 /bin/zsh --login"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;intel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"env /usr/bin/arch -x86_64 /bin/zsh --login"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see which architecture your current shell is using, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;arch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see either &lt;code&gt;i386&lt;/code&gt; (Intel / Rosetta) or &lt;code&gt;arm64&lt;/code&gt; (Apple silicon).&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing language-specific tools
&lt;/h2&gt;

&lt;p&gt;To manage multiple versions of Python, Node, the AWS CLI, Cargo, and more, this guide uses &lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;mise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can run two setups—one native and one under Rosetta—so environments match each architecture, as described in the &lt;a href="https://mise.jdx.dev/tips-and-tricks.html#macos-rosetta" rel="noopener noreferrer"&gt;mise macOS Rosetta notes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Follow that installation path; you should then have the x86_64 binary available as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise-x64 &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the standard &lt;em&gt;arm64&lt;/em&gt; &lt;code&gt;mise&lt;/code&gt; from Homebrew if you do not need x86_64-specific toolchains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then install &lt;em&gt;Python&lt;/em&gt;, &lt;em&gt;Node&lt;/em&gt;, and &lt;a href="https://mise.jdx.dev/registry.html#tools" rel="noopener noreferrer"&gt;many other tools&lt;/a&gt; with mise.&lt;/p&gt;

&lt;p&gt;Example: install Python for an x86_64 toolchain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise-x64 use python@3.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing PHP
&lt;/h2&gt;

&lt;p&gt;It is a bit fiddly, but doable: you can use &lt;code&gt;mise-x64&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PHP_CONFIGURE_OPTIONS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"--with-openssl=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;brew &lt;span class="nt"&gt;--prefix&lt;/span&gt; openssl&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; --with-iconv=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;brew &lt;span class="nt"&gt;--prefix&lt;/span&gt; libiconv&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mise-x64 use php@8.4 &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because ARM support can be limited, an x86_64 build is often more reliable here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containerization tools
&lt;/h2&gt;

&lt;p&gt;For containers, several options exist.&lt;/p&gt;

&lt;p&gt;The most common is &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Docker&lt;/em&gt;&lt;/a&gt;; open-source alternatives such as &lt;a href="https://podman-desktop.io/" rel="noopener noreferrer"&gt;&lt;em&gt;Podman&lt;/em&gt;&lt;/a&gt; exist as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;

&lt;p&gt;You will usually need IT to install Docker Desktop, because it requires elevated privileges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Podman
&lt;/h3&gt;

&lt;p&gt;Podman can replace Docker for many workflows and does not require the same elevated setup. Install it with Homebrew or follow the &lt;a href="https://podman-desktop.io/docs/installation/macos-install" rel="noopener noreferrer"&gt;Podman Desktop macOS installation guide&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;podman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can alias &lt;code&gt;docker&lt;/code&gt; to &lt;code&gt;podman&lt;/code&gt; for convenience in your &lt;code&gt;.zshrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;docker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;podman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Podman CLI is largely compatible with the Docker CLI.&lt;/p&gt;

&lt;p&gt;Try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run hello-world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On first use you may need to initialize the Podman machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman machine init
podman machine start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://podman-desktop.io/docs/troubleshooting/troubleshooting-podman-on-macos" rel="noopener noreferrer"&gt;Podman Desktop macOS troubleshooting&lt;/a&gt; page if something fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; Some stacks, such as &lt;a href="https://github.com/testcontainers/testcontainers-python" rel="noopener noreferrer"&gt;Testcontainers for Python&lt;/a&gt;, expect a real Docker daemon and may not work fully with Podman. &lt;a href="https://podman-desktop.io/docs/migrating-from-docker/customizing-docker-compatibility" rel="noopener noreferrer"&gt;podman-mac-helper&lt;/a&gt; and socket compatibility can help, but that often still needs cooperation from IT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxies and certificates
&lt;/h2&gt;

&lt;p&gt;Your company may use an HTTP proxy with its own TLS certificate. This section skips why that exists and focuses on how to work with it.&lt;/p&gt;

&lt;p&gt;Tools such as the &lt;a href="https://learn.microsoft.com/en-us/cli/azure/?view=azure-cli-latest" rel="noopener noreferrer"&gt;Azure CLI&lt;/a&gt; may fail until trust is configured; see &lt;a href="https://learn.microsoft.com/en-us/cli/azure/use-azure-cli-successfully-troubleshooting?view=azure-cli-latest#work-behind-a-proxy" rel="noopener noreferrer"&gt;Azure CLI: work behind a proxy&lt;/a&gt;. That example is a good check that your proxy and certificates are set up correctly.&lt;/p&gt;

&lt;p&gt;As documented, set &lt;code&gt;REQUESTS_CA_BUNDLE&lt;/code&gt; to the path of your combined CA bundle. The same idea applies when Python tools need HTTPS access to install dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a certificate bundle
&lt;/h3&gt;

&lt;p&gt;Build one PEM file that merges system and corporate roots; you will point tools at it.&lt;/p&gt;

&lt;p&gt;Create a directory and an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/certs
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export CORPORATE_CERT_DIR=$HOME/certs'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export certificates from the keychains and concatenate them into &lt;code&gt;allCAbundle.pem&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; certs &lt;span class="nt"&gt;-f&lt;/span&gt; pemseq &lt;span class="nt"&gt;-k&lt;/span&gt; /Library/Keychains/System.keychain &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/selfSignedCAbundle.pem
security &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; certs &lt;span class="nt"&gt;-f&lt;/span&gt; pemseq &lt;span class="nt"&gt;-k&lt;/span&gt; /System/Library/Keychains/SystemRootCertificates.keychain &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/bundleCA.pem
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/bundleCA.pem &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/selfSignedCAbundle.pem &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/allCAbundle.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To inspect the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/allCAbundle.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
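
&lt;p&gt;Reading a long PEM file by eye is error-prone. As a rough sanity check, the Python sketch below counts the certificates in the bundle by looking for PEM header lines. The helper name &lt;code&gt;count_pem_certificates&lt;/code&gt; and the fallback to &lt;code&gt;~/certs&lt;/code&gt; are illustrative assumptions, not part of any tool mentioned here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def count_pem_certificates(pem_text):
    """Count certificates by counting PEM header lines (hypothetical helper)."""
    return sum(1 for line in pem_text.splitlines() if line.strip() == PEM_HEADER)


if __name__ == "__main__":
    # Assumes CORPORATE_CERT_DIR is exported as above; falls back to ~/certs.
    cert_dir = Path(os.environ.get("CORPORATE_CERT_DIR", Path.home() / "certs"))
    bundle = cert_dir / "allCAbundle.pem"
    if bundle.exists():
        print(count_pem_certificates(bundle.read_text()))
    else:
        print(f"{bundle} not found")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A count of zero, or a count that grows every time you rebuild the bundle, suggests the export or concatenation step needs another look.&lt;/p&gt;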



&lt;h3&gt;
  
  
  Using the bundle
&lt;/h3&gt;

&lt;p&gt;Append exports to &lt;code&gt;.zshrc&lt;/code&gt;. Different tools use different variables; setting several covers most cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export REQUESTS_CA_BUNDLE=$CORPORATE_CERT_DIR/allCAbundle.pem'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export SSL_CERT_FILE=$CORPORATE_CERT_DIR/allCAbundle.pem'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export CURL_CA_BUNDLE=$CORPORATE_CERT_DIR/allCAbundle.pem'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export NODE_EXTRA_CA_CERTS=$CORPORATE_CERT_DIR/allCAbundle.pem'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, &lt;code&gt;az&lt;/code&gt; and typical package managers (&lt;code&gt;uv&lt;/code&gt;, &lt;code&gt;poetry&lt;/code&gt;, &lt;code&gt;npm&lt;/code&gt;, etc.) should work through the proxy.&lt;/p&gt;

&lt;p&gt;If you are off the corporate network and the proxy is disabled, you may need to &lt;code&gt;unset&lt;/code&gt; these variables (for example &lt;code&gt;unset REQUESTS_CA_BUNDLE&lt;/code&gt;) before commands like &lt;code&gt;az login&lt;/code&gt;, depending on your environment.&lt;/p&gt;
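
&lt;p&gt;Instead of unsetting the variables for the whole shell, one option is to run a single command with a filtered environment. The Python sketch below assumes the four variable names exported above; the helper name &lt;code&gt;environment_without_ca_bundle&lt;/code&gt; is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Variable names taken from the exports above.
CA_VARIABLES = (
    "REQUESTS_CA_BUNDLE",
    "SSL_CERT_FILE",
    "CURL_CA_BUNDLE",
    "NODE_EXTRA_CA_CERTS",
)


def environment_without_ca_bundle(environ):
    """Return a copy of environ without the CA bundle variables (hypothetical helper)."""
    return {key: value for key, value in environ.items() if key not in CA_VARIABLES}


# Possible usage, assuming the Azure CLI is installed:
#   import subprocess
#   subprocess.run(["az", "login"], env=environment_without_ca_bundle(os.environ))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The original environment is left untouched, so the next command you run on the corporate network still sees the bundle.&lt;/p&gt;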

&lt;h2&gt;
  
  
  Java specifics
&lt;/h2&gt;

&lt;p&gt;The JVM uses its own trust store. You can import a PEM into a JKS with &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/tools/unix/keytool.html" rel="noopener noreferrer"&gt;&lt;code&gt;keytool&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you prefer not to install a full JDK locally, you can run &lt;code&gt;keytool&lt;/code&gt; from a container.&lt;/p&gt;

&lt;p&gt;The following example mounts your cert directory and runs &lt;code&gt;keytool&lt;/code&gt; inside an image that already ships Java (here &lt;code&gt;apache/spark&lt;/code&gt;; any image with &lt;code&gt;keytool&lt;/code&gt; is fine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;:/opt/java/openjdk/lib/security/jssecacerts &lt;span class="nt"&gt;-it&lt;/span&gt; apache/spark keytool &lt;span class="nt"&gt;-import&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;-trustcacerts&lt;/span&gt; &lt;span class="nt"&gt;-alias&lt;/span&gt; endeca-ca &lt;span class="nt"&gt;-file&lt;/span&gt; /opt/java/openjdk/lib/security/jssecacerts/my_custom_certificate.pem &lt;span class="nt"&gt;-keystore&lt;/span&gt; /opt/java/openjdk/lib/security/jssecacerts/truststore.ks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You cannot always point &lt;code&gt;keytool&lt;/code&gt; at the merged &lt;code&gt;allCAbundle.pem&lt;/code&gt;; you often need a single PEM for the corporate issuing CA. Export the right certificate from Keychain Access or your IT docs.&lt;/p&gt;

&lt;p&gt;Example lookup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security find-certificate &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"corporate_proxy_name"&lt;/span&gt; /Library/Keychains/System.keychain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then export it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security find-certificate &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"corporate_proxy_name"&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /Library/Keychains/System.keychain &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/my_custom_certificate.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or obtain it from a TLS handshake (adjust host and options to match your environment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-w&lt;/span&gt; %&lt;span class="o"&gt;{&lt;/span&gt;certs&lt;span class="o"&gt;}&lt;/span&gt; https://example.com &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/my_custom_certificate.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import with &lt;code&gt;keytool&lt;/code&gt; as above; you will be prompted for a keystore password.&lt;/p&gt;

&lt;p&gt;Point the JVM at the JKS when you run apps, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Djavax&lt;/span&gt;.net.ssl.trustStore&lt;span class="o"&gt;=&lt;/span&gt;/opt/java/openjdk/lib/security/jssecacerts/truststore.ks &lt;span class="nt"&gt;-Djavax&lt;/span&gt;.net.ssl.trustStorePassword&lt;span class="o"&gt;=&lt;/span&gt;your_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storing paths and passwords in environment variables (without committing secrets) keeps builds repeatable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export CORPORATE_JKS_CERT_PATH=$CORPORATE_CERT_DIR/truststore.ks'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export CORPORATE_JKS_CERT_PASS=your_password'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Docker and corporate TLS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  At build time
&lt;/h3&gt;

&lt;p&gt;When you &lt;code&gt;docker build&lt;/code&gt;, the build may also need your CA bundle for HTTPS.&lt;/p&gt;

&lt;p&gt;You can pass files in with BuildKit &lt;a href="https://docs.docker.com/build/building/context/#additional-build-contexts" rel="noopener noreferrer"&gt;&lt;code&gt;--build-context&lt;/code&gt;&lt;/a&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; your_dockerfile &lt;span class="nt"&gt;--build-context&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt; your_docker_context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Dockerfile, mount that context and set &lt;code&gt;SSL_CERT_FILE&lt;/code&gt; / related variables before &lt;code&gt;apt-get&lt;/code&gt; or similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;bind&lt;/span&gt;,from&lt;span class="o"&gt;=&lt;/span&gt;config,target&lt;span class="o"&gt;=&lt;/span&gt;/tmp/certs &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REQUESTS_CA_BUNDLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/certs/allCAbundle.pem &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SSL_CERT_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/certs/allCAbundle.pem &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CURL_CA_BUNDLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/certs/allCAbundle.pem &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  At runtime
&lt;/h3&gt;

&lt;p&gt;If a container needs the corporate CA at runtime, mount the PEM and append it to the image trust store in an entrypoint, or bake it in during build. What follows is one illustrative pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$CORPORATE_CERT_DIR&lt;/span&gt;/my_custom_certificate.pem:/certs/my_custom_certificate.pem postgres /bin/bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"cat /certs/my_custom_certificate.pem &amp;gt;&amp;gt; /etc/ssl/certs/ca-certificates.crt &amp;amp;&amp;amp; /bin/bash"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Adjust paths and base image to match your stack.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully this helps you stay compliant with corporate constraints while remaining productive.&lt;/p&gt;

&lt;p&gt;If something is unclear or a topic is missing, say so in the comments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update (2026-04-18):&lt;/strong&gt; Added Java and Docker notes for proxy certificates, and PHP installation. Proofreading.&lt;/p&gt;

</description>
      <category>macos</category>
      <category>development</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 5: Leverage spark in a container</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 15 Mar 2025 07:33:58 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74</link>
      <guid>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74</guid>
      <description>&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with Spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials and the initial chapters can be found in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;chapter 3&lt;/a&gt;, it was demonstrated that the current testing approach relies on &lt;em&gt;Java&lt;/em&gt; being available on the developer setup. As mentioned, this is not ideal: there is limited control over that setup, and unexpected behavior can happen. A good testing practice is to have reproducible and idempotent tests, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launching the tests any number of times should always give the same results&lt;/li&gt;
&lt;li&gt;A test should leave a clean slate after it has run; there should be no side effects (no files written, no environment variables changed, no data left in a database, etc.)&lt;/li&gt;
&lt;/ul&gt;
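
&lt;p&gt;As a minimal illustration of these two properties (not taken from the project itself), the sketch below tests a hypothetical &lt;code&gt;write_report&lt;/code&gt; function inside a temporary directory, so that nothing remains on disk once the test is done and the test can be rerun at will:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tempfile
from pathlib import Path


def write_report(directory, rows):
    """Hypothetical function under test: writes one line per row."""
    target = Path(directory) / "report.txt"
    target.write_text("\n".join(rows))
    return target


def test_write_report_leaves_no_files():
    # The temporary directory is deleted when the block exits,
    # so the test has no lasting side effect on the file system.
    with tempfile.TemporaryDirectory() as tmp:
        report = write_report(tmp, ["a", "b"])
        assert report.read_text() == "a\nb"
    assert not Path(tmp).exists()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running this test twice in a row gives the same result, because each run starts from a fresh directory.&lt;/p&gt;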

&lt;p&gt;This matters because otherwise you will spend most of your time relaunching the tests due to false positives, never sure whether you actually broke something or whether the test is failing randomly. In the end, you will stop trusting the tests and skip some of them, which defeats the purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use a container?
&lt;/h2&gt;

&lt;p&gt;If you are unfamiliar with the concept of containers and docker images, I suggest you have a look at &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;docker&lt;/a&gt;. It will be leveraged here to start the &lt;em&gt;Spark&lt;/em&gt; server for the tests. There are also open-source alternatives for containerization, such as &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;podman&lt;/a&gt; or &lt;a href="https://github.com/containerd/nerdctl" rel="noopener noreferrer"&gt;nerdctl&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Docker will be used from here on, as it has become the de facto standard for most companies and it is available in the &lt;em&gt;GitHub&lt;/em&gt; CI runners. It will be assumed that you have enough knowledge about the technology to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container with spark connect
&lt;/h2&gt;

&lt;p&gt;There is a subtlety that needs to be understood. Previously, the &lt;em&gt;Java Virtual Machine (JVM)&lt;/em&gt; communicated with the Python Spark implementation through the &lt;code&gt;spark_session&lt;/code&gt;; the Java binary was used to create a swarm of workers that handled the data processing. At the end, all the results were collected and passed to the &lt;code&gt;spark_session&lt;/code&gt;, which exposed them in the Python code.&lt;/p&gt;

&lt;p&gt;If you simply move this into a container, the &lt;code&gt;spark_session&lt;/code&gt; will not be able to find the &lt;em&gt;JVM&lt;/em&gt;, because the binary now lives inside the container. The container you want to create needs a way to communicate with the &lt;code&gt;spark_session&lt;/code&gt; outside it, over the network. Luckily, &lt;a href="https://spark.apache.org/docs/3.5.3/spark-connect-overview.html" rel="noopener noreferrer"&gt;&lt;em&gt;Spark&lt;/em&gt; connect&lt;/a&gt; provides a solution, and its documentation is a must-read. This is the approach chosen here to containerize the &lt;em&gt;Spark&lt;/em&gt; server and the worker creation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spark&lt;/em&gt; is already providing a docker &lt;a href="https://hub.docker.com/r/apache/spark" rel="noopener noreferrer"&gt;image&lt;/a&gt; that you will leverage. If you don't have docker available on your setup, you will need to install it, see the official &lt;a href="https://docs.docker.com/engine/install/ubuntu/#installation-methods" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's uninstall &lt;code&gt;openjdk&lt;/code&gt; to make sure &lt;code&gt;spark_session&lt;/code&gt; will use the new setup; this requires elevated privileges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get autoremove openjdk-8-jre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now relaunch the tests; they are expected to fail with the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
ERROR tests/test_minimal_transfo.py::test_transfo_w_synthetic_data - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Start the container
&lt;/h2&gt;

&lt;p&gt;You will need to start the container with spark connect. You can launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8081:8081 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SPARK_NO_DAEMONIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True &lt;span class="nt"&gt;--name&lt;/span&gt; spark_connect apache/spark /opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master &lt;span class="nt"&gt;--packages&lt;/span&gt; org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.driver.extraJavaOptions&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'-Divy.cache.dir=/tmp -Divy.home=/tmp'&lt;/span&gt; &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.connect.grpc.binding.port&lt;span class="o"&gt;=&lt;/span&gt;8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will print a lot in the terminal and at the end you should have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;24/12/27 14:04:27 INFO SparkConnectServer: Spark Connect server started at: 0:0:0:0:0:0:0:0%0:8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows that the &lt;em&gt;Spark&lt;/em&gt; server is up and running.&lt;/p&gt;

&lt;p&gt;Each argument in the above command has a meaning and its importance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run&lt;/code&gt; is the docker command to start a container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p 8081:8081&lt;/code&gt; is an argument to &lt;code&gt;docker run&lt;/code&gt; that publishes port 8081 so you can communicate with the created container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-e SPARK_NO_DAEMONIZE=True&lt;/code&gt; is an environment variable passed to the container; it is needed for the server to run as a foreground process&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name spark_connect&lt;/code&gt; names the created container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apache/spark&lt;/code&gt; is the docker image that is used, if you never used it, it will be downloaded from &lt;a href="https://hub.docker.com/r/apache/spark" rel="noopener noreferrer"&gt;&lt;em&gt;Docker Hub&lt;/em&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of the command is what is called an &lt;a href="https://docs.docker.com/reference/dockerfile/#entrypoint" rel="noopener noreferrer"&gt;entrypoint&lt;/a&gt;; it's the command that will be executed inside the container. Here it contains multiple elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/opt/spark/sbin/start-connect-server.sh&lt;/code&gt; is the script that starts the spark server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;org.apache.spark.deploy.master.Master&lt;/code&gt; is an argument to the script; here it is asked to deploy a Master server, and the same script can be used to deploy a Worker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0&lt;/code&gt; is an optional argument to pass specific versions of the spark and delta dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp'&lt;/code&gt; is an extra argument that asks the server to write to &lt;code&gt;/tmp&lt;/code&gt; inside the container; it's not mandatory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf spark.connect.grpc.binding.port=8081&lt;/code&gt; is an extra argument to start the server on port 8081 on the localhost of the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last argument is where the magic happens: the server is started on port 8081, and docker publishes this container port on the same port of the docker host. This means a spark server is now available on &lt;code&gt;localhost:8081&lt;/code&gt;, which a &lt;em&gt;Spark&lt;/em&gt; connect client addresses as &lt;code&gt;sc://localhost:8081&lt;/code&gt;.&lt;/p&gt;
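
&lt;p&gt;To confirm that the published port is reachable from the host before running any test, a plain TCP check is enough. The sketch below is an illustrative helper (the name &lt;code&gt;is_port_open&lt;/code&gt; is an assumption); it only verifies that something accepts connections on the port, not that the &lt;em&gt;Spark&lt;/em&gt; server is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds (illustrative helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8081 is the port published by the docker run command above.
    print(is_port_open("localhost", 8081))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If this prints &lt;code&gt;False&lt;/code&gt; while the container is running, the &lt;code&gt;-p&lt;/code&gt; mapping or the binding port is the first thing to check.&lt;/p&gt;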

&lt;h2&gt;
  
  
  Use the container
&lt;/h2&gt;

&lt;p&gt;Keep the previous terminal open to keep the server running, and open a new terminal. Now run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_transfo_w_synthetic_data &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same error should appear: the &lt;code&gt;spark_session&lt;/code&gt; needs to be adapted to connect to the server you have just created. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_5/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://localhost:8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, it gives the &lt;em&gt;Spark&lt;/em&gt; session the &lt;em&gt;url&lt;/em&gt; of the &lt;em&gt;Spark&lt;/em&gt; connect server.&lt;/p&gt;

&lt;p&gt;You also need to add an extra dependency, which is mandatory to communicate with the spark connect server. It's worth pointing out the use of &lt;a href="https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies" rel="noopener noreferrer"&gt;extras&lt;/a&gt; in uv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pyspark &lt;span class="nt"&gt;--extra&lt;/span&gt; connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As this project uses &lt;em&gt;Python&lt;/em&gt; 3.12, another error will appear related to &lt;a href="https://stackoverflow.com/questions/69919970/no-module-named-distutils-util-but-distutils-installed/76691103#76691103" rel="noopener noreferrer"&gt;distutils&lt;/a&gt;, as it was removed in that python version, yet some dependencies still require it. You will have to add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add setuptools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should run successfully, and you should also see logs from the spark server in the docker run terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve the container usage
&lt;/h2&gt;

&lt;p&gt;As mentioned at the beginning of this chapter, the tests need to leave a clean slate. In the previous approach, a container is still running even though the tests are done, which is not ideal.&lt;/p&gt;

&lt;p&gt;To improve this, you will leverage &lt;a href="https://github.com/testcontainers/testcontainers-python" rel="noopener noreferrer"&gt;testcontainers&lt;/a&gt;, which gives you easy container creation and removal at the test level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add testcontainers &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the container can be started at the session fixture level. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_5/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;, you can add an extra fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.container&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerContainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.waiting_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wait_for_logs&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entrypoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-Divy.cache.dir=/tmp -Divy.home=/tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; --conf spark.connect.grpc.binding.port=8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;DockerContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apache/spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_bind_ports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_NO_DAEMONIZE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_kwargs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wait_for_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkConnectServer: Spark Connect server started at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a container with the arguments described previously. The great thing about starting it inside a fixture is that the container is stopped at the end of the test session. There is one extra step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wait_for_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkConnectServer: Spark Connect server started at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the container is yielded only once &lt;code&gt;SparkConnectServer: Spark Connect server started at&lt;/code&gt; has appeared in the container logs. The server must be ready before it can be called.&lt;/p&gt;
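
&lt;p&gt;To make the idea of waiting on a log line concrete, here is a minimal sketch in plain Python. It is not how &lt;code&gt;testcontainers&lt;/code&gt; implements &lt;code&gt;wait_for_logs&lt;/code&gt;; &lt;code&gt;wait_for_marker&lt;/code&gt; and &lt;code&gt;fake_logs&lt;/code&gt; are hypothetical names, and the fake log source only stands in for a real container:&lt;/p&gt;

```python
import time


def wait_for_marker(read_logs, marker, max_attempts=50, interval=0.1):
    # Hypothetical helper: poll read_logs() until marker shows up in the output.
    for _ in range(max_attempts):
        if marker in read_logs():
            return True
        time.sleep(interval)
    # The marker never appeared within the allowed number of attempts.
    return False


# Fake log source standing in for a container: one more line per read.
chunks = iter(
    [
        "Starting...",
        "Loading packages...",
        "SparkConnectServer: Spark Connect server started at",
    ]
)
seen = []


def fake_logs():
    # Accumulate the lines read so far, like a growing log file.
    seen.append(next(chunks, ""))
    return "\n".join(seen)


ready = wait_for_marker(fake_logs, "Spark Connect server started at", interval=0.0)
```

&lt;p&gt;Here the marker only shows up on the third read, so the helper polls three times before reporting that the server is ready.&lt;/p&gt;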

&lt;p&gt;The yielded value is the container, which also exposes the server host. You need to reuse it in the &lt;code&gt;spark_session&lt;/code&gt; fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DockerContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_container_host_ip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now stop the container you started earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop spark_connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice that all the tests pass and that no containers are left running at the end of the test session.&lt;/p&gt;
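
&lt;p&gt;The cleanup happens because the fixture yields from inside a &lt;code&gt;with&lt;/code&gt; block: once the session is over, the fixture resumes after its &lt;code&gt;yield&lt;/code&gt; and the block exits. The following is a minimal sketch of that mechanism in plain Python, with no Docker involved. &lt;code&gt;fake_container&lt;/code&gt; and &lt;code&gt;fixture&lt;/code&gt; are hypothetical stand-ins, and the way the generator is driven here only approximates what &lt;em&gt;Pytest&lt;/em&gt; does:&lt;/p&gt;

```python
from contextlib import contextmanager

events = []


@contextmanager
def fake_container():
    # Hypothetical stand-in for DockerContainer used as a context manager.
    events.append("started")
    try:
        yield "container"
    finally:
        # Runs when the with block exits, like stopping the real container.
        events.append("stopped")


def fixture():
    # Same shape as the session fixture: yield from inside the with block.
    with fake_container() as container:
        yield container


gen = fixture()
value = next(gen)  # setup: runs up to the yield
events.append("tests run")
next(gen, None)  # teardown: resumes after the yield and leaves the with block
```

&lt;p&gt;After this runs, &lt;code&gt;events&lt;/code&gt; holds the three steps in order: started, tests run, stopped.&lt;/p&gt;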

&lt;p&gt;The following command lists all containers, including stopped ones. The Spark container should not appear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps &lt;span class="nt"&gt;-a&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You are now able to run local tests using Spark, so you can quickly iterate on your codebase and implement new features. You no longer depend on a Spark server being launched for you in the cloud, nor do you have to wait for it to process the data.&lt;/p&gt;

&lt;p&gt;The feedback loop is quicker, you no longer pay a cloud provider for testing purposes, and you provide an easy setup for developers to iterate on your project.&lt;/p&gt;

&lt;p&gt;They can launch &lt;code&gt;pytest&lt;/code&gt; and the container handling is transparent to them; this also means less documentation for you to write to describe the expected developer setup.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_4" rel="noopener noreferrer"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;Chapter 5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Several ideas come to mind on how to improve our very small codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage &lt;a href="https://containers.dev/" rel="noopener noreferrer"&gt;devcontainer&lt;/a&gt; to improve ci and local development&lt;/li&gt;
&lt;li&gt;Turn the repository into a template for easier reuse with the help of &lt;a href="https://github.com/ffizer/ffizer" rel="noopener noreferrer"&gt;ffizer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://github.com/ibis-project/ibis?tab=readme-ov-file" rel="noopener noreferrer"&gt;ibis&lt;/a&gt; to handle multiple transformation backends transparently&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pyspark</category>
      <category>python</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 4 - Leaning into Property Based Testing</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sun, 09 Mar 2025 08:38:56 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln</link>
      <guid>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln</guid>
      <description>&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials and the initial chapters can be found in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test that you implemented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tutorials/chapter_3_spark_test.md" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt; is great, yet incomplete, as it only covers a limited amount of data. As Spark is used to process data at scale, you have to test at scale too.&lt;/p&gt;

&lt;p&gt;There are several solutions. The first one is to take a snapshot of production data and reuse it at the test level (meaning integration test or local test). The second one is to generate synthetic data based on the data schema. With the second approach, you will be leaning into a property based testing approach.&lt;/p&gt;

&lt;p&gt;The second approach is used here, as test case generation is delegated to an automated tool.&lt;/p&gt;

&lt;p&gt;The Python ecosystem provides &lt;a href="https://hypothesis.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;em&gt;Hypothesis&lt;/em&gt;&lt;/a&gt; for proper property based testing, and &lt;a href="https://faker.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;&lt;em&gt;Faker&lt;/em&gt;&lt;/a&gt; for fake data generation. &lt;em&gt;Hypothesis&lt;/em&gt; is far more powerful than &lt;em&gt;Faker&lt;/em&gt; in the sense that it generates test cases for you based on data properties (being a string, being an integer, etc.) and shrinks the test cases when unexpected behavior happens. &lt;em&gt;Faker&lt;/em&gt; is used here to generate synthetic data based on business properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  A data driven test
&lt;/h2&gt;

&lt;p&gt;You need two new fixtures similar to &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt; that will generate synthetic data. First you need to install faker as a dev dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add faker &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create &lt;code&gt;persons_synthetic&lt;/code&gt; in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;nb_elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pyint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above, a data frame with a random number of rows, between 1 and 100 000, is generated. Feel free to raise the upper bound to generate larger data frames. Fake names, surnames and dates are generated on the fly according to business needs.&lt;/p&gt;

&lt;p&gt;You can also create &lt;code&gt;employments_synthetic&lt;/code&gt; in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;. Its foreign key, &lt;code&gt;person_fk&lt;/code&gt;, depends on the ids of &lt;code&gt;persons_synthetic&lt;/code&gt;, and this dependency needs to be handled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;persons_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;person_ids_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons_sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;collect_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_fk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;job&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_fk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_ids_sample&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The foreign key &lt;code&gt;person_fk&lt;/code&gt; is taken from a sample of &lt;code&gt;persons_synthetic&lt;/code&gt; ids, and job names are generated on the fly.&lt;/p&gt;
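
&lt;p&gt;The foreign key handling can be sketched without Spark or Faker. This is only an illustration in plain Python: the fixed seed, the ten person ids and the &lt;code&gt;jobs&lt;/code&gt; list (a stand-in for &lt;code&gt;fake.job()&lt;/code&gt;) are assumptions made for the example:&lt;/p&gt;

```python
import random

rng = random.Random(42)  # fixed seed, only to make this sketch repeatable

person_ids = list(range(10))  # assumed ids of the synthetic persons

# Keep each id with probability 0.8, in the spirit of persons_synthetic.sample(0.8).
sampled_ids = [
    pid
    for pid in person_ids
    if rng.choices([True, False], weights=[0.8, 0.2])[0]
]

jobs = ["engineer", "teacher", "nurse"]  # hypothetical stand-in for fake.job()

# Same shape as the fixture: (id, person_fk, Employment).
employments = [
    (idx, id_fk, rng.choice(jobs)) for idx, id_fk in enumerate(sampled_ids)
]
```

&lt;p&gt;Every &lt;code&gt;person_fk&lt;/code&gt; produced this way points to an existing person, while some persons end up without any employment, which is useful to exercise the join.&lt;/p&gt;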

&lt;p&gt;The test can now be created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfo_w_synthetic_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now launch &lt;code&gt;pytest -k test_transfo_w_synthetic_data -s&lt;/code&gt;, which should pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to handle slow tests
&lt;/h2&gt;

&lt;p&gt;You might notice that &lt;code&gt;test_transfo_w_synthetic_data&lt;/code&gt; is a bit slow. Indeed, it generates a decent amount of data (even though far from a big data scale), modifies the data frames and joins them together.&lt;/p&gt;

&lt;p&gt;In a test driven approach, it's necessary to have a quick feedback loop to iterate quickly on your local setup. Yet, these tests still need to be launched, as they validate behavior with a decent amount of data.&lt;/p&gt;

&lt;p&gt;A solution is to add tags to tests like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.slow&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfo_w_synthetic_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tag can be leveraged by pytest to filter out tests at execution time, see &lt;a href="https://docs.pytest.org/en/stable/example/markers.html#mark-examples" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then add the expected markers for &lt;em&gt;Pytest&lt;/em&gt; to &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/pyproject.toml" rel="noopener noreferrer"&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.pytest.ini_options]&lt;/span&gt;
&lt;span class="py"&gt;pythonpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;markers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"slow"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Pytest&lt;/em&gt; is now aware of this new marker when launching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--markers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"not slow"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will validate only the tests not marked as slow.&lt;/p&gt;

&lt;p&gt;In the CI, there is nothing to change, as &lt;em&gt;Pytest&lt;/em&gt; launches all the tests by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;The next chapter will focus on test repeatability by improving how Java is used for &lt;em&gt;Spark&lt;/em&gt; at the test level.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;br&gt;
[18/04/26 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-6-improve-the-setup-using-devcontainer-5dj8"&gt;Chapter 6&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>pyspark</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 3 - First Spark test</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 01 Mar 2025 08:01:00 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le</link>
      <guid>https://forem.com/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le</guid>
      <description>&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials and the initial chapters can be found in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chapter 3: Implement a first test with &lt;em&gt;spark&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;This chapter will focus on implementing a first &lt;em&gt;spark&lt;/em&gt; data manipulation with an associated test. It will go through the issues that will be encountered and how to solve them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The data
&lt;/h3&gt;

&lt;p&gt;A dummy use case is used to demonstrate the workflow.&lt;/p&gt;

&lt;p&gt;The scenario is that production data is made of two tables &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt; with the following schema and data types. Here is a sample of the data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Persons
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id: int&lt;/th&gt;
&lt;th&gt;PersonalityName: str&lt;/th&gt;
&lt;th&gt;PersonalitySurname: str&lt;/th&gt;
&lt;th&gt;birth: datetime(str)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;George&lt;/td&gt;
&lt;td&gt;Washington&lt;/td&gt;
&lt;td&gt;1732-02-22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Henry&lt;/td&gt;
&lt;td&gt;Ford&lt;/td&gt;
&lt;td&gt;1863-06-30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Benjamin&lt;/td&gt;
&lt;td&gt;Franklin&lt;/td&gt;
&lt;td&gt;1706-01-17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Martin&lt;/td&gt;
&lt;td&gt;Luther King Jr.&lt;/td&gt;
&lt;td&gt;1929-01-15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Employments
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id: int&lt;/th&gt;
&lt;th&gt;person_fk: int&lt;/th&gt;
&lt;th&gt;Employment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;president&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;industrialist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;inventor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;minister&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal is to rename the columns and to join the two tables. The data here is just a sample, and using &lt;em&gt;spark&lt;/em&gt; to process so few rows is overkill. Yet, in a big data context, you need to foresee that the data will contain many more rows and more complex joins. The sample only serves as a demonstration.&lt;/p&gt;
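
&lt;p&gt;To make the target concrete, here is a minimal plain-Python sketch of the rename-and-join on two rows of the sample, without &lt;em&gt;spark&lt;/em&gt;. The output column names are the ones asserted later in this chapter; the helper &lt;code&gt;rename_and_join&lt;/code&gt; is hypothetical, and the sketch assumes at most one employment per person.&lt;/p&gt;

```python
# Plain-Python illustration of the target transformation (no spark involved).
# `rename_and_join` is a hypothetical helper, not part of the tutorial repository.

persons = [
    {"id": 1, "PersonalityName": "George", "PersonalitySurname": "Washington", "birth": "1732-02-22"},
    {"id": 2, "PersonalityName": "Henry", "PersonalitySurname": "Ford", "birth": "1863-06-30"},
]
employments = [
    {"id": 1, "person_fk": 1, "Employment": "president"},
    {"id": 2, "person_fk": 2, "Employment": "industrialist"},
]


def rename_and_join(persons: list[dict], employments: list[dict]) -> list[dict]:
    # Index employments by the foreign key pointing to persons.id
    # (assumption: one employment per person, a later entry would overwrite).
    employment_by_person = {e["person_fk"]: e["Employment"] for e in employments}
    return [
        {
            "name": p["PersonalityName"],
            "surname": p["PersonalitySurname"],
            "date_of_birth": p["birth"],
            "employment": employment_by_person[p["id"]],
        }
        for p in persons
        if p["id"] in employment_by_person  # inner join: keep matched persons only
    ]


rows = rename_and_join(persons, employments)
for row in rows:
    print(row)
```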

&lt;h3&gt;
  
  
  The dummy test
&lt;/h3&gt;

&lt;p&gt;First, you need to add the &lt;em&gt;pyspark&lt;/em&gt; dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before diving into the implementation, you need to make sure you can reproduce a very simple use case. It's not worth diving into complex data manipulation if you are not able to reproduce a simple documentation snippet.&lt;/p&gt;

&lt;p&gt;You will write your first test &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;. You will first try to use &lt;em&gt;pyspark&lt;/em&gt; to create a simple data frame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part creates or fetches a local &lt;em&gt;spark&lt;/em&gt; session; the second part leverages the session to create a data frame.&lt;/p&gt;

&lt;p&gt;Then you can launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a minimal developer setup, the test will likely fail because &lt;em&gt;pyspark&lt;/em&gt; needs &lt;em&gt;Java&lt;/em&gt;, which you might be missing. In that case, the following error is displayed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAILED tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a bit annoying, because you need to have &lt;em&gt;Java&lt;/em&gt; installed on your &lt;em&gt;dev&lt;/em&gt; setup, on the CI setup and on all your collaborators' setups. In future chapters, a better alternative will be described.&lt;/p&gt;

&lt;p&gt;There are different flavors of &lt;em&gt;Java&lt;/em&gt;; you can simply install the &lt;a href="https://openjdk.org/" rel="noopener noreferrer"&gt;&lt;em&gt;openjdk&lt;/em&gt;&lt;/a&gt; one. It requires elevated privileges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-8-jre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
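
&lt;p&gt;After installing, a quick way to check that a &lt;code&gt;java&lt;/code&gt; executable can be located is sketched below. The helper &lt;code&gt;java_available&lt;/code&gt; is hypothetical; it only looks at &lt;code&gt;PATH&lt;/code&gt; and at &lt;code&gt;JAVA_HOME/bin/java&lt;/code&gt;, which assumes a Unix-like layout, and it does not check the &lt;em&gt;Java&lt;/em&gt; version.&lt;/p&gt;

```python
import os
import shutil


def java_available() -> bool:
    """Return True if a java executable can be located (hypothetical helper)."""
    # A java binary reachable through PATH
    if shutil.which("java") is not None:
        return True
    # Otherwise, look for JAVA_HOME/bin/java (assumes a Unix-like layout)
    java_home = os.environ.get("JAVA_HOME", "")
    return bool(java_home) and os.path.isfile(os.path.join(java_home, "bin", "java"))


print("java found" if java_available() else "java missing")
```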



&lt;p&gt;You can now relaunch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it should display&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+----+                                                                     
|col1|col2|
+----+----+
|   3|   4|
|   1|   2|
+----+----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a small victory, but you can now use a local &lt;em&gt;spark&lt;/em&gt; session to manage data frames, yay !&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test case - version 0
&lt;/h2&gt;

&lt;p&gt;The previous sample shows that the &lt;code&gt;spark session&lt;/code&gt; plays a pivotal role; it will be instantiated differently in the test context than in the production context.&lt;/p&gt;

&lt;p&gt;This means we can leverage a &lt;em&gt;pytest&lt;/em&gt; fixture to be reused for all tests later on; it can be created at the session level so there is only one spark session for the whole test suite. Concretely, you can create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt; to factorize common behavior. If you are not familiar with pytest and fixtures, it's advised to have a look at the &lt;a href="https://docs.pytest.org/en/6.2.x/fixture.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
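
&lt;p&gt;The fixture above yields the session but never stops it. Code placed after the &lt;code&gt;yield&lt;/code&gt; in a fixture runs as teardown, so the session could be stopped there with &lt;code&gt;SparkSession.stop()&lt;/code&gt;. The sketch below illustrates the mechanism in plain Python, driving the generator by hand roughly the way &lt;em&gt;pytest&lt;/em&gt; would; &lt;code&gt;FakeSession&lt;/code&gt; and &lt;code&gt;events&lt;/code&gt; are hypothetical stand-ins so that it runs without &lt;em&gt;Java&lt;/em&gt;.&lt;/p&gt;

```python
from typing import Generator

# Records what happens, in order, so the sequence can be inspected.
events: list[str] = []


class FakeSession:
    """Hypothetical stand-in for SparkSession so the sketch runs without Java."""

    def stop(self) -> None:
        events.append("stopped")


def spark_session() -> Generator[FakeSession, None, None]:
    session = FakeSession()
    events.append("created")
    yield session   # the tests run while the generator is suspended here
    session.stop()  # teardown: runs once the tests using the fixture are done


# Roughly what the test runner does with a yield fixture:
gen = spark_session()
session = next(gen)        # setup, up to the yield
events.append("test ran")  # a test would use `session` here
next(gen, None)            # resume after the yield to run the teardown

print(events)
```

&lt;p&gt;In &lt;code&gt;tests/conftest.py&lt;/code&gt;, the same shape would keep the &lt;code&gt;@pytest.fixture(scope="session")&lt;/code&gt; decorator and build the real session before the &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;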



&lt;p&gt;Then, it can be reused in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can again run &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt; to check the behavior has not changed. It's important in a test driven approach to keep launching the tests after code modification to ensure nothing was broken.&lt;/p&gt;

&lt;p&gt;To be closer to the business context, you can implement a data transformation object. There will be a clear separation between data generation and data transformation. You can do so in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/src/pyspark_tdd/data_processor.py" rel="noopener noreferrer"&gt;&lt;code&gt;src/pyspark_tdd/data_processor.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that there is a prototype for &lt;code&gt;DataProcessor&lt;/code&gt;, the test can be improved to actually assert on the output, like so in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark_tdd.data_processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataProcessor&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example above will ensure that the data frame fits some criteria, but for now it raises a &lt;code&gt;NotImplementedError&lt;/code&gt;, as you still have to implement the actual data processing. This is intended: the processing code can be written once the testing is properly set up.&lt;/p&gt;

&lt;p&gt;The current test is still not ideal, as the test data is built inside the test itself. &lt;em&gt;Pytest&lt;/em&gt; &lt;a href="https://docs.pytest.org/en/stable/how-to/parametrize.html" rel="noopener noreferrer"&gt;parametrization&lt;/a&gt; can be leveraged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
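&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark_tdd.data_processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataProcessor&lt;/span&gt;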

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persons,employments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example shows how test case generation can be separated from the test run. It lets you see at first glance what the test is about, without noise from the test data. Since the test data frames will most likely be reused in other tests, the code can be refactored once more by moving them into fixtures. The test part becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark_tdd.data_processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataProcessor&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two fixtures &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt; are created in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now relaunch &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt; and notice that the &lt;code&gt;NotImplementedError&lt;/code&gt; is still raised, which is a good thing. The test code has changed three times, yet the behavior remains the same, and the test run confirms it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test case - version 1
&lt;/h2&gt;

&lt;p&gt;Now that proper testing is in place, the source code can be implemented. Many variations are possible; the intent here is not to provide the best source code, but the best way to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_date&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persons_rename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employments_rename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;birth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumnsRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colsMap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persons_rename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnsRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colsMap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employments_rename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_fk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you rerun &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt;, then the test is successful.&lt;/p&gt;
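
&lt;p&gt;To make the expected output concrete without starting spark, the rename, join and drop steps can be mirrored on plain Python rows. The sketch below is purely illustrative: &lt;code&gt;run_on_rows&lt;/code&gt; and &lt;code&gt;rename&lt;/code&gt; are hypothetical helpers that are not part of the repository, and the birth date is kept as a string instead of being converted as &lt;code&gt;to_date&lt;/code&gt; does.&lt;/p&gt;

```python
# Hypothetical pure-Python mirror of DataProcessor.run, for illustration only.
PERSONS_RENAME = {
    "PersonalityName": "name",
    "PersonalitySurname": "surname",
    "birth": "date_of_birth",
}
EMPLOYMENTS_RENAME = {"Employment": "employment"}


def rename(row, mapping):
    # Rename the keys listed in mapping and keep the others untouched.
    return {mapping.get(key, key): value for key, value in row.items()}


def run_on_rows(persons, employments):
    persons = [rename(row, PERSONS_RENAME) for row in persons]
    employments = [rename(row, EMPLOYMENTS_RENAME) for row in employments]
    # Index employments by foreign key (assumes one employment per person).
    by_person = {row["person_fk"]: row for row in employments}
    joined = []
    for person in persons:
        # Left join: a person without employment gets None.
        employment = by_person.get(person["id"], {})
        merged = {**person, "employment": employment.get("employment")}
        # Drop the technical key, as the spark version does.
        merged.pop("id")
        joined.append(merged)
    return joined


persons = [
    {"id": 1, "PersonalityName": "George", "PersonalitySurname": "Washington", "birth": "1732-02-22"},
    {"id": 2, "PersonalityName": "Henry", "PersonalitySurname": "Ford", "birth": "1863-06-30"},
]
employments = [{"id": 1, "person_fk": 1, "Employment": "president"}]

rows = run_on_rows(persons, employments)
```

&lt;p&gt;Each resulting row only holds the keys &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;surname&lt;/code&gt;, &lt;code&gt;date_of_birth&lt;/code&gt; and &lt;code&gt;employment&lt;/code&gt;, which is the set of columns the test asserts on.&lt;/p&gt;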

&lt;h3&gt;
  
  
  What about ci?
&lt;/h3&gt;

&lt;p&gt;A strong dependency on &lt;em&gt;Java&lt;/em&gt; is now in place: the tests can only run in ci if the ci runner has &lt;em&gt;Java&lt;/em&gt; installed. This is an issue because it requires the developer to maintain part of the &lt;em&gt;dev&lt;/em&gt; setup outside of the python ecosystem, which means extra steps for anyone who wants to launch the tests.&lt;/p&gt;

&lt;p&gt;Keep in mind that there is limited control over the developer setup. What if the &lt;em&gt;Java&lt;/em&gt; version already installed on the developer's machine is not compatible with spark? The developer will then have to investigate, and most likely install another &lt;em&gt;Java&lt;/em&gt; version, which might impact other projects. This can quickly become a mess.&lt;/p&gt;

&lt;p&gt;Luckily, the &lt;em&gt;Github&lt;/em&gt; ci runner comes with &lt;em&gt;Java&lt;/em&gt; preinstalled, so the ci should run.&lt;/p&gt;
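
&lt;p&gt;One way to make this dependency explicit is to check for &lt;em&gt;Java&lt;/em&gt; before the tests start. The sketch below is an assumption, not something the repository does: &lt;code&gt;java_available&lt;/code&gt; is a hypothetical helper that only looks for a &lt;code&gt;java&lt;/code&gt; executable on the &lt;code&gt;PATH&lt;/code&gt; and does not check that its version is compatible with spark.&lt;/p&gt;

```python
import shutil


def java_available(executable="java"):
    # shutil.which returns None when the executable is not on the PATH.
    return shutil.which(executable) is not None


if not java_available():
    print("Java was not found on the PATH, the spark tests will most likely fail.")
```

&lt;p&gt;Such a check could, for instance, be called from the &lt;code&gt;spark_session&lt;/code&gt; fixture to skip the spark tests with &lt;code&gt;pytest.skip&lt;/code&gt; instead of letting them fail with a less readable error.&lt;/p&gt;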

&lt;h3&gt;
  
  
  Clean up
&lt;/h3&gt;

&lt;p&gt;You can now also clean up the repository to start from a clean slate. For instance, &lt;code&gt;src/pyspark_tdd/multiply.py&lt;/code&gt; and &lt;code&gt;tests/test_dummy.py&lt;/code&gt; can be removed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;You now have a comfortable setup to modify and tweak the code. You can run the tests and be confident that the results are reproducible.&lt;/p&gt;

&lt;p&gt;In the next chapter, a more data driven approach to test case generation will be explored.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>spark</category>
      <category>python</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 2 - CI</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sun, 23 Feb 2025 10:21:54 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28</link>
      <guid>https://forem.com/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28</guid>
      <description>&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials and the initial chapters can be found in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chapter 2: Continuous Integration (ci)
&lt;/h2&gt;

&lt;p&gt;Having a ci is mandatory for any project that aims at having multiple contributors. In this chapter, a proposed ci will be implemented.&lt;/p&gt;

&lt;p&gt;A ci implementation is specific to the collaborative platform, be it &lt;em&gt;Github&lt;/em&gt;, &lt;em&gt;Gitlab&lt;/em&gt;, &lt;em&gt;Bitbucket&lt;/em&gt;, &lt;em&gt;Azure Devops&lt;/em&gt;, etc. This chapter will try to keep the ci as technology agnostic as possible.&lt;/p&gt;

&lt;p&gt;Similar concepts are available on all ci platforms, so you will have to transpose the ones used here to yours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content of the ci
&lt;/h3&gt;

&lt;p&gt;The ci here is very minimal but showcases the concepts that you implemented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/tutorials/chapter_1_setup.md" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python setup&lt;/li&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Code Formatting&lt;/li&gt;
&lt;li&gt;Test automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many more additions to the continuous integration are possible, but they will not be tackled here. A minimal ci is required to guarantee non-regression in terms of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code styling rules, so that no individual contributor diverges from the coding style&lt;/li&gt;
&lt;li&gt;tests, namely all tests must be passing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Github&lt;/em&gt; provides extensive &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for you to tweak your ci.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Github&lt;/em&gt; expects ci files at a specific location, so you can create the file &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/.github/workflows/ci.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;.github/workflows/ci.yaml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this file, you can add&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;run-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Continuous-Integration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;run-name&lt;/code&gt; define the names of the pipeline that will run.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;on&lt;/code&gt; defines the event that will trigger the pipeline to run, &lt;code&gt;push&lt;/code&gt; means that the pipeline will run for every push to the repository.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;jobs&lt;/code&gt; defines the jobs of the pipeline; for the sake of simplicity, the ci is made of one job with multiple steps.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;runs-on&lt;/code&gt; defines the runner, namely the environment the job runs in; it is picked from the list of &lt;a href="https://github.com/actions/runner-images" rel="noopener noreferrer"&gt;runner images&lt;/a&gt; maintained by &lt;em&gt;Github&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, in the steps section of the job, we can add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Formatting&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run ruff check&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Tests&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;actions/checkout@v4&lt;/code&gt; is the &lt;em&gt;Github&lt;/em&gt; action that checks out the current branch of the repository.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;jdx/mise-action@v2&lt;/code&gt; is the &lt;em&gt;Github&lt;/em&gt; action that will read the &lt;code&gt;mise.toml&lt;/code&gt; and install everything for us.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Run Formatting&lt;/code&gt; step will install the dependencies and run the formatting. If there is an error, the command will fail and so will the pipeline.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Run Tests&lt;/code&gt; step will run the tests. If there is an error, the command will fail and so will the pipeline.&lt;/li&gt;
&lt;/ul&gt;
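
&lt;p&gt;The fail-fast behaviour of these steps can be reproduced locally with a few lines of Python. This is a minimal sketch under the assumption that &lt;code&gt;uv&lt;/code&gt; is installed and that the commands are launched from the repository root; &lt;code&gt;run_steps&lt;/code&gt; and &lt;code&gt;CI_STEPS&lt;/code&gt; are hypothetical names that are not part of the repository.&lt;/p&gt;

```python
import subprocess

# Same commands as the workflow steps above.
CI_STEPS = [
    ("Run Formatting", ["uv", "run", "ruff", "check"]),
    ("Run Tests", ["uv", "run", "pytest"]),
]


def run_steps(steps):
    # Run each step in order and stop at the first failing command.
    for name, command in steps:
        print(f"--- {name}")
        completed = subprocess.run(command)
        if completed.returncode != 0:
            # Return the exit code of the failing step, later steps are skipped.
            return completed.returncode
    return 0


# Usage, from the repository root:
# exit_code = run_steps(CI_STEPS)
```

&lt;p&gt;Returning the exit code of the first failing command mirrors what the pipeline does: a formatting error stops the run before the tests are launched.&lt;/p&gt;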

&lt;h3&gt;
  
  
  Ci as documentation
&lt;/h3&gt;

&lt;p&gt;As it was stated, the ci is the only source of truth. If it passes on ci, it should pass on your local setup. If not, it means there are discrepancies between the ci setup and yours.&lt;/p&gt;

&lt;p&gt;Going through the ci implementation will help you with reproducibility. Maybe you are not installing the python version the same way, or not using the same dependency management tool. You need to align your tools with the ci, and the ones presented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/tutorials/chapter_1_setup.md" rel="noopener noreferrer"&gt;chapter 1&lt;/a&gt; help avoid conflicts with your local setup. You might have installed python packages globally, or manually changed &lt;code&gt;PYTHON_HOME&lt;/code&gt; or your &lt;code&gt;PATH&lt;/code&gt;, and this can easily become a mess.&lt;/p&gt;

&lt;p&gt;To help with reproducibility, a &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" rel="noopener noreferrer"&gt;dev container&lt;/a&gt; approach can be used. It means that the ci runs inside a container, and that this container can be reused as a developer environment. This will not be implemented for the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  A better ci structure
&lt;/h3&gt;

&lt;p&gt;To improve readability and separate code formatting from testing, &lt;em&gt;GitHub&lt;/em&gt; actions can be implemented as jobs with interdependencies. The workflow then becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;run-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Formatting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Formatting&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;uv run ruff check&lt;/span&gt;
  &lt;span class="na"&gt;Tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Formatting&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;uv run pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we added &lt;code&gt;needs: [Formatting]&lt;/code&gt; to create a dependency between the ci jobs. It means the tests will not run until the code style is compliant, which saves time and resources. Indeed, if the code is not formatted, don't even bother running the tests. The execution graph will look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F946pqc92i72xoltqjdw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F946pqc92i72xoltqjdw0.png" alt="Ci Execution graph" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see some duplication here, which is not ideal: any future improvement will have to be made in two places at the same time. This is technical debt that one could tackle using a &lt;a href="https://docs.github.com/en/actions/sharing-automations/creating-actions/creating-a-composite-action" rel="noopener noreferrer"&gt;composite action&lt;/a&gt;. We will consider it ok for now.&lt;/p&gt;
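&lt;p&gt;As a rough illustration of what &lt;code&gt;needs&lt;/code&gt; expresses, the job graph can be modelled as a mapping from each job to the jobs it waits for. This is only a sketch and not how the ci schedules jobs internally; the &lt;code&gt;jobs&lt;/code&gt; dictionary and the &lt;code&gt;execution_order&lt;/code&gt; helper are hypothetical names, and only the standard library &lt;code&gt;graphlib&lt;/code&gt; module is used.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical mirror of the workflow: job name, then the jobs it needs.
jobs = {"Formatting": [], "Tests": ["Formatting"]}


def execution_order(needs):
    """Return the job names in an order that respects every `needs` entry."""
    return list(TopologicalSorter(needs).static_order())


order = execution_order(jobs)
print(order)
```

&lt;p&gt;A job only appears after all the jobs it depends on, which matches the idea that the tests wait for the formatting job.&lt;/p&gt;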

&lt;h3&gt;
  
  
  Caching dependency resolution
&lt;/h3&gt;

&lt;p&gt;You will see additional steps in the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/.github/workflows/ci.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;ci.yaml&lt;/code&gt;&lt;/a&gt;, namely related to cache&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore uv cache&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/.uv-cache&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}&lt;/span&gt;
          &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}&lt;/span&gt;
            &lt;span class="s"&gt;uv-${{ runner.os }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These steps cache the uv cache directory (&lt;code&gt;/tmp/.uv-cache&lt;/code&gt;) and reuse it as long as there are no changes to the &lt;code&gt;uv.lock&lt;/code&gt;. The intent is to speed up the ci execution, as dependency resolution and installation can be time consuming.&lt;/p&gt;
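&lt;p&gt;The idea of a cache key derived from the content of &lt;code&gt;uv.lock&lt;/code&gt; can be sketched in python. This is an illustration under assumptions: &lt;code&gt;cache_key&lt;/code&gt;, &lt;code&gt;fallback_matches&lt;/code&gt; and the sample lock contents are hypothetical, and the hashing is not claimed to reproduce &lt;code&gt;hashFiles&lt;/code&gt; exactly.&lt;/p&gt;

```python
import hashlib


def cache_key(runner_os: str, lock_content: bytes) -> str:
    """Build a key that changes whenever the lock file content changes."""
    digest = hashlib.sha256(lock_content).hexdigest()
    return f"uv-{runner_os}-{digest}"


def fallback_matches(restore_prefix: str, stored_key: str) -> bool:
    """A restore key acts as a prefix of the keys it is allowed to match."""
    return stored_key.startswith(restore_prefix)


# Hypothetical lock file contents before and after a dependency bump.
old_key = cache_key("Linux", b"ruff==0.8.4\n")
new_key = cache_key("Linux", b"ruff==0.9.0\n")

# The exact key differs once the lock file changes...
print(old_key != new_key)
# ...but the shorter restore key can still find the older cache entry.
print(fallback_matches("uv-Linux", old_key))
```

&lt;p&gt;This is why the shorter &lt;code&gt;uv-${{ runner.os }}&lt;/code&gt; entry is useful: a slightly outdated cache is usually still faster than starting from nothing.&lt;/p&gt;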

&lt;p&gt;An extra step to minimize the cache size is added, as &lt;em&gt;uv&lt;/em&gt; offers such a feature, together with an environment variable that configures the location of the cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Minimize uv cache&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv cache prune --ci&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;UV_CACHE_DIR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/.uv-cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;In the next chapter, you will implement your first spark code and a way to guarantee its test automation. This is long overdue as we spent 3 chapters on setup...&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. This repository exposes what's the expected repository layout at the end of each chapter in each branch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[03/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt; has been released&lt;br&gt;
[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>ci</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 0 and 1 - Modern Python Setup</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 15 Feb 2025 09:24:09 +0000</pubDate>
      <link>https://forem.com/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8</link>
      <guid>https://forem.com/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8</guid>
      <description>&lt;h2&gt;
  
  
  Chapter 0: Why this tutorial
&lt;/h2&gt;

&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;Before diving deep into spark, we must first align on our setup environment to ease reproducibility; this will be the focus of this article.&lt;/p&gt;

&lt;p&gt;The official &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/testing_pyspark.html#Putting-It-All-Together!" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; describes how to create tests with pyspark.&lt;/p&gt;

&lt;p&gt;It requires a spark server with spark connect support to work, as described in the &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a reminder, this is how spark connect works:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpe4jivn6hzz01od4wxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpe4jivn6hzz01od4wxq.png" alt="spark connect" width="800" height="882"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Namely, a specific server needs to be created so your tests can connect to this server and process the data as intended.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why is it not enough?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Launching the server requires some extra requirements on your machine, namely a java virtual machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Launching the server requires a specific script called &lt;code&gt;start-connect-server.sh&lt;/code&gt;, which you first have to locate.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some data engineers might argue they can just use a spark server already deployed to be able to test; but there are several drawbacks to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are being charged to launch simple tests or run experiments, keeping cloud providers very happy.&lt;/li&gt;
&lt;li&gt;You slow down the &lt;strong&gt;developer feedback loop&lt;/strong&gt;, which is the time necessary to implement a feature and validate that no regression has been introduced. A developer is more confident that there is no regression when all tests are executed.&lt;/li&gt;
&lt;li&gt;You create &lt;strong&gt;external dependencies&lt;/strong&gt; that you have no control over. You might encounter issues with testing when the cloud provider is down, when you don't have internet access or when someone changes the configuration of the server by accident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to have a test environment that is self descriptive, quick to setup, quick to start and reliable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Chapter 1: Setup
&lt;/h2&gt;

&lt;p&gt;In this chapter, multiple tools will be introduced and set up. The intent is to have a clean python environment to reproduce the code. This is a very opinionated section, but it might be useful to challenge your existing tools with it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Python version management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;&lt;em&gt;Mise&lt;/em&gt;&lt;/a&gt; will be leveraged to handle python versions. It claims to be &lt;em&gt;The front-end to your dev env&lt;/em&gt; and it will be used to install specific versions of languages and tools.&lt;/p&gt;

&lt;p&gt;It can be used for much more, and it is strongly advised to look at the &lt;a href="https://mise.jdx.dev/getting-started.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to understand the true power of this tool, which is not limited to python development.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mise&lt;/em&gt; first needs to be installed, see &lt;a href="https://mise.jdx.dev/getting-started.html#_1-install-mise-cli" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for further instructions. You can launch the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://mise.run | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, you will have to customize your &lt;code&gt;.bashrc&lt;/code&gt; or your &lt;code&gt;.zshrc&lt;/code&gt; (or the equivalent for another shell) to activate &lt;em&gt;mise&lt;/em&gt; in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'eval "$(~/.local/bin/mise activate bash)"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Mise&lt;/em&gt; can now be used to install python at a specific version with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise &lt;span class="nb"&gt;install &lt;/span&gt;python@3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will download a pre-compiled version of python and make it available globally.&lt;/p&gt;

&lt;p&gt;Let's now use it, you first need to position yourself at the root of the project and launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise use python@3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml" rel="noopener noreferrer"&gt;mise.toml&lt;/a&gt; file with the following section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tools]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"3.12"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.python-version" rel="noopener noreferrer"&gt;.python-version&lt;/a&gt; with the indication&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the help of these files, &lt;em&gt;mise&lt;/em&gt; will be able to activate when located at the root of your project. It's also a great way to inform other contributors of the requirements to launch this project without relying on a README that easily becomes outdated.&lt;/p&gt;
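&lt;p&gt;To check that the interpreter you are running agrees with the pinned version, a small comparison can help. This is a minimal sketch and not something &lt;em&gt;mise&lt;/em&gt; requires: &lt;code&gt;matches_pin&lt;/code&gt; is a hypothetical helper, and the pinned value is hard-coded here to mirror the &lt;code&gt;.python-version&lt;/code&gt; content instead of being read from disk.&lt;/p&gt;

```python
import sys


def matches_pin(pinned: str, running: str) -> bool:
    """True when `running` starts with every component of `pinned`."""
    pinned_parts = pinned.strip().split(".")
    return running.split(".")[: len(pinned_parts)] == pinned_parts


# Hard-coded copy of the .python-version content (assumption for the sketch).
pinned_version = "3.12\n"
running_version = ".".join(str(part) for part in sys.version_info[:3])

print(f"pinned={pinned_version.strip()} running={running_version}")
print("ok" if matches_pin(pinned_version, running_version) else "mismatch")
```

&lt;p&gt;Comparing component by component avoids the trap of a plain string prefix check, where a pin such as 3.1 would wrongly accept 3.12.&lt;/p&gt;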

&lt;h3&gt;
  
  
  Python dependency management
&lt;/h3&gt;

&lt;p&gt;A tool to help us add, remove and download dependencies is necessary. &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;Uv&lt;/a&gt; will be used later on as it's very fast and easy to use.&lt;/p&gt;

&lt;p&gt;To install it, you can follow the official &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;; but in this tutorial &lt;em&gt;mise&lt;/em&gt; will be leveraged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise use uv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will both install and set up uv for the project. See how &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml" rel="noopener noreferrer"&gt;mise.toml&lt;/a&gt; has been modified with the addition of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tools]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"3.12"&lt;/span&gt;
&lt;span class="py"&gt;uv&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it can be used to initialize the project, namely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a folder structure for you and a &lt;code&gt;hello.py&lt;/code&gt;. In this project, we have customized it a bit to add a &lt;code&gt;tests&lt;/code&gt; folder and a pyspark_tdd package under &lt;code&gt;src&lt;/code&gt;, so it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .mise.toml
└── pyproject.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ignoring files
&lt;/h3&gt;

&lt;p&gt;Every repository needs a set of files to ignore before adding them to a commit. This is done via a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.gitignore" rel="noopener noreferrer"&gt;.gitignore&lt;/a&gt; file and anyone can leverage existing templates for your language of preference.&lt;/p&gt;

&lt;p&gt;If you start a project from scratch, you will first need to set up git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;GitHub&lt;/em&gt; maintains ignore file &lt;a href="https://github.com/github/gitignore/tree/main" rel="noopener noreferrer"&gt;templates&lt;/a&gt; for each language. You can leverage them with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; .gitignore https://raw.githubusercontent.com/github/gitignore/refs/heads/main/Python.gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this project, the python template is the one used for the gitignore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding formatting and linting
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;

&lt;p&gt;Linters and formatters are powerful tools to enforce code writing rules among developers. They take away the pain of having to care how the code is written at the syntax level.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ruff&lt;/em&gt; will be leveraged to format our python code as it's very powerful and can be run at file saves without latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ruff&lt;/em&gt; will be added as a project &lt;em&gt;dev&lt;/em&gt; dependency. A &lt;em&gt;dev&lt;/em&gt; dependency is one that the project does not need to run, it can be related to tests, experimentation, formatting etc. Everything that is not meant to be shipped to production must be retained as a &lt;em&gt;dev&lt;/em&gt; dependency to keep your python package as self contained as possible.&lt;/p&gt;

&lt;p&gt;We can add &lt;em&gt;ruff&lt;/em&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add ruff &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will add a &lt;em&gt;dev&lt;/em&gt; dependency in the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml&lt;/a&gt; with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependency-groups]&lt;/span&gt;
&lt;span class="py"&gt;dev&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"ruff&amp;gt;=0.8.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will also create a &lt;code&gt;.venv&lt;/code&gt; at the current working directory. You might notice that the &lt;code&gt;.venv&lt;/code&gt; is ignored from git which is intended. Indeed, you don't want to commit your &lt;code&gt;.venv&lt;/code&gt; directory as it's a copy of the dependencies of your project and can be quite extensive.&lt;/p&gt;

&lt;p&gt;It will also create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/uv.lock" rel="noopener noreferrer"&gt;uv.lock&lt;/a&gt; that records the versions of your direct dependencies and of the indirect ones (the dependencies of your dependencies). This mechanism allows you to segregate the dependencies of your project from the rest.&lt;/p&gt;

&lt;p&gt;Your project should now look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── .venv
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .gitignore
├── .mise.toml
├── pyproject.toml
└── uv.lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Other languages
&lt;/h4&gt;

&lt;p&gt;As a project is not just python files, but also configuration, pipelines, documentation etc., formatting these files is necessary too.&lt;/p&gt;

&lt;p&gt;Documenting how these files will be formatted is done using &lt;a href="https://editorconfig.org/#overview" rel="noopener noreferrer"&gt;editorconfig&lt;/a&gt;.&lt;br&gt;
We will use the one from the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.editorconfig" rel="noopener noreferrer"&gt;editorconfig&lt;/a&gt; website.&lt;/p&gt;
&lt;h4&gt;
  
  
  Your Integrated Development Environment (IDE)
&lt;/h4&gt;

&lt;p&gt;Whichever &lt;em&gt;IDE&lt;/em&gt; you use, it's very important that you set up formatting at file saves to save you time and remove the pain of handling it by hand.&lt;/p&gt;

&lt;p&gt;If you are using VSCode, you can install the &lt;a href="https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff" rel="noopener noreferrer"&gt;ruff&lt;/a&gt; extension and add the following to your &lt;em&gt;settings.json&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"editor.formatOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"[python]"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.formatOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.codeActionsOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"source.fixAll"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explicit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"source.organizeImports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explicit"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.defaultFormatter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"charliermarsh.ruff"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The first test
&lt;/h3&gt;

&lt;p&gt;To see if everything works as expected, you will write a very simple unit test. In a test driven approach, the test is written before the source code.&lt;/p&gt;

&lt;p&gt;A test framework is required to launch the test automation, &lt;a href="https://docs.pytest.org/en/stable/" rel="noopener noreferrer"&gt;&lt;em&gt;pytest&lt;/em&gt;&lt;/a&gt; will be used. You need to add it as a &lt;em&gt;dev&lt;/em&gt; dependency&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pytest &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/tests/test_dummy.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/test_dummy.py&lt;/code&gt;&lt;/a&gt; with the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;your_python_package.multiply&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_my_dummy_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires a function &lt;code&gt;multiply&lt;/code&gt; that can be defined in &lt;code&gt;src/your_python_package/multiply.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
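&lt;p&gt;Once the first test exists, a few more cases can pin down the behaviour of &lt;code&gt;multiply&lt;/code&gt;. The test names below are hypothetical and only a sketch; the function is copied into the block so that it runs on its own, whereas in the project it would be imported from &lt;code&gt;your_python_package.multiply&lt;/code&gt;.&lt;/p&gt;

```python
def multiply(a: int, b: int) -> int:
    # Copy of the function under test, so that this sketch is self-contained.
    return a * b


def test_multiply_by_zero():
    assert multiply(0, 5) == 0


def test_multiply_negative():
    assert multiply(-2, 3) == -6


def test_multiply_is_commutative():
    assert multiply(3, 4) == multiply(4, 3)
```

&lt;p&gt;These are plain functions with &lt;code&gt;assert&lt;/code&gt; statements, which is all &lt;em&gt;pytest&lt;/em&gt; needs to collect and run them.&lt;/p&gt;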



&lt;p&gt;You can now run the tests. First make sure you're using the right python from the &lt;code&gt;.venv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should display something like &lt;code&gt;$HOME/somepath/your_project/.venv/bin/python&lt;/code&gt;. If not, you can open a new terminal; &lt;em&gt;mise&lt;/em&gt; should then be able to resolve it.&lt;/p&gt;

&lt;p&gt;Then run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it will display an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tests/test_dummy.py:1: in &amp;lt;module&amp;gt;
    from your_python_package.multiply import multiply
E   ModuleNotFoundError: No module named 'your_python_package'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need to add an extra entry for &lt;em&gt;pytest&lt;/em&gt; to detect the &lt;code&gt;src&lt;/code&gt; layout. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml&lt;/a&gt;, you can add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.pytest.ini_options]&lt;/span&gt;
&lt;span class="py"&gt;pythonpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
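&lt;p&gt;The effect of the &lt;code&gt;pythonpath&lt;/code&gt; entry can be approximated by hand: the listed directory is added to the python search path, so packages under &lt;code&gt;src&lt;/code&gt; become importable. The sketch below is an illustration and not &lt;em&gt;pytest&lt;/em&gt; internals; &lt;code&gt;import_from_src&lt;/code&gt; and the throwaway &lt;code&gt;demo_src_layout_pkg&lt;/code&gt; package are hypothetical names created in a temporary directory.&lt;/p&gt;

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_src(project_root, module_name):
    """Import `module_name` after putting `project_root/src` on sys.path."""
    src = str(Path(project_root) / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


with tempfile.TemporaryDirectory() as root:
    # Build a throwaway project that follows the src layout.
    package = Path(root) / "src" / "demo_src_layout_pkg"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "multiply.py").write_text("def multiply(a, b):\n    return a * b\n")

    module = import_from_src(root, "demo_src_layout_pkg.multiply")
    result = module.multiply(1, 2)

print(result)
```

&lt;p&gt;Without the &lt;code&gt;sys.path&lt;/code&gt; entry, the import fails with the same kind of &lt;code&gt;ModuleNotFoundError&lt;/code&gt; as the one shown above.&lt;/p&gt;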



&lt;p&gt;Now&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;should display&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================= test session starts ===============================================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/somepath/src/your_project
configfile: pyproject.toml
collected 1 item                                                                                                                                 

tests/test_dummy.py .                                                                                                                      [100%]

=============================================================== 1 passed in 0.01s ================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do some housekeeping and remove the unnecessary &lt;code&gt;src/your_python_package/hello.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You now have a proper setup to start working.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;Now that one test is implemented, the continuous integration (ci) must be set up. In a collaborative way of working, the ci is the only source of truth to tell whether something is broken or not.&lt;/p&gt;

&lt;p&gt;Notice we still have not touched upon any spark components; it's very important to have a clean, reproducible codebase before diving in.&lt;/p&gt;

&lt;p&gt;That will be the topic of the next chapter.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. This repository exposes what's the expected repository layout at the end of each chapter in each branch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[23/02/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt; has been released&lt;br&gt;
[03/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt; has been released&lt;br&gt;
[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>ruff</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
