Forem: Nicoda-27

How to be Test Driven with Spark: Chapter 6: Improve the setup using devcontainer

Nicoda-27 — Fri, 17 Apr 2026 21:29:29 +0000

The goal of this tutorial is to provide a way to easily be test driven with Spark on your local setup without using cloud resources.

This is a series of tutorials and the initial chapters can be found in:

Chapter 6 focuses on improving the developer setup for better reusability and reproducibility, and on reusing the same approach in the CI setup.

In chapter 2, we mentioned devcontainers as a way to make the development environment explicit.

A development container (devcontainer) describes the developer environment as an OCI image (often built with a Dockerfile). The usual runtime is Docker, but tools such as Podman are compatible with the same workflow. For simplicity, this chapter assumes Docker is installed on your machine.

The full specification lives in the Dev Container Specification on containers.dev. What follows is only a small subset of what devcontainers can express.

The devcontainer specification

The repository uses a .devcontainer directory to hold the image definition. The Dockerfile is the main build recipe; we walk through it below.

The first line selects the Dockerfile syntax version. The base image is Debian (debian:trixie-slim); you can swap it for another image if you need a smaller footprint or a different distribution.

# syntax=docker/dockerfile:1.4
FROM debian:trixie-slim AS build

The optional FORCE_REBUILD argument is a cache-busting knob: changing its default value invalidates Docker’s layer cache for everything that follows, which is useful when you want a full rebuild without editing other lines.

ARG FORCE_REBUILD=20260417

As in chapter 1, mise drives tool versions. The mise.toml file is copied from the build context into the image so mise install can install uv (and anything else declared there).

Extra environment variables pin where mise and uv install binaries and Python:

COPY mise.toml /mise.toml

ENV UV_TOOL_BIN_DIR=/usr/local/bin \
    UV_TOOL_DIR=/opt/uv/venv \
    UV_PYTHON_INSTALL_DIR=/opt/uv/python \
    MISE_DATA_DIR=/opt/mise

ENV PATH="$MISE_DATA_DIR/shims:$PATH"

System packages and tooling (for example git, zip, and the Docker CLI) are installed in devcontainer-setup.sh, which is copied in and executed next:

COPY devcontainer-setup.sh /devcontainer-setup.sh
RUN /devcontainer-setup.sh

WORKDIR /code

FROM build AS devcontainer

The final stage devcontainer matches the target in devcontainer.json, which also selects the Dockerfile, platform, and IDE extensions (here, the Python extension for VS Code).

Build the image from the repository root:

docker build -f .devcontainer/Dockerfile --target devcontainer .devcontainer

Using the devcontainer in your IDE

Modern editors can open the project inside the container using a devcontainer extension—for VS Code, see Developing inside a Container.

That gives newcomers a reproducible environment: the extension detects .devcontainer/, builds (or pulls) the image using devcontainer.json, and starts a shell where tools from the image are already on PATH. Much of what chapter 1 described as manual setup becomes versioned files in the repo, which you can test in CI so they stay accurate.

Using the devcontainer in CI

Reusing the same image in continuous integration avoids depending on whatever happens to be preinstalled on the GitHub-hosted runner: the maintainer owns the image, so runner image updates do not silently change your pipeline. That improves reproducibility.

The workflow .github/workflows/ci.yaml implements this pattern.

Image tag

A tag is derived from a hash of every file under .devcontainer/, so the image only changes when that folder’s content changes (see hashFiles). The tag is written to $GITHUB_OUTPUT (so later jobs can use needs.build-and-push.outputs.tag) and to $GITHUB_ENV as DEVCONTAINER_TAG.

      - name: Compute devcontainer image tag
        id: devcontainer_tag
        run: |
          TAG="devcontainer-${{ hashFiles('.devcontainer/**') }}"
          echo "tag=${TAG}" >> "$GITHUB_OUTPUT"
          echo "DEVCONTAINER_TAG=${TAG}" >> "$GITHUB_ENV"

The build-and-push job exposes that tag to other jobs with outputs.tag: ${{ steps.devcontainer_tag.outputs.tag }}.
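To make the idea of a content-derived tag concrete, here is a minimal Python sketch. It is not the algorithm GitHub's hashFiles uses; the function name content_tag and the hashing scheme are assumptions made for illustration. It only shows the property the workflow relies on: the tag stays the same until a file under the folder changes.

```python
import hashlib
from pathlib import Path


def content_tag(folder: str, prefix: str = "devcontainer-") -> str:
    """Return a tag that changes only when a file path or file content under folder changes."""
    digest = hashlib.sha256()
    root = Path(folder)
    # Sort the paths so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return prefix + digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical usage from the repository root.
    print(content_tag(".devcontainer"))
```

Running it twice on an unchanged folder gives the same tag, which is what lets the CI job skip the build when the image already exists in the registry.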

Login, pull cache, then build if missing

The job logs in to Docker Hub, then tries to pull the image. If that tag already exists in the registry, the build is skipped; otherwise Buildx builds and pushes.

Configure a repository variable DOCKERHUB_REPOSITORY (for example youruser/spark-tdd-devcontainer) and secrets DOCKERHUB_USERNAME and DOCKERHUB_TOKEN. The container.image field cannot use the secrets context for the image name, which is why the repository name lives in vars.

      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Pull devcontainer image if already published
        id: pull
        continue-on-error: true
        env:
          REPO: ${{ vars.DOCKERHUB_REPOSITORY }}
        run: docker pull "${REPO}:${DEVCONTAINER_TAG}"

      - name: Set up Docker Buildx
        if: steps.pull.outcome != 'success'
        uses: docker/setup-buildx-action@v3

      - name: Build and push devcontainer image
        if: steps.pull.outcome != 'success'
        uses: docker/build-push-action@v6
        with:
          context: .devcontainer
          file: .devcontainer/Dockerfile
          target: devcontainer
          push: true
          tags: ${{ vars.DOCKERHUB_REPOSITORY }}:${{ env.DEVCONTAINER_TAG }}

Downstream jobs

Formatting and tests run inside that image via jobs.<job_id>.container, using the tag exported by the build-and-push job output (still driven by the same devcontainer_tag step):

  Formatting:
    runs-on: ubuntu-latest
    needs: [build-and-push]
    container:
      image: ${{ vars.DOCKERHUB_REPOSITORY }}:${{ needs.build-and-push.outputs.tag }}
      credentials:
        username: ${{ secrets.DOCKERHUB_USERNAME }}
        password: ${{ secrets.DOCKERHUB_TOKEN }}

The test job also mounts the host Docker socket so Testcontainers can start sibling containers (for example Spark) from within the job container.

Conclusion

We have now documented the developer setup as code and it's tested. It's a great step toward "code as documentation".

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, with one branch per chapter.

What's next

Several ideas come to mind on how to improve our very small codebase:

Rework the Spark container to prebuild the Docker image, as startup can be quite slow when extra packages such as deltalake or dremio are necessary
Templatize the repository for easier reuse with the help of ffizer
Explore ibis to handle multiple transformation backends transparently

How to for developers: Mastering your corporate MacBook Setup

Nicoda-27 — Sat, 17 May 2025 09:51:05 +0000

Starting with a fresh MacBook can be exciting, but navigating corporate IT requirements can feel daunting. This article demystifies the process with a step-by-step guide so you can set up your machine smoothly and in line with company policy, and stay productive from day one.

This article focuses on a Python developer persona, but most of it applies to any developer.

A corporate MacBook

Depending on your company, the MacBook provided as a developer workstation can be harder to work with than a privately owned MacBook.

Namely:

You will not have sudo on your workstation.
A proxy may be required by company policy.

Not having sudo is the main constraint; we will see how to work within it and still comply with policy—that means we will not bypass controls, but we will use what macOS allows.

Homebrew

Homebrew is the go-to package manager for developers on macOS. It plays a similar role to apt on Ubuntu.

The official documentation expects sudo for the default install. You can also install it for your user only with:

mkdir -p $HOME/homebrew && curl -L https://github.com/Homebrew/brew/tarball/master | tar xz --strip-components 1 -C $HOME/homebrew

This creates a homebrew directory under $HOME where Homebrew is installed.

Add it to your PATH permanently. In the rest of this article, zsh is assumed as your shell (but you can use any shell you prefer).

echo 'export PATH=$HOME/homebrew/bin:$PATH' >> ~/.zshrc

Check that Homebrew works

The following command applies the PATH change in your current terminal:

source ~/.zshrc

The next command should print a version, which confirms Homebrew is installed and usable:

brew --version

CLI tools

Install a first tool:

brew install git

You can install other developer tools available as Homebrew formulae; they are installed for your user without sudo.

GUI applications

GUI apps need an Applications folder. By default, Homebrew Cask targets /Applications at the root of the disk, which you may not be allowed to write to.

For example, try OpenLens—a tool that provides a UI to inspect your Kubernetes cluster. It is available as a Homebrew cask.

If you run:

brew install --cask openlens

that install can fail because Homebrew cannot use /Applications.

Instead, create an Applications folder under your home directory:

mkdir $HOME/Applications

Then tell Homebrew to use it:

brew install --cask openlens --appdir $HOME/Applications

You should be able to open OpenLens from Spotlight. You can install GUI apps this way without sudo.

Rosetta

Sometimes you need a specific CPU architecture for a binary (either Apple silicon or Intel).

Apple silicon (M1, M2, M3, M4) Macs use arm64 natively. Rosetta lets you run x86_64 binaries. That helps when libraries exist only for one architecture; x86_64 is older, so binaries are often available there first.

You can add aliases to start a shell under Rosetta or natively. Add these to your .zshrc:

alias arm="env /usr/bin/arch -arm64 /bin/zsh --login"
alias intel="env /usr/bin/arch -x86_64 /bin/zsh --login"

To see which architecture your current shell is using, run:

arch

You should see either i386 (Intel / Rosetta) or arm64 (Apple silicon).
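Since this article targets Python developers, the same check can be done from inside an interpreter, which is useful to confirm which architecture a given Python installation runs under. The sketch below is illustrative: the helper name describe_architecture and its labels are assumptions, and only platform.machine() comes from the standard library.

```python
import platform


def describe_architecture(machine: str) -> str:
    """Map a machine identifier to a human-readable label."""
    if machine in ("arm64", "aarch64"):
        return "ARM (native on Apple silicon)"
    if machine in ("x86_64", "i386", "AMD64"):
        return "Intel, or Rosetta on Apple silicon"
    return f"unknown ({machine})"


if __name__ == "__main__":
    # Reports the architecture of the running interpreter process.
    print(describe_architecture(platform.machine()))
```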

Installing language-specific tools

To manage multiple versions of Python, Node, the AWS CLI, Cargo, and more, this guide uses mise.

You can run two setups—one native and one under Rosetta—so environments match each architecture, as described in the mise macOS Rosetta notes.

Follow that installation path; you should then have the x86_64 binary available as:

mise-x64 --version

You can also use the standard arm64 mise from Homebrew if you do not need x86_64-specific toolchains:

brew install mise

You can then install Python, Node, and many other tools with mise.

Example: install Python for an x86_64 toolchain:

mise-x64 use python@3.10

Installing PHP

It is a bit fiddly, but doable: you can use mise-x64 as follows:

PHP_CONFIGURE_OPTIONS="--with-openssl=$(brew --prefix openssl) --with-iconv=$(brew --prefix libiconv)" mise-x64 use php@8.4 --global

Because ARM support can be limited, an x86_64 build is often more reliable here.

Containerization tools

For containers, several options exist.

The most common is Docker; open-source alternatives such as Podman exist as well.

Docker

You will usually need IT to install Docker Desktop, because it requires elevated privileges.

Podman

Podman can replace Docker for many workflows and does not require the same elevated setup. Install it with Homebrew or follow the Podman Desktop macOS installation guide:

brew install podman

You can alias podman to docker for convenience in your .zshrc:

alias docker=podman

The Podman CLI is largely compatible with the Docker CLI.

Try:

docker run hello-world

On first use you may need to initialize the Podman machine:

podman machine init
podman machine start

See the Podman Desktop macOS troubleshooting page if something fails.

Disclaimer: Some stacks, such as Testcontainers for Python, expect a real Docker daemon and may not work fully with Podman. podman-mac-helper and socket compatibility can help, but that often still needs cooperation from IT.

Proxies and certificates

Your company may use an HTTP proxy with its own TLS certificate. This section skips why that exists and focuses on how to work with it.

Tools such as the Azure CLI may fail until trust is configured; see Azure CLI: work behind a proxy. That example is a good check that your proxy and certificates are set up correctly.

As documented, set REQUESTS_CA_BUNDLE to the path of your combined CA bundle. The same idea applies when Python tools need HTTPS access to install dependencies.

Creating a certificate bundle

Build one PEM file that merges system and corporate roots; you will point tools at it.

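Create a directory and an environment variable, then reload your shell configuration so the variable is set in the current terminal:

mkdir -p $HOME/certs
echo 'export CORPORATE_CERT_DIR=$HOME/certs' >> ~/.zshrc
source ~/.zshrc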

Export certificates from the keychains and concatenate them into allCAbundle.pem:

security export -t certs -f pemseq -k /Library/Keychains/System.keychain -o $CORPORATE_CERT_DIR/selfSignedCAbundle.pem
security export -t certs -f pemseq -k /System/Library/Keychains/SystemRootCertificates.keychain -o $CORPORATE_CERT_DIR/bundleCA.pem
cat $CORPORATE_CERT_DIR/bundleCA.pem $CORPORATE_CERT_DIR/selfSignedCAbundle.pem > $CORPORATE_CERT_DIR/allCAbundle.pem

To inspect the file:

cat $CORPORATE_CERT_DIR/allCAbundle.pem
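Printing the whole bundle is hard to read, so a quick sanity check can help. The following sketch counts the certificate blocks in a PEM file by looking at the standard BEGIN/END markers. The function names are hypothetical, and the check says nothing about whether the certificates are valid or trusted.

```python
from pathlib import Path

BEGIN_MARKER = "-----BEGIN CERTIFICATE-----"
END_MARKER = "-----END CERTIFICATE-----"


def count_certificates(pem_text: str) -> int:
    """Count certificate blocks in PEM text, failing on unbalanced markers."""
    begins = pem_text.count(BEGIN_MARKER)
    ends = pem_text.count(END_MARKER)
    if begins != ends:
        raise ValueError(f"unbalanced PEM markers: {begins} BEGIN vs {ends} END")
    return begins


def count_certificates_in_file(path: str) -> int:
    """Read a PEM file and count the certificates it contains."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return count_certificates(text)
```

A count of zero, or a ValueError, suggests that one of the export commands produced an empty or truncated file.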

Using the bundle

Append exports to .zshrc. Different tools use different variables; setting several covers most cases:

echo 'export REQUESTS_CA_BUNDLE=$CORPORATE_CERT_DIR/allCAbundle.pem' >> ~/.zshrc
echo 'export SSL_CERT_FILE=$CORPORATE_CERT_DIR/allCAbundle.pem' >> ~/.zshrc
echo 'export CURL_CA_BUNDLE=$CORPORATE_CERT_DIR/allCAbundle.pem' >> ~/.zshrc
echo 'export NODE_EXTRA_CA_CERTS=$CORPORATE_CERT_DIR/allCAbundle.pem' >> ~/.zshrc

After that, az and typical package managers (uv, poetry, npm, etc.) should work through the proxy.

If you are off the corporate network and the proxy is disabled, you may need to unset these variables (for example unset REQUESTS_CA_BUNDLE) before commands like az login, depending on your environment.
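When a tool still fails with a TLS error, it helps to confirm what the current shell actually exports. Here is a small sketch that reports, for each of the four variables above, whether it is unset, points to a missing file, or looks fine. The function name check_ca_variables and the status labels are assumptions for illustration.

```python
import os
from typing import Dict, Mapping

CA_VARIABLES = (
    "REQUESTS_CA_BUNDLE",
    "SSL_CERT_FILE",
    "CURL_CA_BUNDLE",
    "NODE_EXTRA_CA_CERTS",
)


def check_ca_variables(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a status per variable: 'unset', 'missing file' or 'ok'."""
    report = {}
    for name in CA_VARIABLES:
        value = env.get(name)
        if not value:
            report[name] = "unset"
        elif not os.path.isfile(value):
            report[name] = "missing file"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_ca_variables(os.environ).items():
        print(f"{name}: {status}")
```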

Java specifics

The JVM uses its own trust store. You can import a PEM into a JKS with keytool.

You can run keytool from a container so you do not install a full JDK locally if you prefer not to.

The following example mounts your cert directory and runs keytool inside an image that already ships Java (here apache/spark; any image with keytool is fine):

docker run -v $CORPORATE_CERT_DIR:/opt/java/openjdk/lib/security/jssecacerts -it apache/spark keytool -import -v -trustcacerts -alias endeca-ca -file /opt/java/openjdk/lib/security/jssecacerts/my_custom_certificate.pem -keystore /opt/java/openjdk/lib/security/jssecacerts/truststore.ks

You cannot point keytool at the merged allCAbundle.pem in every case—you often need a single PEM for the corporate issuing CA. Export the right certificate from Keychain Access or your IT docs.

Example lookup:

security find-certificate -p -c "corporate_proxy_name" /Library/Keychains/System.keychain

Then export it:

security find-certificate -a -c "corporate_proxy_name" -p /Library/Keychains/System.keychain >$CORPORATE_CERT_DIR/my_custom_certificate.pem

Or obtain it from a TLS handshake (adjust host and options to match your environment):

curl -s -o /dev/null -w '%{certs}' https://example.com > $CORPORATE_CERT_DIR/my_custom_certificate.pem

Import with keytool as above; you will be prompted for a keystore password.

Point the JVM at the JKS when you run apps, for example:

-Djavax.net.ssl.trustStore=/opt/java/openjdk/lib/security/jssecacerts/truststore.ks -Djavax.net.ssl.trustStorePassword=your_password

Storing paths and passwords in environment variables (without committing secrets) keeps builds repeatable:

echo 'export CORPORATE_JKS_CERT_PATH=$CORPORATE_CERT_DIR/truststore.ks' >> ~/.zshrc
echo 'export CORPORATE_JKS_CERT_PASS=your_password' >> ~/.zshrc

Docker and corporate TLS

At build time

When you docker build, the build may also need your CA bundle for HTTPS.

You can pass files in with BuildKit --build-context, for example:

docker build -f your_dockerfile --build-context config=$CORPORATE_CERT_DIR your_docker_context

In the Dockerfile, mount that context and set SSL_CERT_FILE / related variables before apt-get or similar:

RUN --mount=type=bind,from=config,target=/tmp/certs \
    export REQUESTS_CA_BUNDLE=/tmp/certs/allCAbundle.pem \
    && export SSL_CERT_FILE=/tmp/certs/allCAbundle.pem \
    && export CURL_CA_BUNDLE=/tmp/certs/allCAbundle.pem \
    && apt-get update && apt-get install -y git

At runtime

If a container needs the corporate CA at runtime, mount the PEM and append it to the image trust store in an entrypoint, or bake it in during build—what follows is one illustrative pattern:

docker run -it -v $CORPORATE_CERT_DIR/my_custom_certificate.pem:/certs/my_custom_certificate.pem postgres /bin/bash -c "cat /certs/my_custom_certificate.pem >> /etc/ssl/certs/ca-certificates.crt && /bin/bash"

(Adjust paths and base image to match your stack.)

Conclusion

Hopefully this helps you stay compliant with corporate constraints while remaining productive.

If something is unclear or a topic is missing, say so in the comments.

Update (2026-04-18): Added Java and Docker notes for proxy certificates, and PHP installation. Proofreading.

How to be Test Driven with Spark: Chapter 5: Leverage spark in a container

Nicoda-27 — Sat, 15 Mar 2025 07:33:58 +0000

The goal of this tutorial is to provide a way to easily be test driven with Spark on your local setup without using cloud resources.

This is a series of tutorials and the initial chapters can be found in:

In chapter 3, it was demonstrated that the current testing approach relies on Java being available on the developer setup. As mentioned, this is not ideal as there is limited control and unexpected behavior can happen. A good testing practice is to have reproducible and idempotent tests. This means:

Launching the tests any number of times should always produce the same results
A test should leave a clean slate after it has run: there should be no side effects (no files written, no changed environment variables, no leftover data in a database, etc.)

This matters because otherwise you will spend most of your time relaunching tests after spurious failures, never sure whether you actually broke something or whether the test is failing randomly. In the end, you will stop trusting the tests and skip some of them, which defeats the purpose.
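As one illustration of the "no side effects" rule, here is a sketch of a context manager that sets environment variables for the duration of a block and restores the previous state afterwards, even if the block raises. The name temporary_env is hypothetical; in a pytest suite, the built-in monkeypatch fixture covers the same need.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(**overrides: str) -> Iterator[None]:
    """Set environment variables for the duration of a block, then restore them."""
    # Remember the previous value of each variable (None means it was not set).
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, previous in saved.items():
            if previous is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = previous
```

The same save, act, restore pattern applies to temporary files and database rows: whatever a test creates, it removes before it finishes.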

Why use a container?

If you are unfamiliar with the concept of containers and Docker images, I suggest you have a look at Docker. It will be leveraged here to start the Spark server for the tests. Note that there are other open-source alternatives for containerization, such as podman or nerdctl.

Docker will be used hereafter as it has become the de facto standard for most companies, and it is available on the GitHub CI runners. It will be assumed that you have enough knowledge of the technology to use it.

Container with spark connect

There is a subtlety that needs to be understood. Previously, the spark_session in Python communicated with a local Java Virtual Machine (JVM): it used the java binary to create a swarm of workers that handled the data processing. At the end, all the results were collected and sent back to the spark_session, which exposed them to the Python code.

If you simply move this into a container, the spark_session will never be able to find the JVM, because the java binary now lives inside the container. The container you want to create needs a way to communicate with the spark_session over the network. Luckily, Spark Connect provides a solution, and its documentation is a must-read. This is the chosen approach to containerize the Spark server and the worker creation.

Spark already provides a Docker image that you will leverage. If you don't have Docker available on your setup, you will need to install it; see the official documentation.

Let's uninstall OpenJDK to make sure spark_session will use the new setup. This requires elevated privileges:

apt-get autoremove openjdk-8-jre

You can now relaunch the tests. They are expected to fail with the following errors:

ERROR tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
ERROR tests/test_minimal_transfo.py::test_transfo_w_synthetic_data - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

Start the container

You need to start the container with Spark Connect. You can launch:

docker run -p 8081:8081 -e SPARK_NO_DAEMONIZE=True --name spark_connect apache/spark /opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' --conf spark.connect.grpc.binding.port=8081

It will print a lot in the terminal and at the end you should have:

24/12/27 14:04:27 INFO SparkConnectServer: Spark Connect server started at: 0:0:0:0:0:0:0:0%0:8081

This shows that the Spark server is up and running.

Each argument in the above command has a meaning:

docker run is the Docker command to start a container
-p 8081:8081 is an argument to docker run that publishes port 8081 of the container on port 8081 of the host
-e SPARK_NO_DAEMONIZE=True is an environment variable passed to the container; it is necessary for the server to run as a foreground process
--name spark_connect names the created container
apache/spark is the Docker image that is used; if you have never used it, it will be downloaded from Docker Hub

The rest of the line is the command that will be executed inside the container. It contains multiple elements:

/opt/spark/sbin/start-connect-server.sh is the script that starts the Spark Connect server
org.apache.spark.deploy.master.Master is an argument to the script; here the script is asked to deploy a Master server, and the same script can be used to deploy a Worker
--packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 is an optional argument to pass specific versions of the Spark Connect and Delta dependencies
--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' is an extra argument that asks the server to write the Ivy cache to /tmp inside the container; it is not mandatory
--conf spark.connect.grpc.binding.port=8081 is an extra argument to start the server on port 8081 inside the container

The last argument is where the magic happens: the server is started on port 8081, and Docker publishes this container port on the same port of the Docker host. This means a Spark Connect server is now available at sc://localhost:8081
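If you want to confirm that the published port is reachable before running the tests, a plain TCP check is enough. The sketch below uses only the standard library; the function name is_port_open is an assumption, and a successful connection only proves that something is listening on the port, not that the Spark Connect server is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Port 8081 reachable:", is_port_open("localhost", 8081))
```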

Use the container

Keep the previous terminal open so the server keeps running, and open a new terminal. Now run:

pytest -k test_transfo_w_synthetic_data -s

The same error should appear: the spark_session needs to be adapted to connect to the server you have just created. In tests/conftest.py:

@pytest.fixture(scope="session")
def spark_session() -> Generator[SparkSession, Any, Any]:
    yield (
        SparkSession.builder.remote("sc://localhost:8081")  # type: ignore
        .appName("Testing PySpark Example")
        .getOrCreate()
    )

This gives the Spark session the URL of the Spark Connect server.

You also need an extra dependency, which is mandatory to communicate with the Spark Connect server. Note the usage of extras in uv:

uv add pyspark --extra connect

As this project uses Python 3.12, another error will appear related to distutils: it was removed from the standard library in Python 3.12, yet some dependencies still require it. You will have to add:

uv add setuptools

Now you can run:

pytest -k test_minimal_transfo -s

It should run successfully, and you should also see logs from the Spark server in the docker run terminal.

Improve the container usage

As mentioned at the beginning of this chapter, the tests need to leave a clean slate. With the previous approach, a container is still running even though the tests are done, which is not ideal.

To improve this, you will leverage testcontainers, which makes it easy to create and remove Docker containers at the test level.

uv add testcontainers --dev

Now, the container can be started at the session fixture level. In tests/conftest.py, you can add an extra fixture:

from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

@pytest.fixture(scope="session")
def spark_connect_start():
    kwargs = {
        "entrypoint": "/opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' --conf spark.connect.grpc.binding.port=8081",
    }
    with (
        DockerContainer(
            "apache/spark",
        )
        .with_bind_ports(8081, 8081)
        .with_env("SPARK_NO_DAEMONIZE", "True")
        .with_kwargs(**kwargs) as container
    ):
        _ = wait_for_logs(
            container, "SparkConnectServer: Spark Connect server started at"
        )
        yield container

This will create a container with the previously described arguments. Because the fixture yields from inside the with block, the container is stopped and removed at the end of the test session. There is an extra step with:

        _ = wait_for_logs(
            container, "SparkConnectServer: Spark Connect server started at"
        )

This ensures the container is yielded only once SparkConnectServer: Spark Connect server started at has appeared in the container logs. It is necessary to wait for the server to be ready before it can be called.

The yielded value is the container, which also exposes the host address of the server. You need to reuse it in the spark_session fixture:
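The waiting step boils down to polling the logs until a marker shows up or a deadline passes. The sketch below illustrates that idea with the standard library only. It is not how testcontainers implements wait_for_logs; wait_for_marker and read_logs are hypothetical names used for the illustration.

```python
import time
from typing import Callable


def wait_for_marker(
    read_logs: Callable[[], str],
    marker: str,
    timeout: float = 60.0,
    interval: float = 0.5,
) -> float:
    """Poll read_logs() until marker appears; return the elapsed seconds or raise TimeoutError."""
    start = time.monotonic()
    while True:
        if marker in read_logs():
            return time.monotonic() - start
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"{marker!r} not found in logs after {timeout} seconds")
        time.sleep(interval)
```

The timeout matters: without it, a server that fails to start would make the test session hang instead of failing with a clear error.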

@pytest.fixture(scope="session")
def spark_session(spark_connect_start: DockerContainer) -> Generator[SparkSession, Any, Any]:
    ip = spark_connect_start.get_container_host_ip()
    yield (
        SparkSession.builder.remote(f"sc://{ip}:8081")  # type: ignore
        .appName("Testing PySpark Example")
        .getOrCreate()
    )

You can now stop the container you started before:

docker stop spark_connect

And run the tests:

pytest

You will notice all the tests are passing, and at the end of the test session there are no running containers.

The following command shows which containers are still running. The Spark container should not appear.

docker ps

Conclusion

You are now able to run local tests using Spark, so you can quickly iterate on your codebase and implement new features. You no longer depend on a Spark server being launched for you in the cloud, nor wait for it to process the data.

The feedback loop is quicker, you no longer pay a cloud provider for testing purposes, and you provide an easy setup for developers to iterate on your project.

They can launch pytest and the container handling is transparent; this also means less documentation for you to write to describe the expected developer setup.

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, with one branch per chapter.

What's next

Several ideas come to mind on how to improve our very small codebase:

Leverage devcontainer to improve CI and local development
Templatize the repository for easier reuse with the help of ffizer
Explore ibis to handle multiple transformation backends transparently

How to be Test Driven with Spark: Chapter 4 - Leaning into Property Based Testing

Nicoda-27 — Sun, 09 Mar 2025 08:38:56 +0000

The goal of this tutorial is to provide a way to easily be test driven with Spark on your local setup without using cloud resources.

This is a series of tutorials and the initial chapters can be found in:

The test that you implemented in Chapter 3 is great, yet not complete as it covers only a limited amount of data. As Spark is used to process data at scale, you have to test at scale too.

There are several solutions. The first one is to take a snapshot of production data and reuse it at the test level (meaning integration tests or local tests). The second one is to generate synthetic data based on the data schema. With the second approach, you will be leaning into property based testing.

The second approach will be leveraged here, as test case creation is delegated to an automated generator.

The Python ecosystem provides Hypothesis for proper property based testing, and Faker for fake data generation. Hypothesis is far more powerful than Faker in the sense that it generates test cases for you based on data properties (being a string, being an integer, etc.) and shrinks the failing test cases when unexpected behavior happens. Faker will be used here to generate synthetic data based on business properties.

A data driven test

You need two new fixtures similar to persons and employments that will generate synthetic data. First you need to install faker as a dev dependency:

uv add faker --dev

You can create persons_synthetic in tests/conftest.py like so:

@pytest.fixture(scope="session")
def persons_synthetic(spark_session: SparkSession) -> Generator[DataFrame, Any, Any]:
    fake = Faker()

    nb_elem = fake.pyint(1, 100_000)
    data = [
        (i, fake.first_name(), fake.last_name(), fake.date()) for i in range(nb_elem)
    ]
    yield spark_session.createDataFrame(
        data,
        ["id", "PersonalityName", "PersonalitySurname", "birth"],
    )

In the above, a data frame with a random number of rows (between 1 and 100 000) is generated; feel free to increase the upper bound to generate larger data frames. Fake names, surnames and dates are generated on the fly according to business needs.

You can also create employments_synthetic in tests/conftest.py. There is a foreign key dependency on persons_synthetic that needs to be handled:

@pytest.fixture(scope="session")
def employments_synthetic(
    spark_session: SparkSession, persons_synthetic: DataFrame
) -> Generator[DataFrame, Any, Any]:
    fake = Faker()
    persons_sample = persons_synthetic.sample(0.8)
    person_ids_sample = persons_sample.select(collect_list("id")).first()[0]

    data = [(idx, id_fk, fake.job()) for idx, id_fk in enumerate(person_ids_sample)]
    yield spark_session.createDataFrame(
        data,
        ["id", "person_fk", "Employment"],
    )

The foreign key is reused from a sample of persons_synthetic, and job names are generated on the fly.
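The relationship between the two fixtures can be shown without Spark or Faker. The sketch below swaps them for plain tuples and Python's random module (an explicit substitution, not what the fixtures use), and all function names are hypothetical. It shows two properties worth keeping in mind: every foreign key points at an existing person, and a fixed seed makes the generated data reproducible.

```python
import random
from typing import List, Tuple

FIRST_NAMES = ["George", "Henry", "Benjamin"]  # stand-in for fake.first_name()
JOBS = ["engineer", "printer", "farmer"]  # stand-in for fake.job()


def make_persons(rng: random.Random, max_rows: int = 1_000) -> List[Tuple[int, str]]:
    """Generate between 1 and max_rows rows shaped as (id, first_name)."""
    nb_elem = rng.randint(1, max_rows)
    return [(i, rng.choice(FIRST_NAMES)) for i in range(nb_elem)]


def make_employments(
    rng: random.Random, persons: List[Tuple[int, str]], fraction: float = 0.8
) -> List[Tuple[int, int, str]]:
    """Keep each person with probability fraction and attach a job to it."""
    sampled_ids = [person_id for person_id, _ in persons if rng.random() < fraction]
    return [(idx, person_id, rng.choice(JOBS)) for idx, person_id in enumerate(sampled_ids)]


def foreign_keys_are_valid(persons, employments) -> bool:
    """Check that every person_fk refers to an existing person id."""
    person_ids = {person_id for person_id, _ in persons}
    return all(person_fk in person_ids for _, person_fk, _ in employments)


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed gives the same data on every run
    persons = make_persons(rng)
    employments = make_employments(rng, persons)
    print(len(persons), len(employments), foreign_keys_are_valid(persons, employments))
```

Seeding ties back to reproducibility: when a generated data set makes a test fail, the same seed reproduces the same data.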

The test can now be created:

def test_transfo_w_synthetic_data(
    persons_synthetic: DataFrame, employments_synthetic: DataFrame, spark_session
):
    processor = DataProcessor(spark_session)
    df_out: DataFrame = processor.run(persons_synthetic, employments_synthetic)

    assert not df_out.isEmpty()
    assert set(df_out.columns) == set(
        ["name", "surname", "date_of_birth", "employment"]
    )

You can then launch pytest -k test_transfo_w_synthetic_data -s, which should pass.

How to handle slow tests

You might notice that test_transfo_w_synthetic_data is a bit slow. Indeed, it generates a decent amount of data (even though far from big data scale), modifies the data frames and joins them together.

In a test driven approach, a quick feedback loop is necessary to iterate quickly on your local setup. Yet, these tests need to be launched anyway as they validate behavior with a decent amount of data.

A solution is to add markers to tests like so:

import pytest
...

@pytest.mark.slow
def test_transfo_w_synthetic_data(
    persons_synthetic: DataFrame, employments_synthetic: DataFrame, spark_session
):
...

This tag can be leveraged by pytest to filter out tests at execution time, see the documentation.

The expected markers also need to be declared for pytest in pyproject.toml:

[tool.pytest.ini_options]
pythonpath = ["src"]
markers = ["slow"]

Pytest is now aware of this new marker when launching:

pytest --markers

You can now launch:

pytest -m "not slow"

Only the tests not marked as slow will be run.

In the ci, there is nothing to change, as pytest launches all the tests by default.

What's next?

The next chapter will focus on test repeatability by improving how Java is used for Spark at the test level.

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, each in its own branch:

[15/03/25 UPDATE]: Chapter 5 has been released
[18/04/26 UPDATE]: Chapter 6 has been released

How to be Test Driven with Spark: Chapter 3 - First Spark test

Nicoda-27 — Sat, 01 Mar 2025 08:01:00 +0000

The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.

This is a series of tutorials and the initial chapters can be found in:

Chapter 3: Implement a first test with spark

This chapter will focus on implementing a first spark data manipulation with an associated test. It will go through the issues that will be encountered and how to solve them.

The data

A dummy use case is used to demonstrate the workflow.

The scenario is that production data is made of two tables persons and employments with the following schema and data types. Here is a sample of the data.

Persons

id: int	PersonalityName: str	PersonalitySurname: str	birth: datetime(str)
1	George	Washington	1732-02-22
2	Henry	Ford	1863-06-30
3	Benjamin	Franklin	1706-01-17
4	Martin	Luther King Jr.	1929-01-15

Employments

id: int	person_fk: int	Employment
1	1	president
2	2	industrialist
3	3	inventor
4	4	minister

The goal is to rename the columns and to join the data. The data here is just a sample; it's overkill to use spark to process data like this. Yet, in a big data context, you need to foresee that the data will contain more rows and more complex joins. The sample is only here as a demonstration.
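
To make the target concrete before any spark code is written, here is a minimal sketch of the expected result in plain Python. The output column names are the ones asserted by the tests later in this chapter; the transform function and the dictionary based join are only illustrative, and they assume at most one employment per person.

```python
# Sample rows copied from the two tables above.
persons = [
    {"id": 1, "PersonalityName": "George", "PersonalitySurname": "Washington", "birth": "1732-02-22"},
    {"id": 2, "PersonalityName": "Henry", "PersonalitySurname": "Ford", "birth": "1863-06-30"},
]
employments = [
    {"id": 1, "person_fk": 1, "Employment": "president"},
    {"id": 2, "person_fk": 2, "Employment": "industrialist"},
]

# Old column name -> new column name.
PERSONS_RENAME = {
    "PersonalityName": "name",
    "PersonalitySurname": "surname",
    "birth": "date_of_birth",
}


def transform(persons: list[dict], employments: list[dict]) -> list[dict]:
    """Rename the columns and left-join employments onto persons."""
    job_by_person = {row["person_fk"]: row["Employment"] for row in employments}
    result = []
    for person in persons:
        renamed = {new: person[old] for old, new in PERSONS_RENAME.items()}
        # Left join: a person without employment keeps a None value.
        renamed["employment"] = job_by_person.get(person["id"])
        result.append(renamed)
    return result


rows = transform(persons, employments)
print(rows[0])
```

The birth date stays a string in this sketch; parsing it into a date is left to the actual implementation.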

The dummy test

First, you need to add spark dependencies

uv add pyspark

Before diving into the implementation, you need to make sure you can reproduce a very simple use case. It's not worth diving into complex data manipulation if you are not able to reproduce a simple documentation snippet.

You will write your first test in tests/test_minimal_transfo.py, starting with a simple data frame creation using pyspark.

from pyspark.sql import SparkSession


def test_minimal_transfo():
    spark: SparkSession = (
        SparkSession.builder.master("local")
        .appName("Testing PySpark Example")
        .getOrCreate()
    )

    df = spark.createDataFrame([(3, 4), (1, 2)], ["col1", "col2"])
    df.show()

The first part creates or fetches a local spark session; the second part leverages the session to create a data frame.

Then you can launch:

pytest -k test_minimal_transfo -s

With a minimal developer setup, this will likely fail: spark needs Java, which you might be missing, and the following error will be displayed:

FAILED tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number

It's a bit annoying, because you need to have Java installed on your dev setup, the ci setup and all your collaborators' setups. In future chapters, a better alternative will be described.
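
Since the error message above does not say much, a small check can tell you up front whether Java is reachable at all. This is a minimal sketch using only the standard library; the java_available helper is hypothetical, and it assumes Java is looked up through JAVA_HOME first and the PATH otherwise.

```python
import os
import shutil
from pathlib import Path


def java_available() -> bool:
    """Return True when a java executable can be located."""
    java_home = os.environ.get("JAVA_HOME")
    # Assumption: JAVA_HOME, when set, takes precedence over the PATH.
    if java_home and (Path(java_home) / "bin" / "java").exists():
        return True
    return shutil.which("java") is not None


if not java_available():
    print("Java was not found: the spark session will most likely fail to start.")
```

Such a check could for instance be called at the start of a test session to fail early with a readable message.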

There are different flavors of Java; you can simply install the OpenJDK one. It will require elevated privileges:

apt-get install openjdk-8-jre

You can now relaunch:

pytest -k test_minimal_transfo -s

and it should display

+----+----+                                                                     
|col1|col2|
+----+----+
|   3|   4|
|   1|   2|
+----+----+

This is a small victory, but you can now use a local spark session to manage data frames, yay !

The real test case - version 0

The previous sample shows that the spark session plays a pivotal role; it will be instantiated differently in the tests context than in the production context.

This means we can leverage a pytest fixture to be reused by all tests later on; it can be created at the session level so there is only one spark session for the whole test suite. You can create a tests/conftest.py to factorize this common behavior. If you are not familiar with pytest and fixtures, it's advised to have a look at the documentation.

In tests/conftest.py:

from typing import Any, Generator

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session() -> Generator[SparkSession, Any, Any]:
    yield (
        SparkSession.builder.master("local")
        .appName("Testing PySpark Example")
        .getOrCreate()
    )

Then, it can be reused in tests/test_minimal_transfo.py:

from pyspark.sql import SparkSession

def test_minimal_transfo(spark_session: SparkSession):

    df = spark_session.createDataFrame([(3, 4), (1, 2)], ["col1", "col2"])
    df.show()

You can again run pytest -k test_minimal_transfo -s to check the behavior has not changed. It's important in a test driven approach to keep launching the tests after code modification to ensure nothing was broken.

To be closer to the business context, you can implement a data transformation object. There will be a clear separation between data generation and data transformation. You can do so in src/pyspark_tdd/data_processor.py, which matches the import used in the tests:

from pyspark.sql import DataFrame, SparkSession


class DataProcessor:
    def __init__(self, spark_session: SparkSession):
        self.spark_session = spark_session

    def run(self, persons: DataFrame, employments: DataFrame) -> DataFrame:
        raise NotImplementedError

Now that there is a prototype for DataProcessor, the test can be improved to actually assert on the output, like so in tests/test_minimal_transfo.py:

from pyspark.sql import DataFrame, SparkSession

from pyspark_tdd.data_processor import DataProcessor


def test_minimal_transfo(spark_session: SparkSession):
    persons = spark_session.createDataFrame(
        [
            (1, "George", "Washington", "1732-02-22"),
            (2, "Henry", "Ford", "1863-06-30"),
            (3, "Benjamin", "Franklin", "1706-01-17"),
            (4, "Martin", "Luther King Jr.", "1929-01-15"),
        ],
        ["id", "PersonalityName", "PersonalitySurname", "birth"],
    )
    employments = spark_session.createDataFrame(
        [
            (1, 1, "president"),
            (2, 2, "industrialist"),
            (3, 3, "inventor"),
            (4, 4, "minister"),
        ],
        ["id", "person_fk", "Employment"],
    )
    processor = DataProcessor(spark_session)
    df_out: DataFrame = processor.run(persons, employments)

    assert not df_out.isEmpty()
    assert set(df_out.columns) == set(
        ["name", "surname", "date_of_birth", "employment"]
    )

The example above will ensure that the data frame fits some criteria, but it will raise a NotImplementedError as you have to implement the actual data processing. It's intended: the actual processing code can be created after testing is properly set up.

The actual test is still not ideal as test case generation is part of the test itself. Pytest parametrization can be leveraged:

import pytest
from pyspark.sql import DataFrame, SparkSession

from pyspark_tdd.data_processor import DataProcessor


@pytest.mark.parametrize(
    "persons,employments",
    [
        (
            (
                [
                    (1, "George", "Washington", "1732-02-22"),
                    (2, "Henry", "Ford", "1863-06-30"),
                    (3, "Benjamin", "Franklin", "1706-01-17"),
                    (4, "Martin", "Luther King Jr.", "1929-01-15"),
                ],
                ["id", "PersonalityName", "PersonalitySurname", "birth"],
            ),
            (
                [
                    (1, 1, "president"),
                    (2, 2, "industrialist"),
                    (3, 3, "inventor"),
                    (4, 4, "minister"),
                ],
                ["id", "person_fk", "Employment"],
            ),
        )
    ],
)
def test_minimal_transfo(spark_session: SparkSession, persons, employments):
    persons = spark_session.createDataFrame(*persons)
    employments = spark_session.createDataFrame(*employments)

    processor = DataProcessor(spark_session)
    df_out: DataFrame = processor.run(persons, employments)

    assert not df_out.isEmpty()
    assert set(df_out.columns) == set(
        ["name", "surname", "date_of_birth", "employment"]
    )

The above example shows how test case generation can be separated from the test run. It allows seeing at first glance what this test is about, without the noise of the test data. Most likely, the test data frames could be reused in another test, so it needs to be refactored again. The test part becomes:

from pyspark.sql import DataFrame, SparkSession

from pyspark_tdd.data_processor import DataProcessor


def test_minimal_transfo(spark_session: SparkSession, persons: DataFrame, employments: DataFrame):
    processor = DataProcessor(spark_session)
    df_out: DataFrame = processor.run(persons, employments)

    assert not df_out.isEmpty()
    assert set(df_out.columns) == set(
        ["name", "surname", "date_of_birth", "employment"]
    )

and two fixtures persons and employments are created in tests/conftest.py:

@pytest.fixture(scope="session")
def persons(spark_session: SparkSession) -> Generator[DataFrame, Any, Any]:
    yield spark_session.createDataFrame(
        [
            (1, "George", "Washington", "1732-02-22"),
            (2, "Henry", "Ford", "1863-06-30"),
            (3, "Benjamin", "Franklin", "1706-01-17"),
            (4, "Martin", "Luther King Jr.", "1929-01-15"),
        ],
        ["id", "PersonalityName", "PersonalitySurname", "birth"],
    )


@pytest.fixture(scope="session")
def employments(spark_session: SparkSession) -> Generator[DataFrame, Any, Any]:
    yield spark_session.createDataFrame(
        [
            (1, 1, "president"),
            (2, 2, "industrialist"),
            (3, 3, "inventor"),
            (4, 4, "minister"),
        ],
        ["id", "person_fk", "Employment"],
    )

You can now relaunch pytest -k test_minimal_transfo -s and notice the NotImplementedError being raised, which is a good thing. The code has changed 3 times, yet the behavior remains the same, and the tests confirm it.

The real test case - version 1

Now that proper testing is in place, the source code can be implemented. Variations are possible; the intent here is not to provide the best source code, but the best way to test:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import to_date


class DataProcessor:
    def __init__(self, spark_session: SparkSession):
        self.spark_session = spark_session
        self.persons_rename = {
            "PersonalityName": "name",
            "PersonalitySurname": "surname",
            "birth": "date_of_birth",
        }
        self.employments_rename = {"Employment": "employment"}

    def run(self, persons: DataFrame, employments: DataFrame) -> DataFrame:
        persons = persons.withColumn(
            "birth", to_date(persons.birth)
        ).withColumnsRenamed(colsMap=self.persons_rename)

        # withColumnsRenamed (plural) is the method that accepts a colsMap mapping
        employments = employments.withColumnsRenamed(colsMap=self.employments_rename)
        joined = persons.join(
            employments, persons.id == employments.person_fk, how="left"
        )
        joined = joined.drop("id", "person_fk")
        return joined

If you rerun pytest -k test_minimal_transfo -s, then the test is successful.

What about ci?

A strong dependency on Java is now in place: running the tests in ci depends on the ci having Java installed. This is an issue because it requires the developer to maintain part of the dev setup outside of the python ecosystem, which means extra steps for anyone who wants to launch the tests.

Keep in mind that there is limited control over the developer setup. What if the Java already installed there is not spark compliant? It will be frustrating for the developer to investigate and most likely install another Java version, which might impact other projects. See the mess.
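
To investigate which Java a setup actually provides, the version banner can be parsed. The sketch below is a hypothetical helper that extracts the major version from text such as the first line printed by java -version; which versions are acceptable depends on your spark release, so the comparison itself is left out.

```python
import re
from typing import Optional


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from a `java -version` banner."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x, for example 1.8.0_392.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


print(java_major_version('openjdk version "17.0.9" 2023-10-17'))
print(java_major_version('openjdk version "1.8.0_392"'))
```

Note that java -version writes its banner to stderr, so when calling it through subprocess.run with capture_output=True, the text to parse is in the stderr attribute of the result.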

Luckily, the ci runner on Github has Java installed for us; so the ci should run.

Clean up

You can now also clean up the repository to start from a clean slate. For instance, src/pyspark_tdd/multiply.py and tests/test_dummy.py can be removed.

What's next

Now, you have a comfortable setup to modify and tweak the code. You can run the tests and be confident that the results are reproducible.

In the next chapter, a more data driven approach to test case generation will be explored.

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, each in its own branch:

[09/03/25 UPDATE]: Chapter 4 has been released
[15/03/25 UPDATE]: Chapter 5 has been released

How to be Test Driven with Spark: Chapter 2 - CI

Nicoda-27 — Sun, 23 Feb 2025 10:21:54 +0000

The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.

This is a series of tutorials and the initial chapters can be found in:

Chapter 0 and 1

Chapter 2: Continuous Integration (ci)

Having a ci is mandatory for any project that aims at having multiple contributors. In this chapter, a proposed ci will be implemented.

The implementation of a ci is specific to the collaborative platform (Github, Gitlab, Bitbucket, Azure Devops etc.). This chapter will try to keep the ci as technology agnostic as possible.

Similar concepts are available in all ci platforms; you will have to transpose the concepts used here.

Content of the ci

The ci here will be very minimal but showcases concepts that you implemented in Chapter 1, namely:

Python setup
Project setup
Code Formatting
Test automation

There are many more additions to a continuous integration that will not be tackled here. A minimal ci is required to guarantee non-regression in terms of:

code styling rules, to guarantee that no individual contributor diverges from the coding style
tests, namely all tests must be passing

Implementation

Github provides extensive documentation for you to tweak your ci.

Github is expecting ci files to be provided at a specific location, you can therefore create a file in .github/workflows/ci.yaml.

In this file, you can add

name: Continuous Integration
run-name: Continuous Integration
on: [push]
jobs:
  Continuous-Integration:
    runs-on: ubuntu-latest

The name and run-name define the names of the pipeline that will run.
The on defines the event that will trigger the pipeline; push means the pipeline runs on every push to the repository.
The jobs defines a list of jobs; for the sake of simplicity, the ci is made of one job with multiple steps.
The runs-on defines the runner, namely the machine image the job executes on; ubuntu-latest is one of the images maintained by Github.

Now, in the steps section, we can add:

steps:
  - name: Check out repository code
    uses: actions/checkout@v4
  - uses: jdx/mise-action@v2
  - name: Run Formatting
    run: |
      uv run ruff check
  - name: Run Tests
    run: |
      uv run pytest

The actions/checkout@v4 is the Github action that checks out the current branch of the repository.
The jdx/mise-action@v2 is the Github action that will read the mise.toml and install everything for us.
The Run Formatting step will install the dependencies and run the formatting. If there is an error, the command will fail and so will the pipeline.
The Run Tests step will run the tests. If there is an error, the command will fail and so will the pipeline.

Ci as documentation

As it was stated, the ci is the only source of truth. If it passes on ci, it should pass on your local setup. If not, it means there are discrepancies between the ci setup and yours.

Going through the ci implementation will help you on reproducibility. Maybe you're not using the same way to install python version, or the same dependency management tool. You need to align your tools and the ones presented in chapter 1 help not to conflict with your local setup. You might have installed python package globally or you might have manually changed PYTHON_HOME or your PATH and this can easily be a mess.

To help on reproducibility, a dev container approach can be used. It means, the ci will run inside a container and this container can be reused as a developer environment. This will not be implemented for the moment.

A better ci structure

To improve readability and separate code formatting from testing, Github actions can be implemented as jobs with interdependencies. The workflow then becomes:

name: Continuous Integration
run-name: Continuous Integration
on: [push]
jobs:
  Formatting:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - uses: jdx/mise-action@v2
      - name: Run Formatting
        run: |
          uv run ruff check
  Tests:
    runs-on: ubuntu-latest
    needs: [Formatting]
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - uses: jdx/mise-action@v2
      - name: Run Tests
        run: |
          uv run pytest

Here we added needs: [Formatting] to create a dependency between ci jobs. It means the tests will not run until the code style is compliant; this will save some time and resources. Indeed, if the code is not formatted, don't even bother running the tests. In the execution graph, the Tests job only starts once Formatting has succeeded.
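
The ordering implied by needs can be modelled as a small dependency graph. The following sketch is only an illustration in Python with the standard graphlib module; it is not how Github schedules jobs internally, it just reproduces the resulting order for the two jobs of this workflow.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it needs, as declared in the workflow.
needs = {
    "Formatting": set(),
    "Tests": {"Formatting"},
}

# static_order() yields the jobs so that every dependency comes first.
execution_order = list(TopologicalSorter(needs).static_order())
print(" -> ".join(execution_order))  # prints "Formatting -> Tests"
```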

Some duplication can be seen here, which is not ideal: future improvements will have to be made in two places at the same time. This is technical debt that could be tackled with a composite action. We will consider it ok for now.

Caching dependency resolution

You will see additional steps in the ci.yaml, namely related to cache

      - name: Restore uv cache
        uses: actions/cache@v4
        with:
          path: /tmp/.uv-cache
          key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}
          restore-keys: |
            uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}
            uv-${{ runner.os }}

These steps cache the uv cache directory (/tmp/.uv-cache) and reuse it as long as uv.lock has not changed. The intent is to speed up the ci execution, as dependency resolution and installation can be time consuming.
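
The idea behind a key built from the hash of uv.lock can be sketched in a few lines of Python. This is an illustration of the principle only, not the exact algorithm behind hashFiles; the cache_key helper and the lock file contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(lock_file: Path, runner_os: str = "Linux") -> str:
    """Build a key that changes whenever the lock file content changes."""
    digest = hashlib.sha256(lock_file.read_bytes()).hexdigest()
    return f"uv-{runner_os}-{digest}"


# Demonstrate with a throwaway lock file (fake content).
with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "uv.lock"
    lock.write_text("first version of the lock file\n")
    key_before = cache_key(lock)
    lock.write_text("second version of the lock file\n")
    key_after = cache_key(lock)

print(key_before != key_after)  # prints True
```

As long as the lock file is untouched the key is stable and the cache is restored; any dependency change produces a new key and therefore a fresh cache entry.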

An extra step is added to minimize the size of the cache, as uv proposes such a feature, and an environment variable is added to configure the location of the cache.

      - name: Minimize uv cache
        run: uv cache prune --ci
    env:
      UV_CACHE_DIR: /tmp/.uv-cache

What's next

In the next chapter, you will implement your first spark code and a way to guarantee its test automation. This is long overdue as we spent 3 chapters on setup...

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, each in its own branch:

[03/03/25 UPDATE]: Chapter 3 has been released
[09/03/25 UPDATE]: Chapter 4 has been released
[15/03/25 UPDATE]: Chapter 5 has been released

How to be Test Driven with Spark: Chapter 0 and 1 - Modern Python Setup

Nicoda-27 — Sat, 15 Feb 2025 09:24:09 +0000

Chapter 0: Why this tutorial

The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.

Before diving deep into spark, we must first align on our setup environment to ease reproducibility; this will be the focus of this article.

The official documentation describes how to create tests with pyspark.

It requires a spark server with spark connect support to work, as described in the documentation.

As a reminder, with spark connect a specific server needs to be created so your tests can connect to this server and process the data as intended.

Why is it not enough?

Launching the server requires some extra requirements on your machine, namely a java virtual machine.
Launching the server requires a specific script called start-connect-server.sh which is to be found

Some data engineers might argue they can just use a spark server already deployed to be able to test; but there are several drawbacks to this approach:

You are being charged to launch simple tests or run experiments, keeping cloud providers very happy
You slow down the developer feedback loop, which is the time necessary to implement a feature and validate that no regression has been introduced. A developer is more confident that there is no regression when all tests are executed
You create external dependencies that you have no control of. You might encounter issues with testing when the cloud provider is down, when you don't have internet access or when someone changes the configuration of the server by accident.

The goal is to have a test environment that is self descriptive, quick to setup, quick to start and reliable.

Chapter 1: Setup

In this chapter, multiple tools will be introduced and set up. The intent is to have a clean python environment to reproduce the code. This is a very opinionated section, but it might be useful to challenge your existing tools with this section.

Python version management

Mise will be leveraged to handle python versions. It claims to be "The front-end to your dev env" and it will be used to install specific versions of languages and tools.

It can be used for much more, and it is strongly advised to look at the documentation to understand the true power of this tool, which is not limited to python development.

Mise first needs to be installed, see documentation for further instructions. You can launch the following:

curl https://mise.run | sh

Once installed, you will have to customize your .bashrc or your .zshrc (or the equivalent for your shell) to activate mise in your terminal.

echo 'eval "$(~/.local/bin/mise activate bash)"' >> ~/.bashrc

Mise can now be used to install python at a specific version with the following command:

mise install python@3.12

It will download a pre-compiled version of python and make it available globally.

Let's now use it, you first need to position yourself at the root of the project and launch:

mise use python@3.12

It will create a mise.toml file with the following section

[tools]
python = "3.12"

And a .python-version with the indication

3.12

With the help of these files, mise will be able to activate when located at the root of your project. It's also a great way to inform other contributors of the requirements to launch this project, without relying on a README that easily becomes outdated.
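
If you want to double check that the interpreter you are running matches the pinned version, a few lines of Python are enough. This is a minimal sketch with a hypothetical matches_pin helper; it assumes the pin only contains numeric components such as 3.12 or 3.12.8.

```python
import sys


def matches_pin(pin: str, version_info=sys.version_info) -> bool:
    """Check that an interpreter version starts with the pinned components."""
    pinned = tuple(int(part) for part in pin.strip().split("."))
    return tuple(version_info[: len(pinned)]) == pinned


# Explicit versions are passed here so the output does not depend on your machine.
print(matches_pin("3.12\n", (3, 12, 8)))  # prints True
print(matches_pin("3.12", (3, 11, 4)))  # prints False
```

In a project, the pin would typically be read from the file, for instance with Path(".python-version").read_text(), and compared against the default sys.version_info.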

Python dependency management

A tool to help us add, remove and download dependencies is necessary. Uv will be used later on as it's very fast and easy to use.

To install it, see the official documentation; in this tutorial, mise will be leveraged:

mise use uv

This will both install and set up uv for the project. See how mise.toml (https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml) has been modified with the addition of:

[tools]
python = "3.12"
uv = "latest"

Now it can be used to initialize the project, namely:

uv init

This will create a folder structure for you and a hello.py. In this project, we have customized it a bit to add a tests section and a pyspark_tdd package as part of src so it looks like:

.
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .mise.toml
└── pyproject.toml

Ignoring files

Every repository needs a set of files to ignore before adding them to a commit. This is done via a .gitignore file and anyone can leverage existing templates for your language of preference.

If you start a project from scratch, you will need to first setup git

git init

Github maintains ignore files template for each language. You can leverage it with:

curl -L -o .gitignore https://raw.githubusercontent.com/github/gitignore/refs/heads/main/Python.gitignore

In this project, the python template is the one chosen for the .gitignore.

Adding formatting and linting

Python

Linters and formatters are powerful tools to enforce code writing rules among developers. They take away the pain of having to care how the code is written at the syntax level.

Ruff will be leveraged to format our python code as it's very powerful and can be run at file saves without latency.

Ruff will be added as a project dev dependency. A dev dependency is one that the project does not need to run, it can be related to tests, experimentation, formatting etc. Everything that is not meant to be shipped to production must be retained as a dev dependency to keep your python package as self contained as possible.

We can add ruff like so:

uv add ruff --dev

This will add a dev dependency in the pyproject.toml with

[dependency-groups]
dev = [
    "ruff>=0.8.4",
]

This will also create a .venv at the current working directory. You might notice that the .venv is ignored from git which is intended. Indeed, you don't want to commit your .venv directory as it's a copy of the dependencies of your project and can be quite extensive.

It will also create a uv.lock that documents the versions of your direct dependencies and of the indirect ones (the dependencies of your dependencies). This mechanism allows segregating the dependencies of your project from the rest.

Your project should now look like

.
├── .venv
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .gitignore
├── .mise.toml
├── pyproject.toml
└── uv.lock

Other languages

As a project is not just python files, but also configuration, pipelines, documentation etc., formatting these files is necessary too.

Documenting how these files will be formatted is done using editorconfig.
We will use the one from the editorconfig website.

Your Integrated Development Environment (IDE)

Whichever IDE you use, it's very important to set up formatting on file save, to save you time and remove the pain of handling it by hand.

If you are using VSCode, you can install the ruff extension and add the following to your settings.json

"editor.formatOnSave": true,
"[python]": {
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
        "source.fixAll": "explicit",
        "source.organizeImports": "explicit"
    },
    "editor.defaultFormatter": "charliermarsh.ruff"
},

The first test

To see if everything works as expected, you will write a very simple unit test. In a test driven approach, the test is written before the source code.

A test framework is required to launch the test automation, pytest will be used. You need to add it as a dev dependency

uv add pytest --dev

You can create a tests/test_dummy.py with the following code:

from your_python_package.multiply import multiply


def test_my_dummy_function():
    assert multiply(1, 2) == 2

This requires a function multiply that can be defined as in src/your_python_package/multiply.py:

def multiply(a: int, b: int) -> int:
    return a * b

You can now run the tests; make sure you're using the right python from the .venv

which python

should display something like $HOME/somepath/your_project/.venv/bin/python. If not, you can open a new terminal so that mise is able to resolve it.
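
The same check can be done from python itself, which is handy when the shell gives confusing answers. A minimal sketch, relying only on the standard sys module; the running_in_virtualenv helper name is made up for the example.

```python
import sys


def running_in_virtualenv() -> bool:
    """Return True when the interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print(sys.executable)
print("virtual environment:", running_in_virtualenv())
```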

Then run

pytest

Then it will display an error:

tests/test_dummy.py:1: in <module>
    from your_python_package.multiply import multiply
E   ModuleNotFoundError: No module named 'your_python_package'

You need to add an extra entry for pytest to detect the src layout. In pyproject.toml, you can add:

[tool.pytest.ini_options]
pythonpath = ["src"]

Now

pytest

should display

============================================================= test session starts ===============================================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/somepath/src/your_project
configfile: pyproject.toml
collected 1 item                                                                                                                                 

tests/test_dummy.py .                                                                                                                      [100%]

=============================================================== 1 passed in 0.01s ================================================================
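
The pythonpath entry added to pyproject.toml roughly amounts to putting the src directory at the head of the python search path. The sketch below reproduces this by hand in a temporary directory; demo_package is a hypothetical package name used only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project:
    # Recreate a tiny src layout: src/demo_package/multiply.py
    src_dir = Path(project) / "src"
    package_dir = src_dir / "demo_package"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "multiply.py").write_text(
        "def multiply(a: int, b: int) -> int:\n    return a * b\n"
    )

    # Without this line, the import below raises ModuleNotFoundError.
    sys.path.insert(0, str(src_dir))
    try:
        module = importlib.import_module("demo_package.multiply")
        result = module.multiply(1, 2)
    finally:
        # Leave the search path as it was found.
        sys.path.remove(str(src_dir))

print(result)  # prints 2
```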

You can do some housekeeping and remove the unnecessary src/your_python_package/hello.py.

You now have a proper setup to start working.

What's next

Now that one test is implemented, the continuous integration (ci) must be set up. In a collaborative way of working, the ci is the only source of truth to tell whether something is broken or not.

Notice we still have not touched upon any spark components; it's very important to have a clean, reproducible codebase before diving in.

That will be the topic of the next chapter.

You can find the original materials in spark_tdd. This repository exposes the expected repository layout at the end of each chapter, each in its own branch:

[23/02/25 UPDATE]: Chapter 2 has been released
[03/03/25 UPDATE]: Chapter 3 has been released
[09/03/25 UPDATE]: Chapter 4 has been released
[15/03/25 UPDATE]: Chapter 5 has been released