<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: iamtodor</title>
    <description>The latest articles on Forem by iamtodor (@iamtodor).</description>
    <link>https://forem.com/iamtodor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F921293%2F7574d5de-3e07-40c9-b249-101aefad60fe.jpeg</url>
      <title>Forem: iamtodor</title>
      <link>https://forem.com/iamtodor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iamtodor"/>
    <language>en</language>
    <item>
      <title>Manipulating Complex Structures in BigQuery: A Guide to DDL Operations</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Fri, 26 May 2023 07:41:17 +0000</pubDate>
      <link>https://forem.com/freshbooks/manipulating-complex-structures-in-bigquery-a-guide-to-ddl-operations-dhm</link>
      <guid>https://forem.com/freshbooks/manipulating-complex-structures-in-bigquery-a-guide-to-ddl-operations-dhm</guid>
      <description>&lt;p&gt;This guide aims to provide a comprehensive understanding of handling changes in complex structures within BigQuery using Data Definition Language (DDL) statements. It explores scenarios involving top-level columns as well as nested columns, addressing limitations with the existing &lt;code&gt;on_schema_change&lt;/code&gt; configuration in dbt for BigQuery.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
How-to

&lt;ul&gt;
&lt;li&gt;Define schema&lt;/li&gt;
&lt;li&gt;Add records&lt;/li&gt;
&lt;li&gt;Add top-level field&lt;/li&gt;
&lt;li&gt;
Change top-level field type

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CAST&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER COLUMN&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Juggle with STRUCT

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tmp table&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;update STRUCT using SET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Bonus notes

&lt;ul&gt;
&lt;li&gt;Create a regular table&lt;/li&gt;
&lt;li&gt;Create an external table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Contact info&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Currently, dbt's &lt;code&gt;on_schema_change&lt;/code&gt; configuration only tracks schema changes related to top-level columns in BigQuery. Nested column changes, such as adding, removing, or modifying a &lt;code&gt;STRUCT&lt;/code&gt;, are not captured. This guide delves into extending the functionality of &lt;code&gt;on_schema_change&lt;/code&gt; to encompass nested columns, enabling a more comprehensive schema change tracking mechanism. I'll show further on exactly what &lt;code&gt;top-level&lt;/code&gt; and &lt;code&gt;nested&lt;/code&gt; columns are.&lt;/p&gt;

&lt;p&gt;Moreover, it's important to note that BigQuery explicitly states on their &lt;a href="https://cloud.google.com/bigquery/docs/managing-table-schemas#add_a_nested_column_to_a_record_column" rel="noopener noreferrer"&gt;Add a nested column to a RECORD column page&lt;/a&gt; that adding a new nested field to an existing RECORD column using a SQL DDL statement is not supported:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Adding a new nested field to an existing RECORD column by using a SQL DDL statement is not supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When it comes to dropping one or more columns, the &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_table_drop_column_statement" rel="noopener noreferrer"&gt;ALTER TABLE DROP COLUMN statement&lt;/a&gt; page states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot use this statement to drop the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitioned columns&lt;/li&gt;
&lt;li&gt;Clustered columns&lt;/li&gt;
&lt;li&gt;Nested columns inside existing RECORD fields&lt;/li&gt;
&lt;li&gt;Columns in a table that has row access policies&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is an ongoing proposal with a discussion around this topic:&lt;br&gt;
&lt;a href="https://github.com/dbt-labs/dbt-bigquery/issues/446"&gt;on_schema_change should handle non-top-level schema changes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This limitation further highlights the need for alternative approaches to manipulate complex structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How-to
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Define schema
&lt;/h3&gt;

&lt;p&gt;To begin, let's dive into the SQL syntax and create the "person" table. This table will store information about individuals, including their ID, name, and address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
        &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; 
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Add records
&lt;/h3&gt;

&lt;p&gt;Add a couple of records to the table.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"USA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"New-York"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Jennifer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Canada"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Toronto"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is how the schema looks in the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stdl5uw34qnnj1bg0bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stdl5uw34qnnj1bg0bt.png" alt="table schema"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how the data is represented when querying it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzu6olu1voyc8u6vgpzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzu6olu1voyc8u6vgpzn.png" alt="data representation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add top-level field
&lt;/h3&gt;

&lt;p&gt;Imagine we were tasked to add a new field &lt;code&gt;has_car&lt;/code&gt; of type &lt;code&gt;INT64&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;
    &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- add record right away&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"James"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"USA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"New-York"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When you add a new column to an existing BigQuery table, the past records will have null values for that newly added column. This behavior is expected because the new column was not present at the time those records were inserted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zb0gso60q6abs3m0k57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zb0gso60q6abs3m0k57.png" alt="null value for past records"&gt;&lt;/a&gt;&lt;/p&gt;
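&lt;p&gt;Old records don't have to stay &lt;code&gt;NULL&lt;/code&gt;, though. A follow-up &lt;code&gt;UPDATE&lt;/code&gt; can backfill the new column for past records; this is only a sketch, and the default value &lt;code&gt;0&lt;/code&gt; is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

-- backfill has_car for the records inserted before the column existed
UPDATE
    dataset_name.person
SET
    has_car = 0
WHERE
    has_car IS NULL;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;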

&lt;h3&gt;
  
  
  Change top-level field type
&lt;/h3&gt;

&lt;p&gt;Then your customer changes their mind, and now the &lt;code&gt;has_car&lt;/code&gt; column has to have a &lt;code&gt;BOOL&lt;/code&gt; type instead of &lt;code&gt;INT64&lt;/code&gt;. Here are two possible ways to tackle this task.&lt;/p&gt;

&lt;p&gt;Before diving deep into the possible approaches, it's worth mentioning that BigQuery has &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules" rel="noopener noreferrer"&gt;conversion rules&lt;/a&gt; that you need to consider. For instance, you can cast &lt;code&gt;BOOL&lt;/code&gt; to &lt;code&gt;INT64&lt;/code&gt;, but you cannot cast &lt;code&gt;INT64&lt;/code&gt; to &lt;code&gt;DATETIME&lt;/code&gt;.&lt;/p&gt;
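&lt;p&gt;A quick way to check a conversion rule is to try it in a standalone query. The following is a sketch; &lt;code&gt;SAFE_CAST&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; instead of raising an error when a conversion fails:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

SELECT
    CAST(TRUE AS INT64) AS bool_as_int,        -- allowed conversion: returns 1
    SAFE_CAST("oops" AS INT64) AS failed_cast  -- returns NULL instead of an error


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;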

&lt;p&gt;In BigQuery, &lt;code&gt;CAST&lt;/code&gt; and &lt;code&gt;ALTER COLUMN&lt;/code&gt; are two different approaches for modifying the data type of a column in a table.&lt;br&gt;
Let's explore each approach:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CAST&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;CAST()&lt;/code&gt; function is used to convert the data type of a column or an expression in a SQL query. It allows you to convert a column from one data type to another during the query execution. However, it does not permanently modify the data type of the column in the table's schema.&lt;/p&gt;

&lt;p&gt;The following is an example of using &lt;code&gt;CAST&lt;/code&gt; to convert a column's data type in a query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  &lt;code&gt;ALTER COLUMN&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ALTER COLUMN&lt;/code&gt; statement is used to modify the data type of a column in the table's schema. It allows you to permanently change the data type of a column in the table, affecting all existing and future data in that column.&lt;/p&gt;

&lt;p&gt;Here's an example of using &lt;code&gt;ALTER COLUMN&lt;/code&gt; to modify the data type of a column:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It's important to note that &lt;code&gt;ALTER COLUMN&lt;/code&gt; is a DDL statement and can only be executed as a separate operation outside of a regular SQL query. Once the column's data type is altered, it will affect all future operations and queries performed on that table.&lt;/p&gt;

&lt;p&gt;In summary, &lt;code&gt;CAST&lt;/code&gt; is used to convert the data type of a column during query execution, while &lt;code&gt;ALTER COLUMN&lt;/code&gt; is used to permanently modify the data type of a column in the table's schema. The choice between the two depends on whether you want to temporarily convert the data type for a specific query or permanently change the data type for the column in the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Juggle with STRUCT
&lt;/h3&gt;

&lt;p&gt;If we want to apply changes to nested fields, such as adding, removing, or modifying the &lt;code&gt;STRUCT&lt;/code&gt; itself, there are a few different ways to do so.&lt;/p&gt;

&lt;h4&gt;
  
  
  temp table
&lt;/h4&gt;

&lt;p&gt;The first, quite simple approach is using a &lt;code&gt;temp&lt;/code&gt; table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
        &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- fill then new zip_code field with the default 0 value&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zip_code&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`person_past`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`dataset_name.person_tmp`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`person`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, this approach has some drawbacks and considerations to keep in mind: when modifying a BigQuery table using a temporary table, you need to create a new table with the desired modifications and then copy the data from the original table to the temporary table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This process involves duplicating the data, which increases storage usage and leads to additional storage costs; it also consumes additional query processing resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It may also impact performance, especially for large tables, as the production resources you have are limited and shared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity and consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using a temporary table to modify a BigQuery table introduces additional steps and complexity to the process. You need to write queries to create the temporary table, copy data, modify the data, overwrite the original table, and then drop the temporary table. This adds complexity to the overall workflow and may require more code and query execution time.&lt;/p&gt;

&lt;p&gt;Last, but not least, during the modification process, there might be a period where the original table is not accessible or is in an inconsistent state. If other processes or applications depend on the original table's data, this downtime or inconsistency could impact their operations.&lt;/p&gt;

&lt;p&gt;So this is not the best approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  update STRUCT using SET
&lt;/h4&gt;

&lt;p&gt;Another scenario is changing a nested field's type. Imagine we would like to change the &lt;code&gt;zip_code&lt;/code&gt; type from &lt;code&gt;INT64&lt;/code&gt; to &lt;code&gt;STRING&lt;/code&gt;, and this time we don't want to use the &lt;code&gt;tmp&lt;/code&gt; table approach. The second way is to &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#update_statement" rel="noopener noreferrer"&gt;UPDATE&lt;/a&gt; the &lt;code&gt;STRUCT&lt;/code&gt; using &lt;code&gt;SET&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;
    &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;address_past&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address_past&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this case, only the &lt;code&gt;STRUCT&lt;/code&gt; field is duplicated, which is good enough.&lt;/p&gt;

&lt;h4&gt;
  
  
  CREATE OR REPLACE TABLE
&lt;/h4&gt;

&lt;p&gt;The last approach is using &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement" rel="noopener noreferrer"&gt;&lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the same way, we can remove nested fields: we simply select the needed fields and omit the ones we aren't interested in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;SELECT&lt;/span&gt;
                &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
            &lt;span class="k"&gt;EXCEPT&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Bonus notes
&lt;/h2&gt;

&lt;p&gt;If you need to recreate a table schema from a separate dataset in your own dataset, the easiest way is to use CLI commands, as they are much faster and less error-prone for creating tables.&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a regular table
&lt;/h3&gt;

&lt;p&gt;Here is an example of how to save a table schema, referenced by its table ID, to JSON format with the &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_show" rel="noopener noreferrer"&gt;bq show&lt;/a&gt; command:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq show \
    --schema \
    --format=prettyjson \
    project_name:dataset_name.table_name &amp;gt; table_name.json


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And now you can create a table in your dataset using &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_mk" rel="noopener noreferrer"&gt;bq mk&lt;/a&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mk \
    --table \
    your_dataset_name.table_name \
    table_name.json


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create an external table
&lt;/h3&gt;

&lt;p&gt;Here is an example of creating a table definition in JSON format using &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_mkdef" rel="noopener noreferrer"&gt;bq mkdef&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mkdef \
    --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect=false \
    'gs://bucket_name/prefix/*.json' &amp;gt; table_def


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;mkdef&lt;/code&gt; command creates a table definition in JSON format for data stored in Cloud Storage or Google Drive. That definition will then be used to create an external table.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mk \
    --table \
    --external_table_definition=table_def \
    dataset_name.table_name \
    table_name.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
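&lt;p&gt;If you prefer to stay in Python, a table definition file like the one &lt;code&gt;bq mkdef&lt;/code&gt; writes can be sketched as plain JSON. This is only a sketch under assumptions: the field names follow the BigQuery REST API's &lt;code&gt;ExternalDataConfiguration&lt;/code&gt;, and the bucket and prefix are placeholders:&lt;/p&gt;

```python
import json

# Sketch: build a table definition equivalent to the `bq mkdef` output above.
# Field names follow ExternalDataConfiguration in the BigQuery REST API;
# the bucket/prefix below are placeholders, not real paths.
table_def = {
    "sourceFormat": "NEWLINE_DELIMITED_JSON",
    "autodetect": False,
    "sourceUris": ["gs://bucket_name/prefix/*.json"],
}

with open("table_def", "w") as f:
    json.dump(table_def, f, indent=2)
```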
&lt;h2&gt;
  
  
  Contact info
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>bigquery</category>
    </item>
    <item>
      <title>Python's BigQuery External CSV tables and null_marker challenge</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Mon, 22 May 2023 13:38:03 +0000</pubDate>
      <link>https://forem.com/iamtodor/pythons-bigquery-external-cvs-tables-and-nullmarker-challenge-on4</link>
      <guid>https://forem.com/iamtodor/pythons-bigquery-external-cvs-tables-and-nullmarker-challenge-on4</guid>
      <description>&lt;p&gt;Testing data pipelines is crucial for ensuring their efficiency and reliability. When dealing with data sources like BigQuery external tables, it becomes necessary to emulate these data sources to perform thorough pipeline testing.&lt;/p&gt;

&lt;p&gt;So the idea is to create an external CSV table using Python's BigQuery client library.&lt;/p&gt;

&lt;p&gt;For additional context, we treat an empty value as &lt;code&gt;""&lt;/code&gt; and &lt;code&gt;null&lt;/code&gt; as &lt;code&gt;\N&lt;/code&gt;.&lt;/p&gt;
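&lt;p&gt;To make this convention concrete, here is a small illustration (my own, not part of the actual pipeline) of why the distinction matters: Python's &lt;code&gt;csv&lt;/code&gt; module returns an empty string for an empty field and the literal two-character text &lt;code&gt;\N&lt;/code&gt; for the marker, so without a &lt;code&gt;null_marker&lt;/code&gt; option you have to map the marker to &lt;code&gt;None&lt;/code&gt; yourself:&lt;/p&gt;

```python
import csv
import io

# Sample data following the convention: "" means empty string, \N means null.
data = 'id,name\n1,""\n2,\\N\n'

rows = list(csv.reader(io.StringIO(data)))[1:]  # skip the header row
# The reader hands back the marker as the literal text \N, not as a null,
# so we map it to None ourselves.
cleaned = [[None if value == "\\N" else value for value in row] for row in rows]
print(cleaned)  # [['1', ''], ['2', None]]
```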

&lt;p&gt;Unfortunately, this library does not allow you to specify &lt;code&gt;null_marker&lt;/code&gt; for CSV files while creating an external CSV table using &lt;a href="https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.external_config.ExternalConfig"&gt;&lt;code&gt;ExternalConfig&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;null_marker&lt;/code&gt; option is not presented in &lt;a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#CsvOptions"&gt;&lt;code&gt;CsvOptions&lt;/code&gt;&lt;/a&gt;, in contrast to, for example, &lt;code&gt;skip_leading_rows&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;external_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExternalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExternalSourceFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;external_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skip_leading_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;external_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;null_marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;N'&lt;/span&gt; &lt;span class="c1"&gt;# does not work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simultaneously, it's important to note that while a Data Definition Language (DDL) statement that includes the &lt;code&gt;null_marker&lt;/code&gt; option may work successfully when executed directly in the BigQuery console, it might not function as expected when submitted using the Python client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`field1`&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`field2`&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;uris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'gs://bucket_name/prefix_name/file_name.csv'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skip_leading_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;null_marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;N'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you execute this query in the BigQuery console, your table will be created successfully and ready for querying. However, if you attempt to submit the same query using the Python library, you will encounter the following exception:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;create_table_ddl&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BadRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; &lt;span class="n"&gt;Syntax&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Illegal&lt;/span&gt; &lt;span class="n"&gt;escape&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; \&lt;span class="n"&gt;N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when attempting to escape the &lt;code&gt;\\\\N&lt;/code&gt; sequence, it does not produce the expected outcome. The actual null values are not processed as intended; instead, they remain as the literal &lt;code&gt;\\N&lt;/code&gt; rather than being interpreted as null.&lt;/p&gt;
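&lt;p&gt;The escaping behavior can be reproduced without BigQuery at all. This small illustration (mine, not the library's code) shows what string each Python-side escaping actually puts into the DDL text:&lt;/p&gt;

```python
# What each Python escaping actually sends in the DDL text.
single = "null_marker = '\\N'"    # two characters after the quote: \N
double = "null_marker = '\\\\N'"  # three characters after the quote: \\N

print(single)  # null_marker = '\N'   -> rejected: Illegal escape sequence
print(double)  # null_marker = '\\N'  -> accepted, but matched literally, not as null
```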

&lt;p&gt;If you have insights or solutions regarding the challenge of handling null values in emulated tables, kindly share them in the comments section. This will help improve the guide and provide a comprehensive solution for others facing a similar issue.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>python</category>
      <category>externaltables</category>
    </item>
    <item>
      <title>The Practical Guide to Utilizing DBT Packages for Data Transformation</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Thu, 12 Jan 2023 10:18:36 +0000</pubDate>
      <link>https://forem.com/freshbooks/the-practical-guide-to-utilizing-dbt-packages-for-data-transformation-4138</link>
      <guid>https://forem.com/freshbooks/the-practical-guide-to-utilizing-dbt-packages-for-data-transformation-4138</guid>
      <description>&lt;h2&gt;
  
  
  Table of content
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What are packages&lt;/li&gt;
&lt;li&gt;Why use it&lt;/li&gt;
&lt;li&gt;Local packages&lt;/li&gt;
&lt;li&gt;Dbt hub packages&lt;/li&gt;
&lt;li&gt;Verify packages are installed&lt;/li&gt;
&lt;li&gt;Macros usage&lt;/li&gt;
&lt;li&gt;Models usage&lt;/li&gt;
&lt;li&gt;dbt_modules under the hood&lt;/li&gt;
&lt;li&gt;Disclaimer&lt;/li&gt;
&lt;li&gt;Contact&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are packages
&lt;/h2&gt;

&lt;p&gt;dbt packages are collections of macros, models, and other resources that are used to extend the functionality of dbt. Packages can be used to share common code and resources across multiple dbt projects, and can be published and installed from the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt;, from GitHub or can be stored locally and installed by specifying the path to the project.&lt;/p&gt;

&lt;p&gt;In the same way other ecosystems share code through libraries, dbt shares it through packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use it
&lt;/h2&gt;

&lt;p&gt;dbt packages are so powerful because many of the analytic problems we encounter are shared across organizations. &lt;/p&gt;

&lt;p&gt;There are a few general benefits to using packages in dbt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reusability: packages allow you to reuse code across multiple projects and models. This can save you a lot of time and effort, as you don't have to copy and paste the same code into multiple places.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Collaboration: packaging your models in a package allows multiple people to work on the same models at the same time. You can use version control systems like git to manage changes to the models, and use tools like the &lt;code&gt;dbt test&lt;/code&gt; command to ensure that the models are correct and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sharing: packaging your models or macros in a package allows you to share them with others. You can publish your package on the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; or on GitHub, and others can install and use your models in their own dbt projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Managing: packages make it easier to manage your codebase. You can use version control to track changes to your package, and you can easily install and update packages in your dbt project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modularity: packaging your models in a package allows you to break your data pipeline into smaller, more manageable pieces, which are easier to understand and maintain. This can streamline the development and upkeep of your dbt project over time. This is especially useful if you are working on a large project with many different models and transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Overall, using packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt.&lt;/p&gt;

&lt;p&gt;For example, if your aim is to extract the day of the week, there is no sense to reinvent the wheel and develop this macro on your own. Rather, we might want to find the right package and make use of it &lt;a href="https://github.com/calogica/dbt-date#day_of_weekdate-isoweektrue" rel="noopener noreferrer"&gt;&lt;code&gt;{{ dbt_date.day_of_week(column_name) }}&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
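&lt;p&gt;As a plain-Python illustration (mine, not dbt's implementation) of what such a day-of-week extraction computes, using ISO numbering where Monday is 1 and Sunday is 7:&lt;/p&gt;

```python
import datetime

# ISO day of week: Monday = 1 ... Sunday = 7.
d = datetime.date(2023, 1, 12)  # the publication date of this post, a Thursday
print(d.isoweekday())  # 4
```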

&lt;h2&gt;
  
  
  Local packages
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use local packages to organize and reuse code within a single dbt project. Local packages are stored within your project directory and are only available to the models in that project. The best use case for a local package is a module that you want to live in the same repository, alongside the main project.&lt;/p&gt;

&lt;p&gt;To create a reusable local package do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consider you have the following dbt project dir structure
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a &lt;code&gt;packages&lt;/code&gt; dir; this is where we will put our first local package.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a &lt;code&gt;packages.yml&lt;/code&gt; file; this is where we will reference our first local package.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch packages.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Before moving on, please verify that you have the following dir structure:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Jump into the &lt;code&gt;packages&lt;/code&gt; dir and init your package with the name &lt;code&gt;local_utils&lt;/code&gt;. The name of the package is arbitrary.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd packages
dbt init local_utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will create a package with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Next, you need to change the project &lt;code&gt;name&lt;/code&gt; in &lt;code&gt;dbt_project.yml&lt;/code&gt; from &lt;code&gt;my_new_project&lt;/code&gt; to a meaningful and self-explainable name. This name will be used further as macro or model references. Let's call it &lt;code&gt;local_utils&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Specify our local package in the before-mentioned &lt;code&gt;packages.yml&lt;/code&gt; as follows:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/dbt/packages/local_utils/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you provide an absolute path to the package; otherwise, it will not work.&lt;/p&gt;
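&lt;p&gt;A tiny illustrative guard (my own sketch; the path is the example from above, and the YAML parsing is elided) that fails fast when a local package path is not absolute:&lt;/p&gt;

```python
import os

# Illustrative guard: local package paths listed under `local:` in packages.yml
# (here hard-coded with the example path from above) must be absolute.
local_package_paths = ["/opt/dbt/packages/local_utils/"]

for path in local_package_paths:
    assert os.path.isabs(path), f"local package path must be absolute: {path}"
print("all local package paths are absolute")
```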

&lt;p&gt;Save the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command to install the package. This will link the package and make it available to your dbt models.&lt;/p&gt;

&lt;p&gt;Here is an example of what the &lt;code&gt;dbt deps&lt;/code&gt; command might look like when you install your local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt deps
Running with dbt=0.21.1
Installing /opt/dbt/packages/local_utils/
  Installed from &amp;lt;local @ /opt/dbt/packages/local_utils/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can observe a newly created &lt;code&gt;dbt_modules&lt;/code&gt; dir that contains a symlink pointing to &lt;code&gt;local_utils&lt;/code&gt;. This means that our local package is ready to be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree dbt_modules
dbt_modules
└── utils -&amp;gt; /opt/dbt/packages/local_utils/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dbt hub packages
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use packages from the dbt Hub to share your code with others and to reuse code from other users in your own projects. The &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; is a community-driven library of packages that you can use to extend the functionality of dbt.&lt;/p&gt;

&lt;p&gt;Probably the best examples of community-driven third-party packages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/" rel="noopener noreferrer"&gt;dbt_utils&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" rel="noopener noreferrer"&gt;dbt_expectations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few benefits to using dbt Hub packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reusable code: dbt Hub packages allow you to reuse code that has been shared by other users, teams and companies. This can save you a lot of time and effort, as you don't have to write the same logic from scratch, test and maintain it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The major advantages of any open-source project:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Community support: When you use packages from the dbt Hub, you can benefit from the support and expertise of the dbt community. If you have questions or run into issues with a package, you can ask for help on the dbt community forums or Slack channel.&lt;/li&gt;
&lt;li&gt;Collaboration: By sharing your own packages on the dbt Hub, you can make your code available to other users. This can help to foster collaboration and improve the overall quality of your code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, using dbt Hub packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt, and to collaborate with others in the dbt community.&lt;/p&gt;

&lt;p&gt;To install a package from the dbt Hub in your dbt project, you will need to add the package to your &lt;code&gt;packages.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here is the basic process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; and search for the package you want to install.&lt;/li&gt;
&lt;li&gt;Click on the package to view its details.&lt;/li&gt;
&lt;li&gt;Copy the package name and version from the installation instructions.&lt;/li&gt;
&lt;li&gt;Open your &lt;code&gt;packages.yml&lt;/code&gt; file and add the package name and version to the packages list. It should look something like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.7.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command to install the package. This will download the package and make it available to your dbt models.&lt;/p&gt;

&lt;p&gt;Here is an example of what the &lt;code&gt;dbt deps&lt;/code&gt; command might look like when you install dbt hub package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt deps
Installing dbt-labs/dbt_utils@0.7.6
  Installed from version 0.7.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike the local package, the hub package was physically downloaded into the &lt;code&gt;dbt_modules&lt;/code&gt; dir.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 2 dbt_modules
dbt_modules
└── dbt_utils
    ├── CHANGELOG.md
    ├── LICENSE
    ├── README.md
    ├── RELEASE.md
    ├── dbt_project.yml
    ├── docker-compose.yml
    ├── etc
    ├── integration_tests
    ├── macros
    └── run_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Verify packages are installed
&lt;/h2&gt;

&lt;p&gt;To verify that a package is installed in your dbt project, you can check the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check the &lt;code&gt;packages.yml&lt;/code&gt; file: This file lists all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed in the packages list, then it is installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the &lt;code&gt;dbt deps&lt;/code&gt; command:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This command will show you a list of all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed, then it is installed.&lt;/li&gt;
&lt;li&gt;In the root dbt project dir, you will observe a new dir &lt;code&gt;dbt_modules/&lt;/code&gt; which contains the installed packages, ready to be used. &lt;strong&gt;NOTE&lt;/strong&gt;: the &lt;code&gt;dbt_modules/&lt;/code&gt; dir has to be added to &lt;code&gt;.gitignore&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your &lt;code&gt;packages.yml&lt;/code&gt; file contains a package that is not installed, then you will not be able to run any dbt command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt list
Encountered an error:
Compilation Error
  dbt found 1 package(s) specified in packages.yml, but only 0 package(s) installed in dbt_modules. Run dbt deps to install package dependencies.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So this is our guarantee that we will not have any package-installation issues at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Macros usage
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use packages to define custom macros that can be called from your dbt models. Here is an example of how you might use a package to define a custom macro.&lt;/p&gt;

&lt;p&gt;Here are a few examples of how you might use macros in dbt to perform common data transformations.&lt;/p&gt;

&lt;p&gt;For instance, let's create the following macro in our local package under &lt;code&gt;local_utils/macros/cents_to_dollars.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;cents_to_dollars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt; &lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we can call our macro as &lt;code&gt;{{ local_utils.cents_to_dollars(your_column_name) }}&lt;/code&gt;. The &lt;code&gt;local_utils&lt;/code&gt; package name comes from the &lt;code&gt;name&lt;/code&gt; in our package's &lt;code&gt;dbt_project.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Using a macro from a dbt Hub package is pretty much the same. Imagine we want to generate a surrogate key based on a few columns. This is functionality the &lt;code&gt;dbt-utils&lt;/code&gt; package we previously installed provides: &lt;a href="https://github.com/dbt-labs/dbt-utils#generate_surrogate_key-source" rel="noopener noreferrer"&gt;&lt;code&gt;{{ dbt_utils.generate_surrogate_key([field_a, field_b[,...]]) }}&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the macro usage pattern for a third-party package is &lt;code&gt;{{ package_name.macro_name() }}&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models usage
&lt;/h2&gt;

&lt;p&gt;As we created our own local package &lt;code&gt;local_utils&lt;/code&gt; as a prerequisite, which has the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree packages/local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important function in dbt is &lt;code&gt;ref()&lt;/code&gt;; it's impossible to build even moderately complex models without it. &lt;code&gt;ref()&lt;/code&gt; is how you reference one model from another inside your package. Here is how this looks in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"model_a"&lt;/span&gt;&lt;span class="p"&gt;)}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So if we would like to reference &lt;code&gt;my_first_dbt_model&lt;/code&gt; from &lt;code&gt;my_second_dbt_model&lt;/code&gt; within &lt;code&gt;local_utils&lt;/code&gt; package then we do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"my_first_dbt_model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to reference &lt;code&gt;my_first_dbt_model&lt;/code&gt; from our main project then we need to slightly change the way we call it. There is also a two-argument variant of the &lt;code&gt;ref&lt;/code&gt; function. With this variant, you can pass both a package name and model name to &lt;code&gt;ref&lt;/code&gt; to avoid ambiguity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"package_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our particular case it would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"local_utils"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"my_first_dbt_model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The &lt;code&gt;package_name&lt;/code&gt; should only include the name of the package, not the maintainer. For example, if we use the &lt;code&gt;dbt-labs/dbt_utils&lt;/code&gt; package, type &lt;code&gt;dbt_utils&lt;/code&gt; in that argument, not &lt;code&gt;dbt-labs/dbt_utils&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  dbt_modules under the hood
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;dbt_modules&lt;/code&gt; directory is a directory that is used by dbt to store packages and their models. When you install a package using the &lt;code&gt;dbt deps&lt;/code&gt; command, the package and its models are downloaded and stored in the &lt;code&gt;dbt_modules&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dbt_modules&lt;/code&gt; directory is located in the root directory of your dbt project. It contains a subdirectory for each installed package, and each package directory contains the package's models, macros, and other resources.&lt;/p&gt;

&lt;p&gt;The way dbt installs &lt;code&gt;local&lt;/code&gt; and &lt;code&gt;dbt hub&lt;/code&gt; packages is different.&lt;/p&gt;

&lt;p&gt;Assume the following &lt;code&gt;packages.yml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.7.6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/dbt/packages/local_utils/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would get the following layout under the generated &lt;code&gt;dbt_modules&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 2 dbt_modules
dbt_modules
├── dbt_utils
│   ├── CHANGELOG.md
│   ├── LICENSE
│   ├── README.md
│   ├── RELEASE.md
│   ├── dbt_project.yml
│   ├── docker-compose.yml
│   ├── etc
│   ├── integration_tests
│   ├── macros
│   └── run_test.sh
└── utils -&amp;gt; /opt/dbt/packages/local_utils/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned above, dbt creates a symlink for a local package, while for a dbt hub package it simply copies all the needed files into a folder of the same name.&lt;/p&gt;

&lt;p&gt;We use Google Cloud Composer to orchestrate all the transformation jobs. We basically copy our project to a GCS bucket with &lt;code&gt;gsutil -m rsync&lt;/code&gt;. Unfortunately, it does not support symbolic links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Since gsutil rsync is intended to support data operations (like moving a data set to the cloud for computational processing) and it needs to be compatible both in the cloud and across common operating systems, there are no plans for gsutil rsync to support operating system-specific file types like symlinks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Taken from the &lt;code&gt;gsutil rsync&lt;/code&gt; documentation: &lt;a href="https://cloud.google.com/storage/docs/gsutil/commands/rsync#be-careful-when-synchronizing-over-os-specific-file-types-symlinks,-devices,-etc" rel="noopener noreferrer"&gt;Be Careful When Synchronizing Over OS-Specific File Types (Symlinks, Devices, Etc.)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A possible solution is to compress everything locally into an archive, copy it to the bucket, and then unpack it on Composer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar -czvf dbt-project.tar.gz dbt-project
gsutil -m rsync dbt-project.tar.gz gs://$BUCKET/prefix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what those switches actually mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-c: Create an archive.
-z: Compress the archive with gzip.
-v: Display progress in the terminal while creating the archive, also known as “verbose” mode. The v is always optional in these commands, but it’s helpful.
-f: Allows you to specify the filename of the archive.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
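&lt;p&gt;One caveat worth noting: by default &lt;code&gt;tar&lt;/code&gt; stores a symlink as a symlink, so the local package behind &lt;code&gt;dbt_modules&lt;/code&gt; would still be missing after unpacking. Adding &lt;code&gt;-h&lt;/code&gt; (&lt;code&gt;--dereference&lt;/code&gt;) makes tar archive the files the link points to instead. A minimal sketch, with purely illustrative paths:&lt;/p&gt;

```shell
set -e
# Build a fake project containing a symlinked local package (illustrative paths)
rm -rf /tmp/pkg_demo
mkdir -p /tmp/pkg_demo/real /tmp/pkg_demo/project
echo "select 1" > /tmp/pkg_demo/real/model.sql
ln -s /tmp/pkg_demo/real /tmp/pkg_demo/project/utils

# -h (--dereference) replaces the symlink with the files it points to
tar -C /tmp/pkg_demo -czhf /tmp/pkg_demo/project.tar.gz project

# After unpacking, utils is a real directory with the package files inside
mkdir -p /tmp/pkg_demo/unpacked
tar -C /tmp/pkg_demo/unpacked -xzf /tmp/pkg_demo/project.tar.gz
ls -l /tmp/pkg_demo/unpacked/project/utils
```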



&lt;p&gt;These are the pitfalls we encountered when working with dbt packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;All this experience applies to dbt v0.21.1&lt;/li&gt;
&lt;li&gt;I am aware that since v1.0 the default directory name &lt;a href="https://docs.getdbt.com/guides/migration/versions/upgrading-to-v1.0#breaking-changes" rel="noopener noreferrer"&gt;changed&lt;/a&gt; from &lt;code&gt;dbt_modules&lt;/code&gt; to &lt;code&gt;dbt_packages&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Most of this guide should still apply to the latest dbt version&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Contact
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>gratitude</category>
      <category>learning</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Configuring python linting to be part of CI/CD using GitHub actions</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Thu, 15 Sep 2022 06:44:07 +0000</pubDate>
      <link>https://forem.com/freshbooks/configuring-python-linting-to-be-part-of-cicd-using-github-actions-1731</link>
      <guid>https://forem.com/freshbooks/configuring-python-linting-to-be-part-of-cicd-using-github-actions-1731</guid>
      <description>&lt;p&gt;Hello everyone, I am a DataOps Engineer at &lt;a href="https://www.freshbooks.com/" rel="noopener noreferrer"&gt;FreshBooks&lt;/a&gt;. In this article I would like to share my experience on configuration best practices for GitHub actions pipelines for linting.&lt;/p&gt;

&lt;p&gt;Freshbooks DataOps team has a linter configuration that developers can run before submitting a PR. We had an idea to integrate lint checks into our regular CI/CD pipeline. This adoption would eliminate potential errors, bugs, stylistic errors. We will basically enforce the common code style across the team.&lt;/p&gt;

&lt;p&gt;FreshBooks uses GitHub as a home for our code base, so we would like to use it as much as possible. Recently I finished this configuration so the linter and its checks are now part of a GitHub actions CI/CD workflow.&lt;/p&gt;

&lt;p&gt;This article has two major parts: the first one is linter configuration, and the second one is GitHub workflow configuration itself. Feel free to read all the parts, or skip some and jump into specific one you are interested in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Linters configuration

&lt;ul&gt;
&lt;li&gt;Disable unwanted checks&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Import error&lt;/li&gt;
&lt;li&gt;Tweaks for airflow code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

GitHub workflow actions CI/CD configurations

&lt;ul&gt;
&lt;li&gt;When to run it&lt;/li&gt;
&lt;li&gt;What files does it run against&lt;/li&gt;
&lt;li&gt;Run linter itself&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Contact&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Linters configuration
&lt;/h2&gt;

&lt;p&gt;Here are the linters and checks we are going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://flake8.pycqa.org/en/latest/" rel="noopener noreferrer"&gt;flake8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://flakeheaven.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;flakeheaven&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/psf/black" rel="noopener noreferrer"&gt;black&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PyCQA/isort" rel="noopener noreferrer"&gt;isort&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: the author assumes you are familiar with the above-mentioned linters, tools, and checks.&lt;/p&gt;

&lt;p&gt;I would like to share how to configure them for a Python project. I prepared a full &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration" rel="noopener noreferrer"&gt;GitHub Actions Python configuration demo repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;flakeheaven&lt;/code&gt; as a &lt;code&gt;flake8&lt;/code&gt; wrapper, which is very easy to configure in one single &lt;code&gt;pyproject.toml&lt;/code&gt;. The whole &lt;code&gt;pyproject.toml&lt;/code&gt; configuration file can be found in&lt;br&gt;
a &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration/blob/main/pyproject.toml" rel="noopener noreferrer"&gt;demo repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-pyproject-config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-pyproject-config.png%3Fraw%3Dtrue" alt="pyproject.toml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would say the config file is self-explanatory, so I will not dwell on it for long. Just a few notes about tiny tweaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disable unwanted checks
&lt;/h3&gt;

&lt;p&gt;A few checks that we don't want to see complaints about:&lt;/p&gt;

&lt;h4&gt;
  
  
  Documentation
&lt;/h4&gt;

&lt;p&gt;The default &lt;code&gt;flakeheaven&lt;/code&gt; configuration assumes every component is documented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint utils.py

utils.py
     1:   1 C0114 Missing module docstring (missing-module-docstring) [pylint]
  def custom_sum(first: int, second: int) -&amp;gt; int:
  ^
     1:   1 C0116 Missing function or method docstring (missing-function-docstring) [pylint]
  def custom_sum(first: int, second: int) -&amp;gt; int:
  ^
     5:   1 C0116 Missing function or method docstring (missing-function-docstring) [pylint]
  def custom_multiplication(first: int, second: int) -&amp;gt; int:


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are OK if not every module is documented, and likewise for functions and methods. We are not going to push documentation for documentation's sake. So we want to disable the &lt;code&gt;C0114&lt;/code&gt; and &lt;code&gt;C0116&lt;/code&gt; checks from pylint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-docs.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-docs.png%3Fraw%3Dtrue" alt="flakeheaven disable docs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Import error
&lt;/h4&gt;

&lt;p&gt;Our linter requirements live in a separate file, and we don't aim to mix them with our main production requirements. Hence, the linter would complain about imports, since the linter environment, quite obviously, does not have the production libraries installed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint . 

dags/dummy.py
     3:   1 E0401 Unable to import 'airflow' (import-error) [pylint]
  from airflow import DAG
  ^
     4:   1 E0401 Unable to import 'airflow.operators.dummy_operator' (import-error) [pylint]
  from airflow.operators.dummy_operator import DummyOperator
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So we need to disable &lt;code&gt;E0401&lt;/code&gt; check from &lt;code&gt;pylint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-import-checks.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-import-checks.png%3Fraw%3Dtrue" alt="flakeheaven disable import checks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We assume that the developer who writes the code and imports the libraries is responsible for writing reliable tests. If a test does not pass, something is wrong with the import or with the code (logic) itself. Thus, the import check is not something we want to delegate to the linter.&lt;/p&gt;

&lt;p&gt;Alternatively, this check can be disabled per line by appending &lt;code&gt;# noqa: E0401&lt;/code&gt; after the import statement: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: E0401
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.dummy_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DummyOperator&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: E0401
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Tweaks for airflow code
&lt;/h4&gt;

&lt;p&gt;There are also a few tweaks needed for Airflow DAG code. Here is a dummy example, &lt;code&gt;dummy.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fpython-airflow-tasks-order.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fpython-airflow-tasks-order.png%3Fraw%3Dtrue" alt="python dummy DAG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we run &lt;code&gt;flakeheaven&lt;/code&gt; with the default configuration we would see the following error:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint .                                                       

dags/dummy.py
    17:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; dummy_operator_2
  ^
    18:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; dummy_operator_3
  ^
    19:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; [dummy_operator_4, dummy_operator_5, dummy_operator_6, dummy_operator_7]
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, we want to keep each task on its own line, hence we need to disable &lt;code&gt;W503&lt;/code&gt; from pycodestyle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-diable-line-break-W503.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-diable-line-break-W503.png%3Fraw%3Dtrue" alt="disable W503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, with the default configuration we would get the following warning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint .                                                       

dags/dummy.py
    15:   5 W0104 Statement seems to have no effect (pointless-statement) [pylint]
  (
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is about how we specify task order. The workaround here is to exclude &lt;code&gt;W0104&lt;/code&gt; from pylint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-statement-no-effect-W0104.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-statement-no-effect-W0104.png%3Fraw%3Dtrue" alt="disable W0104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More info about the rules can be found on the &lt;a href="https://www.flake8rules.com/" rel="noopener noreferrer"&gt;flake8 rules page&lt;/a&gt;.&lt;/p&gt;
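&lt;p&gt;Putting the tweaks above together, the relevant part of &lt;code&gt;pyproject.toml&lt;/code&gt; might look roughly like this (a sketch; the exact plugin lists depend on your setup and versions, see the demo repo for the full file):&lt;/p&gt;

```toml
[tool.flakeheaven.plugins]
# enable everything, then drop the checks discussed above
pycodestyle = ["+*", "-W503"]        # allow line breaks before binary operators
pyflakes = ["+*"]
pylint = ["+*",
          "-C0114", "-C0116",        # no mandatory module/function docstrings
          "-E0401",                  # imports are verified by tests, not the linter
          "-W0104"]                  # Airflow task-order statements
```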

&lt;h2&gt;
  
  
  GitHub workflow actions CI/CD configurations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: author assumes you are familiar with &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub actions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We configure GitHub Workflow to be triggered on every PR against the main (master) branch.&lt;/p&gt;

&lt;p&gt;The whole &lt;code&gt;py_linter.yml&lt;/code&gt; config can be found in a &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration/blob/main/.github/workflows/py_linter.yml" rel="noopener noreferrer"&gt;demo repo&lt;/a&gt;. I will walk you through it step by step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-full-v3.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-full-v3.png%3Fraw%3Dtrue" alt="py_linter.yml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When to run it
&lt;/h3&gt;

&lt;p&gt;We are interested in running the linter only when a PR touches &lt;code&gt;.py&lt;/code&gt; files. For instance, when we update &lt;code&gt;README.md&lt;/code&gt;, there is no point in running a Python linter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-py-push-pr.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-py-push-pr.png%3Fraw%3Dtrue" alt="configure run workflow on PRs and push"&gt;&lt;/a&gt;&lt;/p&gt;
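&lt;p&gt;The trigger section of the workflow might be sketched like this (branch names are assumptions; the exact file lives in the demo repo):&lt;/p&gt;

```yaml
on:
  pull_request:
    branches: [main, master]
    paths:
      - "**.py"   # run the workflow only when Python files change
```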

&lt;h3&gt;
  
  
  What files does it run against
&lt;/h3&gt;

&lt;p&gt;We are interested in running the linter only against the modified files. For example, in the provided repo, if I update &lt;code&gt;dags/dummy.py&lt;/code&gt;, I don't want to waste time and resources running the linter against &lt;code&gt;main.py&lt;/code&gt;. For this purpose we use the &lt;a href="https://github.com/dorny/paths-filter" rel="noopener noreferrer"&gt;Paths Filter GitHub Action&lt;/a&gt;, which is very flexible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-v2.png%3Fraw%3Dtrue" alt="Paths Filter GitHub Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we have modified a &lt;code&gt;.py&lt;/code&gt; file and any other files such as &lt;code&gt;.toml&lt;/code&gt; in one PR, we don't want to run a linter against the non-python files, so we configure filtering only for &lt;code&gt;.py&lt;/code&gt; files no matter the location: root, tests, src, etc.&lt;/p&gt;

&lt;p&gt;A changed file can have one of the following statuses: &lt;code&gt;added&lt;/code&gt;, &lt;code&gt;modified&lt;/code&gt;, or &lt;code&gt;deleted&lt;/code&gt;. There is no reason to run the linter against deleted files: the workflow would simply fail, because those files are no longer in the repo. So we need to configure which changes should trigger the linter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-modified-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-modified-v2.png%3Fraw%3Dtrue" alt="added|modified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I define a variable that holds the output (only the &lt;code&gt;.py&lt;/code&gt; files) from the previous filter. This variable contains the modified &lt;code&gt;.py&lt;/code&gt; files, which I can then pass to &lt;code&gt;flakeheaven&lt;/code&gt;, &lt;code&gt;black&lt;/code&gt;, and &lt;code&gt;isort&lt;/code&gt;. By default, the output is disabled, and "Paths Changes Filter" allows you to customize it: you can list the files in &lt;code&gt;.csv&lt;/code&gt;, &lt;code&gt;.json&lt;/code&gt;, or in &lt;code&gt;shell&lt;/code&gt; mode. Linters accept files separated by spaces, so our choice here is &lt;code&gt;shell&lt;/code&gt; mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-list-shell.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-list-shell.png%3Fraw%3Dtrue" alt="list files shell"&gt;&lt;/a&gt;&lt;/p&gt;
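&lt;p&gt;The filter step might be sketched like this (the filter name matches the &lt;code&gt;py_scripts_filter&lt;/code&gt; output referenced later; treat it as an assumption rather than a verbatim copy of our config):&lt;/p&gt;

```yaml
- uses: dorny/paths-filter@v2
  id: filter
  with:
    # consider only added or modified .py files, anywhere in the repo
    filters: |
      py_scripts_filter:
        - added|modified: "**.py"
    # emit matching files space-separated, ready for linter CLIs
    list-files: shell
```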

&lt;h3&gt;
  
  
  Run linter itself
&lt;/h3&gt;

&lt;p&gt;The next and last step is to run the linter itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-step.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-step.png%3Fraw%3Dtrue" alt="run linter step"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we run the linter on the changed files, we check whether any &lt;code&gt;.py&lt;/code&gt; files actually came out of the previous step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-check-for-changes.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-check-for-changes.png%3Fraw%3Dtrue" alt="check if there are .py files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, using the aforementioned output variable, we can safely pass the content of &lt;code&gt;steps.filter.outputs.py_scripts_filter_files&lt;/code&gt; to the linters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-commands.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-commands.png%3Fraw%3Dtrue" alt="linter commands"&gt;&lt;/a&gt;&lt;/p&gt;
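&lt;p&gt;The resulting step might be sketched like this (a hedged example built from the filter outputs above; the exact commands live in the demo repo):&lt;/p&gt;

```yaml
- name: Run linters on changed files
  # skip entirely when the PR contains no added/modified .py files
  if: steps.filter.outputs.py_scripts_filter == 'true'
  run: |
    python -m flakeheaven lint ${{ steps.filter.outputs.py_scripts_filter_files }}
    python -m black --check ${{ steps.filter.outputs.py_scripts_filter_files }}
    python -m isort --check-only ${{ steps.filter.outputs.py_scripts_filter_files }}
```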

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's all I would like to share. I hope it is useful for you, and that you can utilize this experience and knowledge. &lt;/p&gt;

&lt;p&gt;I wish you to see these successful checks every time you push your code :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Flinter-success.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Flinter-success.png%3Fraw%3Dtrue" alt="success linter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions feel free to ask in a comment section, I will do my best to provide a comprehensive answer for you. &lt;/p&gt;

&lt;p&gt;Question to you: do you have linter checks as a part of your CI/CD?&lt;/p&gt;

&lt;h2&gt;
  
  
  Contact
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>python</category>
      <category>github</category>
      <category>cicd</category>
      <category>linter</category>
    </item>
  </channel>
</rss>
