<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: iamtodor</title>
    <description>The latest articles on Forem by iamtodor (@iamtodor).</description>
    <link>https://forem.com/iamtodor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F921293%2F7574d5de-3e07-40c9-b249-101aefad60fe.jpeg</url>
      <title>Forem: iamtodor</title>
      <link>https://forem.com/iamtodor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iamtodor"/>
    <language>en</language>
    <item>
      <title>Manipulating Complex Structures in BigQuery: A Guide to DDL Operations</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Fri, 26 May 2023 07:41:17 +0000</pubDate>
      <link>https://forem.com/freshbooks/manipulating-complex-structures-in-bigquery-a-guide-to-ddl-operations-dhm</link>
      <guid>https://forem.com/freshbooks/manipulating-complex-structures-in-bigquery-a-guide-to-ddl-operations-dhm</guid>
      <description>&lt;p&gt;This guide aims to provide a comprehensive understanding of handling changes in complex structures within BigQuery using Data Definition Language (DDL) statements. It explores scenarios involving top-level columns as well as nested columns, addressing limitations with the existing &lt;code&gt;on_schema_change&lt;/code&gt; configuration in dbt for BigQuery.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
How-to

&lt;ul&gt;
&lt;li&gt;Define schema&lt;/li&gt;
&lt;li&gt;Add records&lt;/li&gt;
&lt;li&gt;Add top-level field&lt;/li&gt;
&lt;li&gt;
Change top-level field type

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CAST&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER COLUMN&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Juggle with STRUCT

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tmp table&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;update STRUCT using SET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Bonus notes

&lt;ul&gt;
&lt;li&gt;Create a regular table&lt;/li&gt;
&lt;li&gt;Create an external table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Contact info&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Currently, dbt's &lt;code&gt;on_schema_change&lt;/code&gt; configuration only tracks schema changes related to top-level columns in BigQuery. Nested column changes, such as adding, removing, or modifying a &lt;code&gt;STRUCT&lt;/code&gt;, are not captured. This guide delves into extending the functionality of &lt;code&gt;on_schema_change&lt;/code&gt; to encompass nested columns, enabling a more comprehensive schema change tracking mechanism. I'll show further on exactly what &lt;code&gt;top-level&lt;/code&gt; and &lt;code&gt;nested&lt;/code&gt; columns are.&lt;/p&gt;

&lt;p&gt;Moreover, it's important to note that BigQuery explicitly states on their &lt;a href="https://cloud.google.com/bigquery/docs/managing-table-schemas#add_a_nested_column_to_a_record_column" rel="noopener noreferrer"&gt;Add a nested column to a RECORD column page&lt;/a&gt; that adding a new nested field to an existing RECORD column using a SQL DDL statement is not supported:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Adding a new nested field to an existing RECORD column by using a SQL DDL statement is not supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When it comes to dropping one or more columns, the &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_table_drop_column_statement" rel="noopener noreferrer"&gt;ALTER TABLE DROP COLUMN statement&lt;/a&gt; page states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot use this statement to drop the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitioned columns&lt;/li&gt;
&lt;li&gt;Clustered columns&lt;/li&gt;
&lt;li&gt;Nested columns inside existing RECORD fields&lt;/li&gt;
&lt;li&gt;Columns in a table that has row access policies&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is an ongoing proposal with a discussion around this topic:&lt;br&gt;
&lt;a href="https://github.com/dbt-labs/dbt-bigquery/issues/446"&gt;on_schema_change should handle non-top-level schema changes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This limitation further highlights the need for alternative approaches to manipulate complex structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How-to
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Define schema
&lt;/h3&gt;

&lt;p&gt;To begin, let's dive into the SQL syntax and create the "person" table. This table will store information about individuals, including their ID, name, and address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
        &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; 
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Add records
&lt;/h3&gt;

&lt;p&gt;Add a couple of records to the table.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"USA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"New-York"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Jennifer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Canada"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Toronto"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is how the schema looks in the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stdl5uw34qnnj1bg0bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stdl5uw34qnnj1bg0bt.png" alt="table schema"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how the data is represented when querying it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzu6olu1voyc8u6vgpzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzu6olu1voyc8u6vgpzn.png" alt="data representation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add top-level field
&lt;/h3&gt;

&lt;p&gt;Imagine we were tasked to add a new field &lt;code&gt;has_car&lt;/code&gt; of type &lt;code&gt;INT64&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;
    &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- add record right away&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"James"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"USA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"New-York"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When you add a new column to an existing BigQuery table, the past records will have null values for that newly added column. This behavior is expected because the new column was not present at the time those records were inserted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zb0gso60q6abs3m0k57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zb0gso60q6abs3m0k57.png" alt="null value for past records"&gt;&lt;/a&gt;&lt;/p&gt;
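&lt;p&gt;Old records don't have to stay &lt;code&gt;NULL&lt;/code&gt;, though. A follow-up &lt;code&gt;UPDATE&lt;/code&gt; can backfill the new column for past records; this is only a sketch, and the default value &lt;code&gt;0&lt;/code&gt; is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

-- backfill has_car for the records inserted before the column existed
UPDATE
    dataset_name.person
SET
    has_car = 0
WHERE
    has_car IS NULL;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;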

&lt;h3&gt;
  
  
  Change top-level field type
&lt;/h3&gt;

&lt;p&gt;Then your customer changes their mind, and now the &lt;code&gt;has_car&lt;/code&gt; column has to have a &lt;code&gt;BOOL&lt;/code&gt; type instead of &lt;code&gt;INT64&lt;/code&gt;. Here are two possible ways to tackle this task.&lt;/p&gt;

&lt;p&gt;Before diving deep into the possible approaches, it's worth mentioning that BigQuery has &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules" rel="noopener noreferrer"&gt;conversion rules&lt;/a&gt; that you need to consider. For instance, you can cast &lt;code&gt;BOOL&lt;/code&gt; to &lt;code&gt;INT64&lt;/code&gt;, but you cannot cast &lt;code&gt;INT64&lt;/code&gt; to &lt;code&gt;DATETIME&lt;/code&gt;.&lt;/p&gt;
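&lt;p&gt;A quick way to check a conversion rule is to try it in a standalone query. The following is a sketch; &lt;code&gt;SAFE_CAST&lt;/code&gt; returns &lt;code&gt;NULL&lt;/code&gt; instead of raising an error when a conversion fails:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

SELECT
    CAST(TRUE AS INT64) AS bool_as_int,        -- allowed conversion: returns 1
    SAFE_CAST("oops" AS INT64) AS failed_cast  -- returns NULL instead of an error


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;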

&lt;p&gt;In BigQuery, &lt;code&gt;CAST&lt;/code&gt; and &lt;code&gt;ALTER COLUMN&lt;/code&gt; are two different approaches for modifying the data type of a column in a table.&lt;br&gt;
Let's explore each approach:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CAST&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;CAST()&lt;/code&gt; function is used to convert the data type of a column or an expression in a SQL query. It allows you to convert a column from one data type to another during the query execution. However, it does not permanently modify the data type of the column in the table's schema.&lt;/p&gt;

&lt;p&gt;The following is an example of using &lt;code&gt;CAST&lt;/code&gt; to convert a column's data type in a query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;has_car&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  &lt;code&gt;ALTER COLUMN&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ALTER COLUMN&lt;/code&gt; statement is used to modify the data type of a column in the table's schema. It allows you to permanently change the data type of a column in the table, affecting all existing and future data in that column.&lt;/p&gt;

&lt;p&gt;Here's an example of using &lt;code&gt;ALTER COLUMN&lt;/code&gt; to modify the data type of a column:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It's important to note that &lt;code&gt;ALTER COLUMN&lt;/code&gt; is a DDL statement and can only be executed as a separate operation outside of a regular SQL query. Once the column's data type is altered, it will affect all future operations and queries performed on that table.&lt;/p&gt;

&lt;p&gt;In summary, &lt;code&gt;CAST&lt;/code&gt; is used to convert the data type of a column during query execution, while &lt;code&gt;ALTER COLUMN&lt;/code&gt; is used to permanently modify the data type of a column in the table's schema. The choice between the two depends on whether you want to temporarily convert the data type for a specific query or permanently change the data type for the column in the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Juggle with STRUCT
&lt;/h3&gt;

&lt;p&gt;If we want to apply changes to nested fields, such as adding, removing, or modifying the &lt;code&gt;STRUCT&lt;/code&gt; itself, there are a few different ways to do so.&lt;/p&gt;

&lt;h4&gt;
  
  
  temp table
&lt;/h4&gt;

&lt;p&gt;The first, quite simple approach is using a &lt;code&gt;temp&lt;/code&gt; table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
        &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- fill then new zip_code field with the default 0 value&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zip_code&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`person_past`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`dataset_name.person_tmp`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`person`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_tmp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, this approach has some drawbacks and considerations to keep in mind: when modifying a BigQuery table using a temporary table, you need to create a new table with the desired modifications and then copy the data from the original table to the temporary table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This process involves duplicating the data, which increases storage usage and leads to additional storage costs; it also consumes additional query processing resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It may also impact performance, especially for large tables, as the production resources you have are limited and shared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity and consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using a temporary table to modify a BigQuery table introduces additional steps and complexity to the process. You need to write queries to create the temporary table, copy data, modify the data, overwrite the original table, and then drop the temporary table. This adds complexity to the overall workflow and may require more code and query execution time.&lt;/p&gt;

&lt;p&gt;Last, but not least, during the modification process, there might be a period where the original table is not accessible or is in an inconsistent state. If other processes or applications depend on the original table's data, this downtime or inconsistency could impact their operations.&lt;/p&gt;

&lt;p&gt;So this is not the best approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  update STRUCT using SET
&lt;/h4&gt;

&lt;p&gt;Another scenario is changing a nested field's type. Imagine we would like to change the &lt;code&gt;zip_code&lt;/code&gt; type from &lt;code&gt;INT64&lt;/code&gt; to &lt;code&gt;STRING&lt;/code&gt;, and this time we don't want to use the &lt;code&gt;tmp&lt;/code&gt; table approach. The second way is to &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#update_statement" rel="noopener noreferrer"&gt;UPDATE&lt;/a&gt; the &lt;code&gt;STRUCT&lt;/code&gt; using &lt;code&gt;SET&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;
    &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;address_past&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address_new&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;address_past&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this case, only the &lt;code&gt;STRUCT&lt;/code&gt; field is duplicated, which is good enough.&lt;/p&gt;

&lt;h4&gt;
  
  
  CREATE OR REPLACE TABLE
&lt;/h4&gt;

&lt;p&gt;The last approach is using &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement" rel="noopener noreferrer"&gt;&lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;has_car&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the same way, we can remove nested fields: we simply select the needed fields and omit the ones we aren't interested in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;SELECT&lt;/span&gt;
                &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
            &lt;span class="k"&gt;EXCEPT&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zip_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`dataset_name.person`&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Bonus notes
&lt;/h2&gt;

&lt;p&gt;If you need to recreate a table schema from a separate dataset in your own dataset, the easiest way is to use CLI commands, as they are much faster and less error-prone for creating tables.&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a regular table
&lt;/h3&gt;

&lt;p&gt;Here is an example of how to save a table schema, referenced by its table ID, to JSON format with the &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_show" rel="noopener noreferrer"&gt;bq show&lt;/a&gt; command:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq show \
    --schema \
    --format=prettyjson \
    project_name:dataset_name.table_name &amp;gt; table_name.json


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And now you can create a table in your dataset using &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_mk" rel="noopener noreferrer"&gt;bq mk&lt;/a&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mk \
    --table \
    your_dataset_name.table_name \
    table_name.json


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create an external table
&lt;/h3&gt;

&lt;p&gt;Here is an example of creating a table definition in JSON format using &lt;a href="https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_mkdef" rel="noopener noreferrer"&gt;bq mkdef&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mkdef \
    --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect=false \
    'gs://bucket_name/prefix/*.json' &amp;gt; table_def


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;mkdef&lt;/code&gt; command creates a table definition in JSON format for data stored in Cloud Storage or Google Drive. That definition will then be used to create an external table.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

bq mk \
    --table \
    --external_table_definition=table_def \
    dataset_name.table_name \
    table_name.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
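&lt;p&gt;If you prefer to stay in Python, a table definition file like the one &lt;code&gt;bq mkdef&lt;/code&gt; writes can be sketched as plain JSON. This is only a sketch under assumptions: the field names follow the BigQuery REST API's &lt;code&gt;ExternalDataConfiguration&lt;/code&gt;, and the bucket and prefix are placeholders:&lt;/p&gt;

```python
import json

# Sketch: build a table definition equivalent to the `bq mkdef` output above.
# Field names follow ExternalDataConfiguration in the BigQuery REST API;
# the bucket/prefix below are placeholders, not real paths.
table_def = {
    "sourceFormat": "NEWLINE_DELIMITED_JSON",
    "autodetect": False,
    "sourceUris": ["gs://bucket_name/prefix/*.json"],
}

with open("table_def", "w") as f:
    json.dump(table_def, f, indent=2)
```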
&lt;h2&gt;
  
  
  Contact info
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>bigquery</category>
    </item>
    <item>
      <title>Python's BigQuery External CSV tables and null_marker challenge</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Mon, 22 May 2023 13:38:03 +0000</pubDate>
      <link>https://forem.com/iamtodor/pythons-bigquery-external-cvs-tables-and-nullmarker-challenge-on4</link>
      <guid>https://forem.com/iamtodor/pythons-bigquery-external-cvs-tables-and-nullmarker-challenge-on4</guid>
      <description>&lt;p&gt;Testing data pipelines is crucial for ensuring their efficiency and reliability. When dealing with data sources like BigQuery external tables, it becomes necessary to emulate these data sources to perform thorough pipeline testing.&lt;/p&gt;

&lt;p&gt;So the idea is to create an external CSV table using Python's BigQuery client library.&lt;/p&gt;

&lt;p&gt;For additional context, we treat an empty value as &lt;code&gt;""&lt;/code&gt; and &lt;code&gt;null&lt;/code&gt; as &lt;code&gt;\N&lt;/code&gt;.&lt;/p&gt;
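&lt;p&gt;To make this convention concrete, here is a small illustration (my own, not part of the actual pipeline) of why the distinction matters: Python's &lt;code&gt;csv&lt;/code&gt; module returns an empty string for an empty field and the literal two-character text &lt;code&gt;\N&lt;/code&gt; for the marker, so without a &lt;code&gt;null_marker&lt;/code&gt; option you have to map the marker to &lt;code&gt;None&lt;/code&gt; yourself:&lt;/p&gt;

```python
import csv
import io

# Sample data following the convention: "" means empty string, \N means null.
data = 'id,name\n1,""\n2,\\N\n'

rows = list(csv.reader(io.StringIO(data)))[1:]  # skip the header row
# The reader hands back the marker as the literal text \N, not as a null,
# so we map it to None ourselves.
cleaned = [[None if value == "\\N" else value for value in row] for row in rows]
print(cleaned)  # [['1', ''], ['2', None]]
```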

&lt;p&gt;Unfortunately, this library does not allow you to specify &lt;code&gt;null_marker&lt;/code&gt; for CSV files while creating an external CSV table using &lt;a href="https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.external_config.ExternalConfig"&gt;&lt;code&gt;ExternalConfig&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;null_marker&lt;/code&gt; option is not presented in &lt;a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#CsvOptions"&gt;&lt;code&gt;CsvOptions&lt;/code&gt;&lt;/a&gt;, in contrast to, for example, &lt;code&gt;skip_leading_rows&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;external_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExternalConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExternalSourceFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;external_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skip_leading_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;external_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;null_marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;N'&lt;/span&gt; &lt;span class="c1"&gt;# does not work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simultaneously, it's important to note that while a Data Definition Language (DDL) statement that includes the &lt;code&gt;null_marker&lt;/code&gt; option may work successfully when executed directly in the BigQuery console, it might not function as expected when submitted using the Python client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`field1`&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`field2`&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;uris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'gs://bucket_name/prefix_name/file_name.csv'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skip_leading_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;null_marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;N'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you execute this query in the BigQuery console, your table will be created successfully and ready for querying. However, if you attempt to submit the same query using the Python library, you will encounter the following exception:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;create_table_ddl&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BadRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; &lt;span class="n"&gt;Syntax&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Illegal&lt;/span&gt; &lt;span class="n"&gt;escape&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; \&lt;span class="n"&gt;N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when attempting to escape the &lt;code&gt;\\\\N&lt;/code&gt; sequence, it does not produce the expected outcome. The actual null values are not processed as intended; instead, they remain as the literal &lt;code&gt;\\N&lt;/code&gt; rather than being interpreted as null.&lt;/p&gt;
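&lt;p&gt;The escaping behavior can be reproduced without BigQuery at all. This small illustration (mine, not the library's code) shows what string each Python-side escaping actually puts into the DDL text:&lt;/p&gt;

```python
# What each Python escaping actually sends in the DDL text.
single = "null_marker = '\\N'"    # two characters after the quote: \N
double = "null_marker = '\\\\N'"  # three characters after the quote: \\N

print(single)  # null_marker = '\N'   -> rejected: Illegal escape sequence
print(double)  # null_marker = '\\N'  -> accepted, but matched literally, not as null
```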

&lt;p&gt;If you have insights or solutions regarding the challenge of handling null values in emulated tables, kindly share them in the comments section. This will help improve the guide and provide a comprehensive solution for others facing a similar issue.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>python</category>
      <category>externaltables</category>
    </item>
    <item>
      <title>The Practical Guide to Utilizing DBT Packages for Data Transformation</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Thu, 12 Jan 2023 10:18:36 +0000</pubDate>
      <link>https://forem.com/freshbooks/the-practical-guide-to-utilizing-dbt-packages-for-data-transformation-4138</link>
      <guid>https://forem.com/freshbooks/the-practical-guide-to-utilizing-dbt-packages-for-data-transformation-4138</guid>
      <description>&lt;h2&gt;
  
  
  Table of content
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What are packages&lt;/li&gt;
&lt;li&gt;Why use it&lt;/li&gt;
&lt;li&gt;Local packages&lt;/li&gt;
&lt;li&gt;Dbt hub packages&lt;/li&gt;
&lt;li&gt;Verify packages are installed&lt;/li&gt;
&lt;li&gt;Macros usage&lt;/li&gt;
&lt;li&gt;Models usage&lt;/li&gt;
&lt;li&gt;dbt_modules under the hood&lt;/li&gt;
&lt;li&gt;Disclaimer&lt;/li&gt;
&lt;li&gt;Contact&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are packages
&lt;/h2&gt;

&lt;p&gt;dbt packages are collections of macros, models, and other resources that are used to extend the functionality of dbt. Packages can be used to share common code and resources across multiple dbt projects, and can be published and installed from the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt;, from GitHub or can be stored locally and installed by specifying the path to the project.&lt;/p&gt;

&lt;p&gt;In the same way other ecosystems share code through libraries, dbt shares it through packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use it
&lt;/h2&gt;

&lt;p&gt;dbt packages are so powerful because many of the analytic problems we encounter are shared across organizations. &lt;/p&gt;

&lt;p&gt;There are a few general benefits to using packages in dbt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reusability: packages allow you to reuse code across multiple projects and models. This can save you a lot of time and effort, as you don't have to copy and paste the same code into multiple places.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Collaboration: packaging your models in a package allows multiple people to work on the same models at the same time. You can use version control systems like git to manage changes to the models, and use tools like the &lt;code&gt;dbt test&lt;/code&gt; command to ensure that the models are correct and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sharing: packaging your models or macros in a package allows you to share them with others. You can publish your package on the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; or on GitHub, and others can install and use your models in their own dbt projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Managing: packages make it easier to manage your codebase. You can use version control to track changes to your package, and you can easily install and update packages in your dbt project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modularity: packaging your models in a package allows you to break your data pipeline into smaller, more manageable pieces, which are easier to understand and maintain. This can streamline the development and upkeep of your dbt project over time. This is especially useful if you are working on a large project with many different models and transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Overall, using packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt.&lt;/p&gt;

&lt;p&gt;For example, if your aim is to extract the day of the week, there is no sense to reinvent the wheel and develop this macro on your own. Rather, we might want to find the right package and make use of it &lt;a href="https://github.com/calogica/dbt-date#day_of_weekdate-isoweektrue" rel="noopener noreferrer"&gt;&lt;code&gt;{{ dbt_date.day_of_week(column_name) }}&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
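&lt;p&gt;As a plain-Python illustration (mine, not dbt's implementation) of what such a day-of-week extraction computes, using ISO numbering where Monday is 1 and Sunday is 7:&lt;/p&gt;

```python
import datetime

# ISO day of week: Monday = 1 ... Sunday = 7.
d = datetime.date(2023, 1, 12)  # the publication date of this post, a Thursday
print(d.isoweekday())  # 4
```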

&lt;h2&gt;
  
  
  Local packages
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use local packages to organize and reuse code within a single dbt project. Local packages are stored within your project directory and are only available to the models in that project. The best use case for a local package is a module that you want to live in the same repository, alongside the main project.&lt;/p&gt;

&lt;p&gt;To create a reusable local package do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consider you have the following dbt project dir structure
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a &lt;code&gt;packages&lt;/code&gt; dir; this is where we will put our first local package.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a &lt;code&gt;packages.yml&lt;/code&gt; file; this is where we will reference our first local package.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch packages.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Before moving on, please verify that you have the following dir structure:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Jump into the &lt;code&gt;packages&lt;/code&gt; dir and init your package with the name &lt;code&gt;local_utils&lt;/code&gt;. The name of the package is arbitrary.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd packages
dbt init local_utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will create a package with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Next, you need to change the project &lt;code&gt;name&lt;/code&gt; in &lt;code&gt;dbt_project.yml&lt;/code&gt; from &lt;code&gt;my_new_project&lt;/code&gt; to a meaningful and self-explainable name. This name will be used further as macro or model references. Let's call it &lt;code&gt;local_utils&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Specify our local package in the before-mentioned &lt;code&gt;packages.yml&lt;/code&gt; as follows:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/dbt/packages/local_utils/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you provide an absolute path to the package; otherwise, it will not work.&lt;/p&gt;
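&lt;p&gt;A tiny illustrative guard (my own sketch; the path is the example from above, and the YAML parsing is elided) that fails fast when a local package path is not absolute:&lt;/p&gt;

```python
import os

# Illustrative guard: local package paths listed under `local:` in packages.yml
# (here hard-coded with the example path from above) must be absolute.
local_package_paths = ["/opt/dbt/packages/local_utils/"]

for path in local_package_paths:
    assert os.path.isabs(path), f"local package path must be absolute: {path}"
print("all local package paths are absolute")
```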

&lt;p&gt;Save the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command to install the package. This will link the package and make it available to your dbt models.&lt;/p&gt;

&lt;p&gt;Here is an example of what the &lt;code&gt;dbt deps&lt;/code&gt; command might look like when you install your local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt deps
Running with dbt=0.21.1
Installing /opt/dbt/packages/local_utils/
  Installed from &amp;lt;local @ /opt/dbt/packages/local_utils/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can observe a newly created &lt;code&gt;dbt_modules&lt;/code&gt; dir that contains a symlink pointing to &lt;code&gt;local_utils&lt;/code&gt;. This means that our local package is ready to be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree dbt_modules
dbt_modules
└── utils -&amp;gt; /opt/dbt/packages/local_utils/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dbt hub packages
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use packages from the dbt Hub to share your code with others and to reuse code from other users in your own projects. The &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; is a community-driven library of packages that you can use to extend the functionality of dbt.&lt;/p&gt;

&lt;p&gt;Probably the best examples of community-driven third-party packages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/" rel="noopener noreferrer"&gt;dbt_utils&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" rel="noopener noreferrer"&gt;dbt_expectations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few benefits to using dbt Hub packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reusable code: dbt Hub packages allow you to reuse code that has been shared by other users, teams and companies. This can save you a lot of time and effort, as you don't have to write the same logic from scratch, test and maintain it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The major advantages of any open-source project:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Community support: When you use packages from the dbt Hub, you can benefit from the support and expertise of the dbt community. If you have questions or run into issues with a package, you can ask for help on the dbt community forums or Slack channel.&lt;/li&gt;
&lt;li&gt;Collaboration: By sharing your own packages on the dbt Hub, you can make your code available to other users. This can help to foster collaboration and improve the overall quality of your code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, using dbt Hub packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt, and to collaborate with others in the dbt community.&lt;/p&gt;

&lt;p&gt;To install a package from the dbt Hub in your dbt project, you will need to add the package to your &lt;code&gt;packages.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here is the basic process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;a href="https://hub.getdbt.com/" rel="noopener noreferrer"&gt;dbt Hub&lt;/a&gt; and search for the package you want to install.&lt;/li&gt;
&lt;li&gt;Click on the package to view its details.&lt;/li&gt;
&lt;li&gt;Copy the package name and version from the installation instructions.&lt;/li&gt;
&lt;li&gt;Open your &lt;code&gt;packages.yml&lt;/code&gt; file and add the package name and version to the packages list. It should look something like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.7.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command to install the package. This will download the package and make it available to your dbt models.&lt;/p&gt;

&lt;p&gt;Here is an example of what the &lt;code&gt;dbt deps&lt;/code&gt; command might look like when you install dbt hub package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt deps
Installing dbt-labs/dbt_utils@0.7.6
  Installed from version 0.7.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike the local package, the hub package was physically downloaded into the &lt;code&gt;dbt_modules&lt;/code&gt; dir.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 2 dbt_modules
dbt_modules
└── dbt_utils
    ├── CHANGELOG.md
    ├── LICENSE
    ├── README.md
    ├── RELEASE.md
    ├── dbt_project.yml
    ├── docker-compose.yml
    ├── etc
    ├── integration_tests
    ├── macros
    └── run_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Verify packages are installed
&lt;/h2&gt;

&lt;p&gt;To verify that a package is installed in your dbt project, you can check the &lt;code&gt;packages.yml&lt;/code&gt; file and run the &lt;code&gt;dbt deps&lt;/code&gt; command.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check the &lt;code&gt;packages.yml&lt;/code&gt; file: This file lists all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed in the packages list, then it is installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the &lt;code&gt;dbt deps&lt;/code&gt; command:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This command will show you a list of all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed, then it is installed.&lt;/li&gt;
&lt;li&gt;In the root dbt project dir, you will observe a new dir &lt;code&gt;dbt_modules/&lt;/code&gt; which contains the installed packages, ready to be used. &lt;strong&gt;NOTE&lt;/strong&gt;: the &lt;code&gt;dbt_modules/&lt;/code&gt; dir has to be added to &lt;code&gt;.gitignore&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your &lt;code&gt;packages.yml&lt;/code&gt; file contains a package that is not installed, then you will not be able to run any dbt command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; dbt list
Encountered an error:
Compilation Error
  dbt found 1 package(s) specified in packages.yml, but only 0 package(s) installed in dbt_modules. Run dbt deps to install package dependencies.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So this is our guarantee that we will not have any package-installation issues at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Macros usage
&lt;/h2&gt;

&lt;p&gt;In dbt, you can use packages to define custom macros that can be called from your dbt models. Here is an example of how you might use a package to define a custom macro.&lt;/p&gt;

&lt;p&gt;Here are a few examples of how you might use macros in dbt to perform common data transformations.&lt;/p&gt;

&lt;p&gt;For instance, let's create the following macro in our local package under &lt;code&gt;local_utils/macros/cents_to_dollars.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;cents_to_dollars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;({{&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt; &lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we can call our macro as &lt;code&gt;{{ local_utils.cents_to_dollars(your_column_name) }}&lt;/code&gt;. The &lt;code&gt;local_utils&lt;/code&gt; package name comes from the &lt;code&gt;name&lt;/code&gt; in our package's &lt;code&gt;dbt_project.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Using a macro from a dbt Hub package is pretty much the same. Imagine we want to generate a surrogate key based on a few columns. This is functionality the &lt;code&gt;dbt-utils&lt;/code&gt; package we previously installed provides: &lt;a href="https://github.com/dbt-labs/dbt-utils#generate_surrogate_key-source" rel="noopener noreferrer"&gt;&lt;code&gt;{{ dbt_utils.generate_surrogate_key([field_a, field_b[,...]]) }}&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the macro usage pattern for a third-party package is &lt;code&gt;{{ package_name.macro_name() }}&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models usage
&lt;/h2&gt;

&lt;p&gt;As we created our own local package &lt;code&gt;local_utils&lt;/code&gt; as a prerequisite, which has the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree packages/local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important function in dbt is &lt;code&gt;ref()&lt;/code&gt;; it's impossible to build even moderately complex models without it. &lt;code&gt;ref()&lt;/code&gt; is how you reference one model from another inside your package. Here is how this looks in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"model_a"&lt;/span&gt;&lt;span class="p"&gt;)}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So if we would like to reference &lt;code&gt;my_first_dbt_model&lt;/code&gt; from &lt;code&gt;my_second_dbt_model&lt;/code&gt; within &lt;code&gt;local_utils&lt;/code&gt; package then we do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"my_first_dbt_model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to reference &lt;code&gt;my_first_dbt_model&lt;/code&gt; from our main project then we need to slightly change the way we call it. There is also a two-argument variant of the &lt;code&gt;ref&lt;/code&gt; function. With this variant, you can pass both a package name and model name to &lt;code&gt;ref&lt;/code&gt; to avoid ambiguity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"package_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our particular case it would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"local_utils"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"my_first_dbt_model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The &lt;code&gt;package_name&lt;/code&gt; should only include the name of the package, not the maintainer. For example, if we use the &lt;code&gt;dbt-labs/dbt_utils&lt;/code&gt; package, type &lt;code&gt;dbt_utils&lt;/code&gt; in that argument, not &lt;code&gt;dbt-labs/dbt_utils&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  dbt_modules under the hood
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;dbt_modules&lt;/code&gt; directory is a directory that is used by dbt to store packages and their models. When you install a package using the &lt;code&gt;dbt deps&lt;/code&gt; command, the package and its models are downloaded and stored in the &lt;code&gt;dbt_modules&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dbt_modules&lt;/code&gt; directory is located in the root directory of your dbt project. It contains a subdirectory for each installed package, and each package directory contains the package's models, macros, and other resources.&lt;/p&gt;

&lt;p&gt;The way dbt installs &lt;code&gt;local&lt;/code&gt; and &lt;code&gt;dbt hub&lt;/code&gt; packages is different.&lt;/p&gt;

&lt;p&gt;Assume the following &lt;code&gt;packages.yml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-labs/dbt_utils&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.7.6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/dbt/packages/local_utils/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would get the following layout under the generated &lt;code&gt;dbt_modules&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; tree -L 2 dbt_modules
dbt_modules
├── dbt_utils
│   ├── CHANGELOG.md
│   ├── LICENSE
│   ├── README.md
│   ├── RELEASE.md
│   ├── dbt_project.yml
│   ├── docker-compose.yml
│   ├── etc
│   ├── integration_tests
│   ├── macros
│   └── run_test.sh
└── utils -&amp;gt; /opt/dbt/packages/local_utils/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned above, dbt creates a symlink for a local package, while for a dbt hub package it simply copies all the needed files into a folder of the same name.&lt;/p&gt;

&lt;p&gt;We use Google Cloud Composer to orchestrate all the transformation jobs. We basically copy our project to a GCS bucket with &lt;code&gt;gsutil -m rsync&lt;/code&gt;. Unfortunately, it does not support symbolic links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Since gsutil rsync is intended to support data operations (like moving a data set to the cloud for computational processing) and it needs to be compatible both in the cloud and across common operating systems, there are no plans for gsutil rsync to support operating system-specific file types like symlinks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Taken from the &lt;code&gt;gsutil rsync&lt;/code&gt; documentation: &lt;a href="https://cloud.google.com/storage/docs/gsutil/commands/rsync#be-careful-when-synchronizing-over-os-specific-file-types-symlinks,-devices,-etc" rel="noopener noreferrer"&gt;Be Careful When Synchronizing Over OS-Specific File Types (Symlinks, Devices, Etc.)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A possible solution is to compress everything locally into an archive, copy it to the bucket, and then unpack it on Composer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar -czvf dbt-project.tar.gz dbt-project
gsutil -m rsync dbt-project.tar.gz gs://$BUCKET/prefix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what those switches actually mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-c: Create an archive.
-z: Compress the archive with gzip.
-v: Display progress in the terminal while creating the archive, also known as “verbose” mode. The v is always optional in these commands, but it’s helpful.
-f: Allows you to specify the filename of the archive.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
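&lt;p&gt;One caveat worth noting: by default &lt;code&gt;tar&lt;/code&gt; stores a symlink as a symlink, so the local package behind &lt;code&gt;dbt_modules&lt;/code&gt; would still be missing after unpacking. Adding &lt;code&gt;-h&lt;/code&gt; (&lt;code&gt;--dereference&lt;/code&gt;) makes tar archive the files the link points to instead. A minimal sketch, with purely illustrative paths:&lt;/p&gt;

```shell
set -e
# Build a fake project containing a symlinked local package (illustrative paths)
rm -rf /tmp/pkg_demo
mkdir -p /tmp/pkg_demo/real /tmp/pkg_demo/project
echo "select 1" > /tmp/pkg_demo/real/model.sql
ln -s /tmp/pkg_demo/real /tmp/pkg_demo/project/utils

# -h (--dereference) replaces the symlink with the files it points to
tar -C /tmp/pkg_demo -czhf /tmp/pkg_demo/project.tar.gz project

# After unpacking, utils is a real directory with the package files inside
mkdir -p /tmp/pkg_demo/unpacked
tar -C /tmp/pkg_demo/unpacked -xzf /tmp/pkg_demo/project.tar.gz
ls -l /tmp/pkg_demo/unpacked/project/utils
```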



&lt;p&gt;These are the pitfalls we encountered when working with dbt packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;All this experience applies to dbt v0.21.1&lt;/li&gt;
&lt;li&gt;I am aware that since v1.0 the default directory name &lt;a href="https://docs.getdbt.com/guides/migration/versions/upgrading-to-v1.0#breaking-changes" rel="noopener noreferrer"&gt;changed&lt;/a&gt; from &lt;code&gt;dbt_modules&lt;/code&gt; to &lt;code&gt;dbt_packages&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Most of this guide should still apply to the latest dbt version&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Contact
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>gratitude</category>
      <category>learning</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Configuring python linting to be part of CI/CD using GitHub actions</title>
      <dc:creator>iamtodor</dc:creator>
      <pubDate>Thu, 15 Sep 2022 06:44:07 +0000</pubDate>
      <link>https://forem.com/freshbooks/configuring-python-linting-to-be-part-of-cicd-using-github-actions-1731</link>
      <guid>https://forem.com/freshbooks/configuring-python-linting-to-be-part-of-cicd-using-github-actions-1731</guid>
      <description>&lt;p&gt;Hello everyone, I am a DataOps Engineer at &lt;a href="https://www.freshbooks.com/" rel="noopener noreferrer"&gt;FreshBooks&lt;/a&gt;. In this article I would like to share my experience on configuration best practices for GitHub actions pipelines for linting.&lt;/p&gt;

&lt;p&gt;Freshbooks DataOps team has a linter configuration that developers can run before submitting a PR. We had an idea to integrate lint checks into our regular CI/CD pipeline. This adoption would eliminate potential errors, bugs, stylistic errors. We will basically enforce the common code style across the team.&lt;/p&gt;

&lt;p&gt;FreshBooks uses GitHub as a home for our code base, so we would like to use it as much as possible. Recently I finished this configuration so the linter and its checks are now part of a GitHub actions CI/CD workflow.&lt;/p&gt;

&lt;p&gt;This article has two major parts: the first one is linter configuration, and the second one is GitHub workflow configuration itself. Feel free to read all the parts, or skip some and jump into specific one you are interested in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Linters configuration

&lt;ul&gt;
&lt;li&gt;Disable unwanted checks&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Import error&lt;/li&gt;
&lt;li&gt;Tweaks for airflow code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

GitHub workflow actions CI/CD configurations

&lt;ul&gt;
&lt;li&gt;When to run it&lt;/li&gt;
&lt;li&gt;What files does it run against&lt;/li&gt;
&lt;li&gt;Run linter itself&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Contact&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Linters configuration
&lt;/h2&gt;

&lt;p&gt;Here are the linters and checks we are going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://flake8.pycqa.org/en/latest/" rel="noopener noreferrer"&gt;flake8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://flakeheaven.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;flakeheaven&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/psf/black" rel="noopener noreferrer"&gt;black&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PyCQA/isort" rel="noopener noreferrer"&gt;isort&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: the author assumes you are familiar with the above-mentioned linters, tools, and checks.&lt;/p&gt;

&lt;p&gt;I would like to share how to configure them for a Python project. I prepared a full &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration" rel="noopener noreferrer"&gt;GitHub Actions Python configuration demo repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;flakeheaven&lt;/code&gt; as a &lt;code&gt;flake8&lt;/code&gt; wrapper, which is very easy to configure in one single &lt;code&gt;pyproject.toml&lt;/code&gt;. The whole &lt;code&gt;pyproject.toml&lt;/code&gt; configuration file can be found in&lt;br&gt;
a &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration/blob/main/pyproject.toml" rel="noopener noreferrer"&gt;demo repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-pyproject-config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-pyproject-config.png%3Fraw%3Dtrue" alt="pyproject.toml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would say the config file is self-explanatory, so I will not dwell on it for long. Just a few notes about tiny tweaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disable unwanted checks
&lt;/h3&gt;

&lt;p&gt;A few checks that we don't want to see complaints about:&lt;/p&gt;

&lt;h4&gt;
  
  
  Documentation
&lt;/h4&gt;

&lt;p&gt;The default &lt;code&gt;flakeheaven&lt;/code&gt; configuration assumes every component is documented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint utils.py

utils.py
     1:   1 C0114 Missing module docstring (missing-module-docstring) [pylint]
  def custom_sum(first: int, second: int) -&amp;gt; int:
  ^
     1:   1 C0116 Missing function or method docstring (missing-function-docstring) [pylint]
  def custom_sum(first: int, second: int) -&amp;gt; int:
  ^
     5:   1 C0116 Missing function or method docstring (missing-function-docstring) [pylint]
  def custom_multiplication(first: int, second: int) -&amp;gt; int:


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are OK if not every module is documented, and likewise for functions and methods. We are not going to push documentation for documentation's sake. So we want to disable the &lt;code&gt;C0114&lt;/code&gt; and &lt;code&gt;C0116&lt;/code&gt; checks from pylint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-docs.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-docs.png%3Fraw%3Dtrue" alt="flakeheaven disable docs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Import error
&lt;/h4&gt;

&lt;p&gt;Our linter requirements live in a separate file, and we don't aim to mix them with our main production requirements. Hence, the linter would complain about imports, since the linter environment, quite obviously, does not have the production libraries installed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint . 

dags/dummy.py
     3:   1 E0401 Unable to import 'airflow' (import-error) [pylint]
  from airflow import DAG
  ^
     4:   1 E0401 Unable to import 'airflow.operators.dummy_operator' (import-error) [pylint]
  from airflow.operators.dummy_operator import DummyOperator
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So we need to disable &lt;code&gt;E0401&lt;/code&gt; check from &lt;code&gt;pylint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-import-checks.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-import-checks.png%3Fraw%3Dtrue" alt="flakeheaven disable import checks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We assume that the developer who writes the code and imports the libraries is responsible for writing reliable tests. If a test does not pass, something is wrong with the import or with the code (logic) itself. Thus, the import check is not something we want to delegate to the linter.&lt;/p&gt;

&lt;p&gt;Alternatively, this check can be disabled per line by appending &lt;code&gt;# noqa: E0401&lt;/code&gt; after the import statement: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: E0401
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.dummy_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DummyOperator&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: E0401
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Tweaks for airflow code
&lt;/h4&gt;

&lt;p&gt;There are also a few tweaks needed for Airflow DAG code. Here is a dummy example, &lt;code&gt;dummy.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fpython-airflow-tasks-order.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fpython-airflow-tasks-order.png%3Fraw%3Dtrue" alt="python dummy DAG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we run &lt;code&gt;flakeheaven&lt;/code&gt; with the default configuration we would see the following error:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint .                                                       

dags/dummy.py
    17:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; dummy_operator_2
  ^
    18:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; dummy_operator_3
  ^
    19:   9 W503 line break before binary operator [pycodestyle]
  &amp;gt;&amp;gt; [dummy_operator_4, dummy_operator_5, dummy_operator_6, dummy_operator_7]
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, we want to keep each task on its own line, hence we need to disable &lt;code&gt;W503&lt;/code&gt; from pycodestyle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-diable-line-break-W503.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-diable-line-break-W503.png%3Fraw%3Dtrue" alt="disable W503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, with the default configuration we would get the following warning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;gt;&amp;gt;&amp;gt; python -m flakeheaven lint .                                                       

dags/dummy.py
    15:   5 W0104 Statement seems to have no effect (pointless-statement) [pylint]
  (
  ^


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is about how we specify task order. The workaround here is to exclude &lt;code&gt;W0104&lt;/code&gt; from pylint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-statement-no-effect-W0104.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fflakeheaven-disable-statement-no-effect-W0104.png%3Fraw%3Dtrue" alt="disable W0104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More info about the rules can be found on the &lt;a href="https://www.flake8rules.com/" rel="noopener noreferrer"&gt;flake8 rules page&lt;/a&gt;.&lt;/p&gt;
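&lt;p&gt;Putting the tweaks above together, the relevant part of &lt;code&gt;pyproject.toml&lt;/code&gt; might look roughly like this (a sketch; the exact plugin lists depend on your setup and versions, see the demo repo for the full file):&lt;/p&gt;

```toml
[tool.flakeheaven.plugins]
# enable everything, then drop the checks discussed above
pycodestyle = ["+*", "-W503"]        # allow line breaks before binary operators
pyflakes = ["+*"]
pylint = ["+*",
          "-C0114", "-C0116",        # no mandatory module/function docstrings
          "-E0401",                  # imports are verified by tests, not the linter
          "-W0104"]                  # Airflow task-order statements
```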

&lt;h2&gt;
  
  
  GitHub workflow actions CI/CD configurations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: author assumes you are familiar with &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub actions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We configure GitHub Workflow to be triggered on every PR against the main (master) branch.&lt;/p&gt;

&lt;p&gt;The whole &lt;code&gt;py_linter.yml&lt;/code&gt; config can be found in a &lt;a href="https://github.com/iamtodor/demo-github-actions-python-linter-configuration/blob/main/.github/workflows/py_linter.yml" rel="noopener noreferrer"&gt;demo repo&lt;/a&gt;. I will walk you through it step by step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-full-v3.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-full-v3.png%3Fraw%3Dtrue" alt="py_linter.yml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When to run it
&lt;/h3&gt;

&lt;p&gt;We are interested in running the linter only when a PR touches &lt;code&gt;.py&lt;/code&gt; files. For instance, when we update &lt;code&gt;README.md&lt;/code&gt;, there is no point in running a Python linter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-py-push-pr.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-py-push-pr.png%3Fraw%3Dtrue" alt="configure run workflow on PRs and push"&gt;&lt;/a&gt;&lt;/p&gt;
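&lt;p&gt;The trigger section of the workflow might be sketched like this (branch names are assumptions; the exact file lives in the demo repo):&lt;/p&gt;

```yaml
on:
  pull_request:
    branches: [main, master]
    paths:
      - "**.py"   # run the workflow only when Python files change
```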

&lt;h3&gt;
  
  
  What files does it run against
&lt;/h3&gt;

&lt;p&gt;We are interested in running the linter only against the modified files. For example, in the provided repo, if I update &lt;code&gt;dags/dummy.py&lt;/code&gt;, I don't want to waste time and resources running the linter against &lt;code&gt;main.py&lt;/code&gt;. For this purpose we use the &lt;a href="https://github.com/dorny/paths-filter" rel="noopener noreferrer"&gt;Paths Filter GitHub Action&lt;/a&gt;, which is very flexible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-v2.png%3Fraw%3Dtrue" alt="Paths Filter GitHub Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we have modified a &lt;code&gt;.py&lt;/code&gt; file and any other files such as &lt;code&gt;.toml&lt;/code&gt; in one PR, we don't want to run a linter against the non-python files, so we configure filtering only for &lt;code&gt;.py&lt;/code&gt; files no matter the location: root, tests, src, etc.&lt;/p&gt;

&lt;p&gt;A changed file can have one of the following statuses: &lt;code&gt;added&lt;/code&gt;, &lt;code&gt;modified&lt;/code&gt;, or &lt;code&gt;deleted&lt;/code&gt;. There is no reason to run the linter against deleted files: the workflow would simply fail, because those files are no longer in the repo. So we need to configure which changes should trigger the linter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-modified-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-modified-v2.png%3Fraw%3Dtrue" alt="added|modified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I define a variable that holds the output (only the &lt;code&gt;.py&lt;/code&gt; files) from the previous filter. This variable contains the modified &lt;code&gt;.py&lt;/code&gt; files, which I can then pass to &lt;code&gt;flakeheaven&lt;/code&gt;, &lt;code&gt;black&lt;/code&gt;, and &lt;code&gt;isort&lt;/code&gt;. By default, the output is disabled, and "Paths Changes Filter" allows you to customize it: you can list the files in &lt;code&gt;.csv&lt;/code&gt;, &lt;code&gt;.json&lt;/code&gt;, or in &lt;code&gt;shell&lt;/code&gt; mode. Linters accept files separated by spaces, so our choice here is &lt;code&gt;shell&lt;/code&gt; mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-list-shell.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-paths-filter-list-shell.png%3Fraw%3Dtrue" alt="list files shell"&gt;&lt;/a&gt;&lt;/p&gt;
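&lt;p&gt;The filter step might be sketched like this (the filter name matches the &lt;code&gt;py_scripts_filter&lt;/code&gt; output referenced later; treat it as an assumption rather than a verbatim copy of our config):&lt;/p&gt;

```yaml
- uses: dorny/paths-filter@v2
  id: filter
  with:
    # consider only added or modified .py files, anywhere in the repo
    filters: |
      py_scripts_filter:
        - added|modified: "**.py"
    # emit matching files space-separated, ready for linter CLIs
    list-files: shell
```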

&lt;h3&gt;
  
  
  Run linter itself
&lt;/h3&gt;

&lt;p&gt;The next and last step is to run the linter itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-step.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-step.png%3Fraw%3Dtrue" alt="run linter step"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we run the linter on the changed files, we check whether any &lt;code&gt;.py&lt;/code&gt; files actually came out of the previous step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-check-for-changes.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-check-for-changes.png%3Fraw%3Dtrue" alt="check if there are .py files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, using the aforementioned output variable, we can safely pass the content of &lt;code&gt;steps.filter.outputs.py_scripts_filter_files&lt;/code&gt; to the linters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-commands.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Fgh-config-run-linter-commands.png%3Fraw%3Dtrue" alt="linter commands"&gt;&lt;/a&gt;&lt;/p&gt;
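&lt;p&gt;The resulting step might be sketched like this (a hedged example built from the filter outputs above; the exact commands live in the demo repo):&lt;/p&gt;

```yaml
- name: Run linters on changed files
  # skip entirely when the PR contains no added/modified .py files
  if: steps.filter.outputs.py_scripts_filter == 'true'
  run: |
    python -m flakeheaven lint ${{ steps.filter.outputs.py_scripts_filter_files }}
    python -m black --check ${{ steps.filter.outputs.py_scripts_filter_files }}
    python -m isort --check-only ${{ steps.filter.outputs.py_scripts_filter_files }}
```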

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's all I would like to share. I hope it is useful for you, and that you can utilize this experience and knowledge. &lt;/p&gt;

&lt;p&gt;I wish you to see these successful checks every time you push your code :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Flinter-success.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fiamtodor%2Fdemo-github-actions-python-linter-configuration%2Fblob%2Fmain%2Farticle%2Fimg%2Flinter-success.png%3Fraw%3Dtrue" alt="success linter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions feel free to ask in a comment section, I will do my best to provide a comprehensive answer for you. &lt;/p&gt;

&lt;p&gt;Question to you: do you have linter checks as a part of your CI/CD?&lt;/p&gt;

&lt;h2&gt;
  
  
  Contact
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, I invite you to connect with me on &lt;a href="https://www.linkedin.com/in/iamtodor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.&lt;/p&gt;

</description>
      <category>python</category>
      <category>github</category>
      <category>cicd</category>
      <category>linter</category>
    </item>
  </channel>
</rss>
