<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: biellls</title>
    <description>The latest articles on Forem by biellls (@biellls).</description>
    <link>https://forem.com/biellls</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F256646%2Fa35402cf-c649-402f-8216-e0ff63bbf0fb.png</url>
      <title>Forem: biellls</title>
      <link>https://forem.com/biellls</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/biellls"/>
    <language>en</language>
    <item>
      <title>Impress your friends! Make a serverless bot that sends daily jokes to a Telegram Group</title>
      <dc:creator>biellls</dc:creator>
      <pubDate>Thu, 31 Mar 2022 20:29:49 +0000</pubDate>
      <link>https://forem.com/biellls/impress-your-friends-make-a-serverless-bot-that-sends-daily-jokes-to-a-telegram-group-493n</link>
      <guid>https://forem.com/biellls/impress-your-friends-make-a-serverless-bot-that-sends-daily-jokes-to-a-telegram-group-493n</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/typhoon-data-org/typhoon-orchestrator" rel="noopener noreferrer"&gt;Typhoon Orchestrator&lt;/a&gt; is a great way to deploy ETL workflow on AWS Lambda. In this tutorial we intend to show how easy to use and versatile it is by deploying code to Lambda that gets a random joke from &lt;a href="https://jokeapi.dev" rel="noopener noreferrer"&gt;https://jokeapi.dev&lt;/a&gt; once a day and sends it to your telegram group.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;The first thing you need to do is install typhoon and the rest of the dependencies needed for this tutorial, preferrably in a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;typhoon-orchestrator[dev]
pip &lt;span class="nb"&gt;install &lt;/span&gt;python-telegram-bot
pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we create our project, we will call our project jester (we could call it anything).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;typhoon init jester &lt;span class="nt"&gt;--template&lt;/span&gt; minimal
&lt;span class="nb"&gt;cd &lt;/span&gt;jester
typhoon status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the status command gives us the following warning: &lt;code&gt;Connections YAML not found. To add connections create connections.yml&lt;/code&gt;. This is normal because typhoon normally uses a metadata database where you can store connections and variables, but we don’t want to create and use any DynamoDB tables for this tutorial so we used the minimal template that doesn’t include anything related to the metadata database. If you see any warnings about the metadata database during the course of the tutorial don’t worry, it’s for the same reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tell me a joke!
&lt;/h2&gt;

&lt;p&gt;Before we worry about telegram, let’s create a workflow that calls the joke API and prints the joke on your CLI. Create the file: &lt;code&gt;dags/send_me_a_joke.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;send_me_a_joke&lt;/span&gt;
&lt;span class="na"&gt;schedule_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily'&lt;/span&gt;

&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;get_joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.http.get_raw&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&amp;amp;type=single&lt;/span&gt;

  &lt;span class="na"&gt;select_joke_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_joke&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.json.search&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH.response.json()&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;joke&lt;/span&gt;

  &lt;span class="na"&gt;tell_joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;select_joke_text&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.debug.echo&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow has three tasks using built-in functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;get_joke&lt;/strong&gt;: Calls the joke API and gets a response like to the following:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Programming"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"single"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"joke"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A man is smoking a cigarette and blowing smoke rings into the air. His girlfriend becomes irritated with the smoke and says &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Can't you see the warning on the cigarette pack? Smoking is hazardous to your health!&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; to which the man replies, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;I am a programmer.  We don't worry about warnings; we only worry about errors.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"flags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nsfw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"religious"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"political"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"racist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sexist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explicit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"safe"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lang"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;select_joke_text&lt;/strong&gt;: Uses a &lt;a href="https://jmespath.org/" rel="noopener noreferrer"&gt;JMESPath&lt;/a&gt; expression to select data from the JSON text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;tell_joke&lt;/strong&gt;: Prints the joke text.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;!Py&lt;/code&gt; tag means that instead of passing it a YAML object, you are passing it a string representing python code to run. For example, &lt;code&gt;foo: 4&lt;/code&gt; is equivalent to &lt;code&gt;foo: !Py 2+2&lt;/code&gt;.  &lt;code&gt;$BATCH&lt;/code&gt; is a special variable that holds whatever the previous function returned or yielded. In the case of the &lt;code&gt;select_joke_test&lt;/code&gt; task where the input is the &lt;code&gt;get_joke&lt;/code&gt; task, its function returned a NamedTuple with a response and some metadata, so that &lt;code&gt;$BATCH.response&lt;/code&gt;is a requests.Response object.&lt;/p&gt;

&lt;p&gt;Lets run to see a joke in our terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;typhoon dag run &lt;span class="nt"&gt;--dag-name&lt;/span&gt; send_me_a_joke
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Piece of cake! But here comes the interesting part...&lt;/p&gt;

&lt;h2&gt;
  
  
  I want the joke on telegram
&lt;/h2&gt;

&lt;p&gt;There is no built-in function in Typhoon to send a text to a telegram chat. Fortunately it’s very easy to extend Typhoon, so let’s make it ourselves.&lt;/p&gt;

&lt;p&gt;Create the following file &lt;code&gt;functions/msg.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;telegram&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_message_telegram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Given a telegram bot token, chat_id and message,
       send the message to that chat&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;telegram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sending message &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And update the DAG file we created before at &lt;code&gt;dags/send_me_a_joke.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;send_me_a_joke&lt;/span&gt;
&lt;span class="na"&gt;schedule_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0 10 * * *&lt;/span&gt;  &lt;span class="c1"&gt;# Send the joke at 10am every day&lt;/span&gt;

&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;get_joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.http.get_raw&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&amp;amp;type=single&lt;/span&gt;

  &lt;span class="na"&gt;select_joke_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_joke&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.json.search&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH.response.json()&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;joke&lt;/span&gt;

  &lt;span class="na"&gt;tell_joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;select_joke_text&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;functions.msg.send_message_telegram&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH&lt;/span&gt;
      &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Var&lt;/span&gt; &lt;span class="s"&gt;telegram_token&lt;/span&gt;
      &lt;span class="na"&gt;chat_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Var&lt;/span&gt; &lt;span class="s"&gt;chat_id&lt;/span&gt;

&lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;python-telegram-bot&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that for the token and chat id we have the &lt;code&gt;!Var&lt;/code&gt; tag. This is because we don’t want to include a secret like a token in the code, so we will read it from a variable. If you are really perceptive you may be thinking: “Didn’t you say that we are using a minimal deployment where there is no metadata database to store variables on?” Yes, that’s 100% correct. Usually we would store variables in the metadata database. However, we will use the alternate method of storing variables which is using an environment variable that starts with &lt;code&gt;TYPHOON_VARIABLE_&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To create a bot with the botfather and &lt;strong&gt;get a token&lt;/strong&gt; follow the official tutorial &lt;a href="https://core.telegram.org/bots#creating-a-new-bot" rel="noopener noreferrer"&gt;https://core.telegram.org/bots#creating-a-new-bot&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To &lt;strong&gt;find out your chat ID&lt;/strong&gt; check out &lt;a href="https://stackoverflow.com/questions/32423837/telegram-bot-how-to-get-a-group-chat-id" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/32423837/telegram-bot-how-to-get-a-group-chat-id&lt;/a&gt;. Keep in mind that &lt;strong&gt;you can only add the bot to group chats, not private conversations&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TYPHOON_VARIABLE_telegram_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"MY_SECRET_TELEGRAM_TOKEN"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TYPHOON_VARIABLE_chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"128332492187641"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have everything ready, let’s send some jokes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;typhoon dag run &lt;span class="nt"&gt;--dag-name&lt;/span&gt; send_me_a_joke
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything was correctly set up you should get the notification with a random programmer joke!&lt;/p&gt;

&lt;h2&gt;
  
  
  Aiming for the clouds
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Build and upload the workflow
&lt;/h3&gt;

&lt;p&gt;This is all well and good, but we want the bot to tell us a joke every day without needing to run the code locally. First of all let’s compile our code into a zip and upload it to S3 so that Lambda can use it. This can be a little tedious, but luckily Typhoon takes care of that for us. We need to tell it to which S3 bucket we want to deploy to. &lt;strong&gt;You will also need a configured &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html#cli-configure-profiles-create" rel="noopener noreferrer"&gt;AWS profile&lt;/a&gt;&lt;/strong&gt;. Open the &lt;code&gt;.typhoonremotes&lt;/code&gt; file and modify it to use your profile and S3 bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[test]
aws-profile=myaws
s3-bucket=typhoon-orchestrator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have a remote called &lt;code&gt;test&lt;/code&gt; we are ready to create the zip files and push them to S3. You will need to have docker installed for this step because the dependencies need to be built in an OS that is compatible with the one Lambda is using, otherwise they won’t work. This is a very common source of problems that Typhoon helps you avoid. If you are sure that your OS is compatible you can add the flag &lt;code&gt;--build-deps-locally&lt;/code&gt;, but &lt;strong&gt;it is generally not recomended&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;typhoon dag push --dag-name send_me_a_joke test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will have taken a very long time because Typhoon built all of the dependencies, but don’t worry &lt;strong&gt;updating the workflow code is much much faster since the dependencies are separated into a layer and don’t need to be re-deployed unless they change&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;test&lt;/code&gt; at the end tells it what remote to deploy to. In the future we could add a different production environment with its own remote.&lt;/p&gt;

&lt;p&gt;If you check your S3 bucket now you’ll find two files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The lambda code:&lt;/strong&gt; &lt;code&gt;typhoon_dag_builds/send_me_a_joke/lambda.zip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All the necessary dependencies&lt;/strong&gt;: &lt;code&gt;typhoon_dag_builds/send_me_a_joke/layer.zip&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfjtgfcfs5q19ay8jvrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfjtgfcfs5q19ay8jvrp.png" alt="S3 objects"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying infrastructure
&lt;/h3&gt;

&lt;p&gt;For this part you will need to &lt;a href="https://learn.hashicorp.com/tutorials/terraform/install-cli" rel="noopener noreferrer"&gt;install and set up terraform&lt;/a&gt;. Learn more about infrastructure as code &lt;a href="https://learn.hashicorp.com/tutorials/terraform/infrastructure-as-code?in=terraform/aws-get-started" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typhoon automatically creates some terraform files that describe all the necessary infrastructure&lt;/strong&gt; to create in order to deploy our workflow to AWS Lambda. This greatly simplifies the creation of all the necessary resources that you would otherwise need to create manually. More importantly, it provides you a starting point while also giving you full control to change the terraform files until you have the desired configuration.&lt;/p&gt;

&lt;p&gt;For this tutorial you just need to &lt;strong&gt;update the test variables file to include the S3 bucket name and some DAG info&lt;/strong&gt;. We can get the info for all the dags by running &lt;code&gt;typhoon dag info --json-output --indent 2&lt;/code&gt;, but in this case we will need to adapt it to include the necessary environment variables. This means that you will need to add the following to the file &lt;code&gt;terraform/test.tfvars&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dag_info = {
    "send_me_a_joke": {
        "schedule_interval": "cron(0 10 * * ? *)",
        "environment": {
            "TYPHOON_VARIABLE_telegram_token": "MY_SECRET_TELEGRAM_TOKEN",
            "TYPHOON_VARIABLE_chat_id": "128332492187641"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the schedule interval is in a different format than the one we defined. This is because Terraform maps to AWS resources, and AWS uses its own flavor of cron expressions which is incompatible with the standard Unix cron expressions used by tools like cron, crontab, Airflow and many more. Typhoon aims to be a framework that can deploy to many platforms (currently supports AWS Lambda and Airflow) so we decided to follow the industry standard instead of AWS’s. Luckily, when we run &lt;code&gt;typhoon dag info ...&lt;/code&gt; &lt;strong&gt;Typhoon converts it to AWS’s standard so you don’t need to do that yourself!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we are ready to create the infrastructure with terraform.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-aws-profile
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;eu-west-1
&lt;span class="nb"&gt;cd &lt;/span&gt;terraform
terraform init
terraform plan &lt;span class="nt"&gt;-var-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test.tfvars &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform apply tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voila! You can check all of the resources that have been created in AWS and take a moment to appreciate how much time we’ve saved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymarmji5bglvr7axo16r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymarmji5bglvr7axo16r.png" alt="AWS Lambda Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s take it for a spin
&lt;/h2&gt;

&lt;p&gt;If everything worked correctly you will get a joke in your telegram chat at 10am, but we don’t want to wait that long, we want to hear one now! You could invoke the Lambda from the AWS console, but we will invoke it with Typhoon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;typhoon dag run &lt;span class="nt"&gt;--dag-name&lt;/span&gt; send_me_a_joke &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully you got a hilarious joke sent right to your group chat.&lt;/p&gt;

&lt;p&gt;This is the same command we used earlier to run the workflow locally, but with &lt;code&gt;test&lt;/code&gt;at the end specifying that we want to run it in the remote environment. This has invoked a lambda and shown you the logs. Actually, to be more precise, it has invoked a Lambda that has then invoked another Lambda and then invoked another Lambda. Why? Because Typhoon is asynchronous by default which means that as soon as a function returns or yields a batch we invoke a new Lambda to process it. This is useful because you can have a lot of tasks performing work in parallel. For example, imagine you have a workflow that reads FTP CSV files, zips them up and uploads to S3. The first task could list all the CSV files in the FTP and yields each path as a batch. Then the next task will compress them which can take a long time, but we actually invoked a new Lambda instance for each batch so we are processing them all in parallel!&lt;/p&gt;

&lt;p&gt;Notice how even though the workflow ran across three lambdas, you still got the full log in your terminal. Lambdas can be hard to monitor and debug, but Typhoon tries to make this process easier. This is why when you run a Typhoon DAG manually, it waits for a response so that it can print the logs. Every invocation will in turn also wait for the response of any Lambdas it invokes so you will end up with the full log no matter how many Lambda invocations the workflow ran on. It’s extremely useful to be able to see if the DAG is working correctly, but it does introduce synchronicity so the DAG will run slower. We believe it’s a worthwile tradeoff for manual invocations. &lt;strong&gt;Rest assured that when the workflow is triggered on schedule it will run at full speed&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why can’t I just run everything in one lambda?
&lt;/h2&gt;

&lt;p&gt;Great question, and there’s no reason not to since our worflow is very light and doesn’t benefit from parallelism. You just need to modify the first two tasks to make them synchronous with &lt;code&gt;asynchronous: False&lt;/code&gt;. This is the relevant part of the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;get_joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.http.get_raw&lt;/span&gt;
    &lt;span class="na"&gt;asynchronous&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&amp;amp;type=single&lt;/span&gt;

  &lt;span class="na"&gt;select_joke_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_joke&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.json.search&lt;/span&gt;
    &lt;span class="na"&gt;asynchronous&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH.response.json()&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;joke&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lets build and deploy the code, this time without dependencies by using the flag &lt;code&gt;--code&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;typhoon dag push &lt;span class="nt"&gt;--dag-name&lt;/span&gt; send_me_a_joke &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, that was much faster! You can see that once the workflow has been deployed one time with all the dependencies, &lt;strong&gt;making changes and deploying them is very fast and easy&lt;/strong&gt;. Feel free to run the DAG again to check out how only one Lambda will be invoked now.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is good to be true, can I really build all my ETLs like this?
&lt;/h2&gt;

&lt;p&gt;Yes and no... Depending on your use case Lambda can be a good fit, but there are currently some limitations to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambdas can only run for 15 minutes&lt;/strong&gt;. If you have a long running task this will not work for you. In the future &lt;strong&gt;we intend to support Fargate to run heavier tasks&lt;/strong&gt; and solve this issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can we really do away with the scheduler?&lt;/strong&gt; We have shown you a utopian vision of the future of ETLs. It still remains to be seen if we can fully avoid running a scheduler, and we may run into the harsh reality that if you want to be able to implement sensors, rate-limit tasks, etc. we may need a scheduler. Even if that turns out to be true, it would always be opt-in and much simpler than a traditional one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Does that mean that Typhoon is not ready for prime time?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Absolutely not!&lt;/strong&gt; We may have a long (albeit exciting) path ahead to realize our vision of a battle tested, fully serverless, asynchronous workflow orchestrator, but AWS is not the only target. &lt;strong&gt;Typhoon supports compilation to native Airflow code&lt;/strong&gt;, the most popular orchestrator around today. This feature can bridge the gap between the simplicity of our vision and the complex reality we currently live in as Data Engineers.&lt;/p&gt;

&lt;p&gt;Our hope is that you will use Typhoon and fall in love with the simplicity of our vision, and deploy to Airflow if the current state of AWS deployment can’t meet your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up
&lt;/h2&gt;

&lt;p&gt;If you want to clean up all the resources that were created on this tutorial run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan &lt;span class="nt"&gt;-var-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test.tfvars &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan &lt;span class="nt"&gt;-destroy&lt;/span&gt;
terraform apply &lt;span class="nt"&gt;-destroy&lt;/span&gt; tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Thanks for following along!
&lt;/h2&gt;

&lt;p&gt;If you enjoyed this tutorial we hope to see you soon at &lt;a href="https://github.com/typhoon-data-org/typhoon-orchestrator" rel="noopener noreferrer"&gt;https://github.com/typhoon-data-org/typhoon-orchestrator&lt;/a&gt;. Check out the code, leave a star, open an issue or come say hi on our discord!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Modern data warehouse patterns: ELT with Snowflake variants </title>
      <dc:creator>biellls</dc:creator>
      <pubDate>Sat, 29 Jan 2022 19:02:14 +0000</pubDate>
      <link>https://forem.com/biellls/modern-data-warehouse-patterns-elt-with-snowflake-variants-26b4</link>
      <guid>https://forem.com/biellls/modern-data-warehouse-patterns-elt-with-snowflake-variants-26b4</guid>
      <description>&lt;h2&gt;
  
  
  Leveraging semi-structured data for resilience against schema changes
&lt;/h2&gt;

&lt;p&gt;As data warehouse technologies get cheaper and better, ELT is gaining momentum over ETL. In this article we will show you how to leverage Snowflake's semi-structured data to build integrations that are highly resistant to changes in schema while staying performant. Schema changes are one of the most common things that can break a data pipeline (adding and removing fields, changes in types or length of the data etc.) so it is extremely useful to protect yourself against them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real world example- Personal information
&lt;/h2&gt;

&lt;p&gt;Let's assume we have a table with basic information about our clients. The goal is to load the information into snowflake unchanged.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;surname&lt;/th&gt;
&lt;th&gt;age&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anne&lt;/td&gt;
&lt;td&gt;Houston&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;John&lt;/td&gt;
&lt;td&gt;Doe&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;William&lt;/td&gt;
&lt;td&gt;Williams&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We would usually create the following table in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clients&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;surname&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how we don't specify the varchar's length or the number's precision and scale. This is preferable because snowflake will automatically use the minimum size needed to store the data efficiently, and if the source system changes the length of a varchar, or the precision of a number your flows won't break. An exception is when a number has decimals we will need to specify a precision and scale.&lt;/p&gt;

&lt;p&gt;But if we do that, our integration will fail if a field is removed, and if a field is added we won't notice. We are not resilient to schema changes. To solve that we will instead create a table with just one variant field where we will load all the data, no matter what fields it has.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="n"&gt;VARIANT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to load the data we can dump it as JSON into a stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;json_format&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;STAGE&lt;/span&gt; &lt;span class="n"&gt;mystage&lt;/span&gt; &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_format&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a file with some JSON data to load into the table. Run the following in a shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "Anne", "surname": "Houston", "age": 37}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "John", "surname": "Doe", "age": 21}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "William", "surname": "Williams", "age": 26}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we run this in snowflake to upload the data to a stage and then load the data into the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;mystage&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;mystage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_format&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now query the data as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating a view
&lt;/h3&gt;

&lt;p&gt;It is easy to query the data, but it can be verbose and a little confusing to analysts who have never worked with unstructured data. In order to make it transparent to the end user we can create a view that turns it into structured data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;clients&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same query from before would now be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now it's indistinguishable from a structured table from the user’s point of view.&lt;/p&gt;

&lt;h3&gt;
  
  
  Removing a column, adding a column
&lt;/h3&gt;

&lt;p&gt;Suppose that database admins realized that storing age in a column is not ideal, since it needs to be updated every time a client has a birthday. Instead he decides to drop the age column and store a date with their birthday. The new table is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;surname&lt;/th&gt;
&lt;th&gt;birthday&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anne&lt;/td&gt;
&lt;td&gt;Houston&lt;/td&gt;
&lt;td&gt;1984-03-12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;John&lt;/td&gt;
&lt;td&gt;Doe&lt;/td&gt;
&lt;td&gt;2000-01-03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;William&lt;/td&gt;
&lt;td&gt;Williams&lt;/td&gt;
&lt;td&gt;1995-02-04&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let's create the new data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "Anne", "surname": "Houston", "birthday": "1984-03-12"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "John", "surname": "Doe", "birthday": "2000-01-03"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "William", "surname": "Williams", "birthday": "1995-02-04"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/data.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We would usually append the data, but to make this tutorial simple we will just replace the old data with the new one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;TRUNCATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;mystage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_format&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;mystage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we store all available data as a variant our integration will not break. The view would not break either, but the age would show as null (try &lt;code&gt;SELECT * FROM clients&lt;/code&gt;). The only thing we need to do to take advantage of the new field is to update the view. For backwards compatibility we will still include the age as a calculation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;clients&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'years'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, our pipelines never broke and there’s no need to change our data flows or source table definitions!&lt;/p&gt;

&lt;p&gt;If a new field gets added to the table and no one notices it's still getting staged into snowflake in the variant so the moment someone requests the field in the view we'll be able to see it, without needing to backfill the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Doesn't this take up more space than regular tables? Isn't it slower to query?
&lt;/h3&gt;

&lt;p&gt;This excerpt from &lt;a href="https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure"&gt;Snowflake's docs&lt;/a&gt; answers the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.&lt;br&gt;
For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Dates and timestamps, especially non-ISO 8601dates and timestamps, as string values&lt;/li&gt;
&lt;li&gt;Numbers within strings&lt;/li&gt;
&lt;li&gt;Arrays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.&lt;/p&gt;

&lt;blockquote&gt;
&lt;/blockquote&gt;

&lt;p&gt;So in terms of performance and storage it should be really similar albeit a little slower. An exception would be if we need to query the birthday because it's stored as a string, as we will see in the following section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving performance
&lt;/h3&gt;

&lt;p&gt;Because variants store dates as strings, they are not as efficient to filter by. This is only an issue if the table is large and you intend to query the table by that date. Let's see an example of how to improve performance in that case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="n"&gt;VARIANT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthday&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;birthday&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;mystage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_format&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And modify the view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;clients&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;VARCHAR&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="c1"&gt;-- We changed this to get the column directly&lt;/span&gt;
    &lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'years'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthday&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;  &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="c1"&gt;-- Here too&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clients_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now queries filtering by birthday (or getting &lt;code&gt;MAX(birthday)&lt;/code&gt; for example) will be much faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the best way to load the data?
&lt;/h2&gt;

&lt;p&gt;The most efficient way to load the data into a table is by using a COPY command since Snowflake can optimize a bulk load. It can't do that with insert statements. Here are some of the most popular ways to load the data into snowflake, each with their advantges and disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV: A gzipped CSV is the fastest way to load structured data into snowflake. It takes more space than parquet. It can also be loaded into a variant column with the right casting (see example later). The data can not be loaded easily into a variant.&lt;/li&gt;
&lt;li&gt;JSON: Can be easily loaded into a variant, but it takes a lot of space in your data lake.&lt;/li&gt;
&lt;li&gt;Avro: Built-in schema, easily loaded into a variant or into a structured table. Takes more space than parquet.&lt;/li&gt;
&lt;li&gt;Parquet: Columnar storage that has a better compression than the other options and can easily be loaded into a structured or unstructured table. It is slower than CSV to load into a structured table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Loading data into Snowflake using this method is a great way to save you a lot of headaches and minimise data pipeline failures. It is a good rule of thumb to always use this method unless you will be loading an extremely large amount of data and need the extra 20% performance that a  structured table will give you. If you decide to create a structured table instead of using this method be aware that the pipelines will break on any schema change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the best tools to load data like this?
&lt;/h2&gt;

&lt;p&gt;Any ETL/ELT tool that is flexible enough, for instance Airflow can be adapted to use this method. You can also check out our ETL tool that encourages this pattern and other modern best practices for data engineering &lt;a href="https://github.com/typhoon-data-org/typhoon-orchestrator"&gt;https://github.com/typhoon-data-org/typhoon-orchestrator&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com/wp-content/uploads/2015/06/Snowflake_Semistructured_Data_WP_1_0_062015.pdf"&gt;https://www.snowflake.com/wp-content/uploads/2015/06/Snowflake_Semistructured_Data_WP_1_0_062015.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure"&gt;https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure&lt;/a&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
    </item>
    <item>
      <title>Standing on the shoulders of giants. Part one: Airflow</title>
      <dc:creator>biellls</dc:creator>
      <pubDate>Fri, 21 Jan 2022 20:01:29 +0000</pubDate>
      <link>https://forem.com/biellls/standing-on-the-shoulders-of-giants-part-one-airflow-2ac7</link>
      <guid>https://forem.com/biellls/standing-on-the-shoulders-of-giants-part-one-airflow-2ac7</guid>
      <description>&lt;p&gt;Airflow advanced the state of the art in ETL tools by providing an extremely flexible and reliable framework. It is easy to monitor your jobs and you can extend it with plugins to do anything that python can do. It also helped introduce the concept of functional batch data pipelines. By removing state from pipelines and enforcing strict boundaries on partitions of time you can more easily reprocess a partition of data without affecting the rest of it. If you're unfamiliar with that concept, &lt;a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a"&gt;this article&lt;/a&gt; by Airflow's creator Maxime Beauchemin is worth a read.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://github.com/typhoon-data-org/typhoon-orchestrator"&gt;Typhoon&lt;/a&gt; we aim to build on this concept to provide a framework with a stronger focus on software engineering principles. We will illustrate it in the following sections&lt;/p&gt;

&lt;h2&gt;
  
  
  Built for developers
&lt;/h2&gt;

&lt;p&gt;Typhoon was built from the ground up to provide a great experience for developers. Besides providing great &lt;a href="https://typhoon-data-org.github.io/typhoon-orchestrator/index.html#auto-completion"&gt;Intellisense&lt;/a&gt; it helps you implement software best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testable
&lt;/h2&gt;

&lt;p&gt;Airflow is notoriously hard to test. Operators force coupling between logic, execution context and framework. Let's look at a really simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExchangeRates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseOperator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;symbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_conn_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http_conn_id&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;'start_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'execution_date'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="s"&gt;'end_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'next_execution_date'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;full_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history'&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Calling endpoint &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;full_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; for dates between &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'base'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'symbols'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;
        &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HttpsHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'get'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_conn_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'task_instance'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'response'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All your logic is in the execute function of your opertor, so in order to run a test you need to import airflow and create an instance of the operator. Not only that, but we need to provide a context similar to the one that airflow would provide. Finally, you would need to mock xcom and see that it's called with the value you expect it is. This is only a simple example but it can get much more complex once there is a source and a destination in the same component, magic macro rendering and more. Just in case this doesn't sound complex enough, notice we create a hook from its connection id. Yeah, you'll need to mock the airflow database too or spin up a temporary one. Good luck with that.&lt;/p&gt;

&lt;p&gt;In contrast, the logic for typhoon tasks lives inside regular python functions. They don't make use of the framework unless they use a hook and even then it can be instantiated without a metadata database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_exchange_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HTTPHook&lt;/span&gt;
        &lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'start_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'end_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;full_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history'&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Calling endpoint &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;full_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; for dates between &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'base'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'symbols'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Testing this is as easy as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_xr_get_history&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;symbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'EUR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'PHP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'HKD'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HTTPSHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ConnParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'https_hook'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'method'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'get'&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exchange_rates_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'rates'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'start_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'end_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'base'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rates'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rates'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We don't need to import the framework, mock anything or have a database running. We just give it some input and test the output. This takes the functional aspect in functional data pipelines even further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composable
&lt;/h2&gt;

&lt;p&gt;Composability is one of the principles of good software engineering because it enables you to reuse existing functions or objects in order to achieve new behaviour. Airflow gets in the way of that by coupling context, as we explained in the previous section, but also by encouraging task isolation. Tasks can't pass data between them, only some metadata through XCom and even that is discouraged. That means that you can't have an FTPExtractOperator and an S3LoadOperator, you need an FTPToS3Operator and every other possible combination of sources and destinations. This does not compose well as you end up with a lot of repeated code across different operators just because you can't easily reuse the logic.&lt;/p&gt;

&lt;p&gt;In typhoon tasks can pass any data between them without any performance penalty. You can have a function that extracts data from a source and another one that loads into a destination. You can reuse those functions in any other DAG that uses that source or destination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example&lt;/span&gt;
&lt;span class="na"&gt;schedule_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(1 day)&lt;/span&gt;

&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="na"&gt;extract_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.get_data_from_files&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Hook&lt;/span&gt; &lt;span class="s"&gt;my_ftp&lt;/span&gt;
            &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/base/path/*.csv&lt;/span&gt;

    &lt;span class="na"&gt;load_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract_files&lt;/span&gt;
        &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typhoon.filesystem.write_data&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Hook&lt;/span&gt; &lt;span class="s"&gt;my_s3&lt;/span&gt;
            &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;$BATCH.data&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!MultiStep&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;typhoon.files.name($BATCH.path)&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Py&lt;/span&gt; &lt;span class="s"&gt;f'/some/path/{$1}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extensible
&lt;/h2&gt;

&lt;p&gt;There are several ways in which the framework facilitates extension.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Just python&lt;/strong&gt;. One of typhoon's goals is to be easily extensible with regular python code. You can create python functions and call them in your DAGs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interfaces&lt;/strong&gt;. Hooks are grouped into interfaces in a lot of cases where it makes sense to make them interchangeable. This means you can easily switch a hook that writes to files in your OS for local development into an S3 hook for the integration tests and production. More importantly, since a lot of functions take a hook of a specific interface, if you create a new hook that conforms to that interface it will automatically be compatible with all those functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natively support additional connection types&lt;/strong&gt;. When you create a new kind of hook and give it a conn_type, this will be used to discriminate the class when a hook instance is created from a  connection defined in the metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick feedback
&lt;/h2&gt;

&lt;p&gt;Typhoon aims to provide a lightning fast feedback loop on all steps of the DAG creation process. From debug hooks that print whatever is passed to them, to interchangeable hooks so you can easily develop, test and deploy against different targets, to being able to run the whole DAG from the command line instead of needing to schedule it or run independent tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging
&lt;/h2&gt;

&lt;p&gt;Typhoon is designed from the ground up to be easy to debug and it achieves this by compiling to regular python that can be executed locally and debugged from your favorite IDE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it out!
&lt;/h2&gt;

&lt;p&gt;If you're curious on what the future of data pipelines could look like check out Typhoon at &lt;a href="https://github.com/typhoon-data-org/typhoon-orchestrator"&gt;https://github.com/typhoon-data-org/typhoon-orchestrator&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>data</category>
    </item>
    <item>
      <title>Compression: Clearing the Confusion on ZIP, GZIP, Zlib and DEFLATE</title>
      <dc:creator>biellls</dc:creator>
      <pubDate>Wed, 23 Oct 2019 21:14:13 +0000</pubDate>
      <link>https://forem.com/biellls/compression-clearing-the-confusion-on-zip-gzip-zlib-and-deflate-15g1</link>
      <guid>https://forem.com/biellls/compression-clearing-the-confusion-on-zip-gzip-zlib-and-deflate-15g1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Roughly a month ago I found myself puzzled at work as to why 7-ZIP in Windows could not recognize GZIP files that we compressed in python. We used &lt;a href="https://docs.python.org/2/library/zlib.html" rel="noopener noreferrer"&gt;zlib&lt;/a&gt; library, which claims to be "Compression compatible with GZIP." It seemed there was more than meets the eye.&lt;/p&gt;

&lt;p&gt;After a few google searches that left me more confused about what was going on than I initially was I decided to just open a subprocess and use the GNU implementation of GZIP. The final code wasn't too long and compressed in a way that 7-ZIP as well as Snowflake were able to detect automatically. The implementation was quite succinct, so I didn't give it much thought.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gnu_zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error decompressing data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Still, something bugged me. How did zlib and GZIP relate? Why did zlib claim to be compatible with GZIP when it was clearly not the same format? We also noticed that even though Snowflake was unable to detect the zlib compressed file, if we told it that it's a GZIPped file it was able to load it without issues. Clearly the claim of compatibility wasn't completely outlandish.  After a few weeks I decided to dive in and investigate in depth. The following is a summary of what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  DEFLATE
&lt;/h2&gt;

&lt;p&gt;I was surprised to find out that GZIP, zlib or even ZIP are not compression algorithms, they are actually file formats that can permit different compression algorithms. Even more surprising, virtually every implementation of those three actually use the same lossless data compression algorithm. This algorithm is called DEFLATE.&lt;/p&gt;

&lt;p&gt;So are GZIP, zlib and ZIP actually the same? Not quite, and the final size of the compressed file can vary significantly between ZIP and the other two for reasons we will see below, but under the hood the actual compression is done in exactly the same way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zlib.net/feldspar.html" rel="noopener noreferrer"&gt;How does DEFLATE work&lt;/a&gt;? In short, it takes some input data as a stream consisting of a series of blocks of data, then uses a combination of LZ77 algorithm and Huffman coding on each block.&lt;/p&gt;

&lt;p&gt;LZ77 identifies duplicate strings and replaces them with a back reference, which is a pointer to the place where it previously appeared, followed by the length of the string. This is done on the raw data blocks. For a more detailed explanation and example see &lt;strong&gt;Bonus section 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Huffman coding is known as bit reduction, and identifies the commonly used symbols and replaces them with symbols with shorter bit sequences. Infrequently used symbols will be represented with longer bit sequences. This is done on the LZ77 compressed blocks. For a more detailed explanation and example see &lt;strong&gt;Bonus section 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;See the following link for a more in depth explanation of deflate: &lt;a href="https://zlib.net/feldspar.html" rel="noopener noreferrer"&gt;https://zlib.net/feldspar.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ZIP
&lt;/h2&gt;

&lt;p&gt;Released in 1989 and written by Phil Katz, ZIP is the oldest of the compression formats discussed in this article. It is also unique in that it's an archive file format, meaning it can compress multiple files and entire directory structures.&lt;/p&gt;

&lt;p&gt;ZIP applies DEFLATE compression separately to each file it stores and then keeps a central directory structure at the end. This means that it can provide random access to each file which can be read separately, but the final size is larger since the compression does not take advantage of redundancy across files.&lt;/p&gt;

&lt;p&gt;Finally, it also includes a CRC-32 checksum for data integrity.&lt;/p&gt;

&lt;h2&gt;
  
  
  GZIP
&lt;/h2&gt;

&lt;p&gt;After some patent disputes with the unix compress utility, the GZIP format was developed in 1992 by Jean-loup Gailly and Mark Adler using a new implementation of DEFLATE that did not infringe on any patents.&lt;/p&gt;

&lt;p&gt;Unlike ZIP, it is not an archive format. This means it can not compress several files or directories, it just compresses a single file or stream of data. That's why it's frequently combined with the tar utility which can create an archive of files, directories and their attributes in a single file which is then compressed with GZIP. This popular format is called tarball and its files end in .tar.gz. Tarballs do not provide access to the files contained, instead the whole file needs to be read and decompressed in memory before the directory structure can be shown.&lt;/p&gt;

&lt;p&gt;It has a  GZIP wrapper on the compressed data with the filename and other system information, and a CRC-32 checksum at the end to check the integrity of the data. On the other hand, the final size is usually smaller than zip since it does take advantage of redundancy across files.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gzip_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;br&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GzipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileobj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;br&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Zlib&lt;br&gt;
&lt;/h2&gt;

&lt;p&gt;The authors of GZIP later extracted its DEFLATE implementation into a library named &lt;a href="https://www.zlib.net/zlib_tech.html" rel="noopener noreferrer"&gt;zlib&lt;/a&gt; so it could be reused by other formats, most notably PNG images. PNG images replaced the GIF format that was plagued with the same patent issues as unix compress. It is the most popular DEFLATE implementation and is used by many existing programs. Most HTTP servers use zlib to compress their data.&lt;/p&gt;

&lt;p&gt;But that's not all, zlib has the option to use a GZIP wrapper on the compressed data or a lighter zlib wrapper. This means that apart from being a library, zlib can also be considered a compression format that has separate headers from other formats. This is the reason why our files were compatible with GZIP encoding but Snowflake couldn't auto detect them as GZIP, since they had zlib headers. This format is a light wrapper over raw deflate and does not contain a CRC checksum.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zlib&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;zlib_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;zlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Comparison&lt;br&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;zlib&lt;/th&gt;
&lt;th&gt;GZIP&lt;/th&gt;
&lt;th&gt;ZIP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Headers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0x78(01/9C/DA)&lt;/td&gt;
&lt;td&gt;1F8B&lt;/td&gt;
&lt;td&gt;504B0304&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compression format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DEFLATE&lt;/td&gt;
&lt;td&gt;DEFLATE&lt;/td&gt;
&lt;td&gt;DEFLATE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Checksum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;CRC-32&lt;/td&gt;
&lt;td&gt;CRC-32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lossless?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stream / single file&lt;/td&gt;
&lt;td&gt;Stream / single file&lt;/td&gt;
&lt;td&gt;Archive files and directories&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can check the file type with the following code:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="n"&gt;HEADERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zlib-no-compression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;7801&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zlib-default-compression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;789c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zlib-best-compression&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;78da&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1f8b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;# 1f8b08 if it's using deflate (almost always)&lt;br&gt;
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;504b0304&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compression_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;HEADERS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;br&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Bonus 1: LZ77 Algorithm&lt;br&gt;
&lt;/h2&gt;

&lt;p&gt;LZ77 is a lossless compression algorithm that replaces a sequence of symbols which had already appeared previously with a pointer to the place it last appeared and a number indicating the length of the sequence. The notation is &lt;b&gt; where &lt;strong&gt;&lt;em&gt;B&lt;/em&gt;&lt;/strong&gt; is the pointer that indicates how many symbols ago the sequence appeared, and &lt;strong&gt;&lt;em&gt;L&lt;/em&gt;&lt;/strong&gt; is the sequence length.&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdwizxacl9v9kxr6mumd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdwizxacl9v9kxr6mumd5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5ea7qv9lqroebn08wicl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5ea7qv9lqroebn08wicl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LZ77 does not keep a dictionary of sequences (in contrast to LZ78), but instead uses a &lt;strong&gt;sliding window&lt;/strong&gt; to search for them. This means that it only looks back inside the data up to a fixed distance (window). For a more detailed explanation click on the following &lt;a href="https://msdn.microsoft.com/en-us/library/ee916854.aspx" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus 2: Huffman codes
&lt;/h2&gt;

&lt;p&gt;Huffman coding is also a lossless compression algorithm. The main idea is that symbols that appear more often should be encoded with less bits than symbols that appear little, resulting in a shorter file overall.&lt;/p&gt;

&lt;p&gt;Starting with the less used symbols, a leaf node is created for each of them and a tree is created by by joining them together with a parent node whose value is the sum of their frequencies. This process is repeated and we keep joining the less frequent symbols or subtrees until we get a final tree of which the leaves are the symbols. To know how to encode a symbol we need to traverse the subtree where each left node represents a 0 and each right node represents a 1 until we reach the leaf node representing the symbol.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;(abridged from &lt;a href="https://www.thecrazyprogrammer.com/2014/09/huffman-coding-algorithm-with-example.html" rel="noopener noreferrer"&gt;https://www.thecrazyprogrammer.com/2014/09/huffman-coding-algorithm-with-example.html&lt;/a&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Symbol&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Frequency&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fovima8pad5z83zdtfqp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fovima8pad5z83zdtfqp2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgmwrvblo52dzjq1fnbhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgmwrvblo52dzjq1fnbhl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fniytclypht7b5qtnkwce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fniytclypht7b5qtnkwce.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fribfjt19ye20j4ijjqhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fribfjt19ye20j4ijjqhg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Other references
&lt;/h4&gt;

&lt;p&gt;Some great stack overflow answers by Mark Adler, co-author of GZIP and Zlib.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/19120676/how-to-detect-type-of-compression-used-on-the-file-if-no-file-extension-is-spe/19127748#19127748" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/19120676/how-to-detect-type-of-compression-used-on-the-file-if-no-file-extension-is-spe/19127748#19127748&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/19120676/how-to-detect-type-of-compression-used-on-the-file-if-no-file-extension-is-spe/19127748#19127748" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/19120676/how-to-detect-type-of-compression-used-on-the-file-if-no-file-extension-is-spe/19127748#19127748&lt;/a&gt;&lt;/p&gt;

</description>
      <category>compression</category>
      <category>zip</category>
      <category>gzip</category>
      <category>zlib</category>
    </item>
  </channel>
</rss>
