Forem: biellls

Impress your friends! Make a serverless bot that sends daily jokes to a Telegram Group

biellls — Thu, 31 Mar 2022 20:29:49 +0000

Typhoon Orchestrator is a great way to deploy ETL workflows on AWS Lambda. In this tutorial we intend to show how easy to use and versatile it is by deploying code to Lambda that gets a random joke from https://jokeapi.dev once a day and sends it to your Telegram group.

Getting started

The first thing you need to do is install typhoon and the rest of the dependencies needed for this tutorial, preferably in a virtual environment.

pip install typhoon-orchestrator[dev]
pip install python-telegram-bot
pip install requests

Next we create our project. We will call it jester, but you could call it anything.

typhoon init jester --template minimal
cd jester
typhoon status

Notice that the status command gives us the following warning: Connections YAML not found. To add connections create connections.yml. This is normal because typhoon normally uses a metadata database where you can store connections and variables, but we don’t want to create and use any DynamoDB tables for this tutorial so we used the minimal template that doesn’t include anything related to the metadata database. If you see any warnings about the metadata database during the course of the tutorial don’t worry, it’s for the same reason.

Tell me a joke!

Before we worry about telegram, let’s create a workflow that calls the joke API and prints the joke on your CLI. Create the file: dags/send_me_a_joke.yml:

name: send_me_a_joke
schedule_interval: '@daily'

tasks:
  get_joke:
    function: typhoon.http.get_raw
    args:
      url: https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&type=single

  select_joke_text:
    input: get_joke
    function: typhoon.json.search
    args:
      data: !Py $BATCH.response.json()
      expression: joke

  tell_joke:
    input: select_joke_text
    function: typhoon.debug.echo
    args:
      joke: !Py $BATCH

This workflow has three tasks using built-in functions:

get_joke: Calls the joke API and gets a response like the following:

{
    "error": false,
    "category": "Programming",
    "type": "single",
    "joke": "A man is smoking a cigarette and blowing smoke rings into the air. His girlfriend becomes irritated with the smoke and says \"Can't you see the warning on the cigarette pack? Smoking is hazardous to your health!\" to which the man replies, \"I am a programmer.  We don't worry about warnings; we only worry about errors.\"",
    "flags": {
        "nsfw": false,
        "religious": false,
        "political": false,
        "racist": false,
        "sexist": false,
        "explicit": false
    },
    "id": 38,
    "safe": true,
    "lang": "en"
}

select_joke_text: Uses a JMESPath expression to select data from the JSON text.
tell_joke: Prints the joke text.

The !Py tag means that instead of passing a YAML object, you are passing a string of Python code to run. For example, foo: 4 is equivalent to foo: !Py 2+2. $BATCH is a special variable that holds whatever the previous function returned or yielded. In the case of the select_joke_text task, whose input is the get_joke task, that function returned a NamedTuple with a response and some metadata, so $BATCH.response is a requests.Response object.
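To make the data flow concrete, here is a rough plain-Python picture of what the last two tasks do with a batch. It is only a sketch: sample_body is a shortened copy of the response shown above, and select_joke_text and tell_joke are hypothetical stand-ins, not Typhoon's implementation. The JMESPath expression joke simply selects the top-level joke key, which a dictionary lookup imitates here.

```python
import json

# Shortened copy of the API response shown above (assumption: same shape).
sample_body = '{"error": false, "type": "single", "joke": "We only worry about errors.", "id": 38}'


def select_joke_text(data: dict) -> str:
    """Hypothetical stand-in for typhoon.json.search with expression 'joke'."""
    return data["joke"]


def tell_joke(joke: str) -> str:
    """Hypothetical stand-in for typhoon.debug.echo: print the batch and pass it on."""
    print(f"joke={joke}")
    return joke


# In the DAG, $BATCH.response.json() would produce this dictionary.
batch = json.loads(sample_body)
joke_text = tell_joke(select_joke_text(batch))
```

Each function only sees the batch it is given, which is why the tasks can be chained purely through the input key in the YAML.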

Let's run it to see a joke in our terminal:

typhoon dag run --dag-name send_me_a_joke

Piece of cake! But here comes the interesting part...

I want the joke on telegram

There is no built-in function in Typhoon to send a text to a telegram chat. Fortunately it’s very easy to extend Typhoon, so let’s make it ourselves.

Create the following file functions/msg.py:

import telegram

def send_message_telegram(token: str, message: str, chat_id: str) -> str:
    """Given a telegram bot token, chat_id and message,
       send the message to that chat"""
    bot = telegram.Bot(token=token)
    print(f'Sending message {message} to {chat_id}')
    bot.send_message(chat_id=chat_id, text=message)
    return message

And update the DAG file we created before at dags/send_me_a_joke.yml:

name: send_me_a_joke
schedule_interval: 0 10 * * *  # Send the joke at 10am every day

tasks:
  get_joke:
    function: typhoon.http.get_raw
    args:
      url: https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&type=single

  select_joke_text:
    input: get_joke
    function: typhoon.json.search
    args:
      data: !Py $BATCH.response.json()
      expression: joke

  tell_joke:
    input: select_joke_text
    function: functions.msg.send_message_telegram
    args:
      message: !Py $BATCH
      token: !Var telegram_token
      chat_id: !Var chat_id

requirements:
  - python-telegram-bot
  - requests

Notice that for the token and chat id we have the !Var tag. This is because we don’t want to include a secret like a token in the code, so we will read it from a variable. If you are really perceptive you may be thinking: “Didn’t you say that we are using a minimal deployment where there is no metadata database to store variables on?” Yes, that’s 100% correct. Usually we would store variables in the metadata database. However, we will use the alternate method of storing variables which is using an environment variable that starts with TYPHOON_VARIABLE_.

To create a bot with the botfather and get a token follow the official tutorial https://core.telegram.org/bots#creating-a-new-bot
To find out your chat ID check out https://stackoverflow.com/questions/32423837/telegram-bot-how-to-get-a-group-chat-id. Keep in mind that you can only add the bot to group chats, not private conversations.

export TYPHOON_VARIABLE_telegram_token="MY_SECRET_TELEGRAM_TOKEN"
export TYPHOON_VARIABLE_chat_id="128332492187641"
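To see what these exports amount to, here is a minimal sketch of how a variable could be resolved from the environment using the TYPHOON_VARIABLE_ prefix. The get_variable helper is hypothetical and written only for illustration; it is not Typhoon's actual lookup code.

```python
import os

PREFIX = "TYPHOON_VARIABLE_"


def get_variable(name: str, environ=os.environ) -> str:
    """Hypothetical helper: look a variable up by its prefixed environment name."""
    key = PREFIX + name
    if key not in environ:
        raise KeyError(f"Variable {name!r} is not set; export {key} first")
    return environ[key]


# A fake environment is used here so that no real secret is needed.
fake_env = {"TYPHOON_VARIABLE_chat_id": "128332492187641"}
chat_id = get_variable("chat_id", environ=fake_env)
```

The point is that !Var chat_id in the DAG and the exported TYPHOON_VARIABLE_chat_id refer to the same name, with only the prefix added.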

Now that we have everything ready, let’s send some jokes.

typhoon dag run --dag-name send_me_a_joke

If everything was correctly set up you should get the notification with a random programmer joke!

Aiming for the clouds

Build and upload the workflow

This is all well and good, but we want the bot to tell us a joke every day without needing to run the code locally. First of all let’s compile our code into a zip and upload it to S3 so that Lambda can use it. This can be a little tedious, but luckily Typhoon takes care of that for us. We need to tell it which S3 bucket we want to deploy to. You will also need a configured AWS profile. Open the .typhoonremotes file and modify it to use your profile and S3 bucket.

[test]
aws-profile=myaws
s3-bucket=typhoon-orchestrator

Now that we have a remote called test we are ready to create the zip files and push them to S3. You will need to have docker installed for this step because the dependencies need to be built in an OS that is compatible with the one Lambda is using, otherwise they won’t work. This is a very common source of problems that Typhoon helps you avoid. If you are sure that your OS is compatible you can add the flag --build-deps-locally, but it is generally not recommended.

typhoon dag push --dag-name send_me_a_joke test

This will have taken a long time because Typhoon built all of the dependencies. Don’t worry: updating the workflow code is much faster, since the dependencies are separated into a layer and don’t need to be re-deployed unless they change.

The test at the end tells it what remote to deploy to. In the future we could add a different production environment with its own remote.

If you check your S3 bucket now you’ll find two files:

The lambda code: typhoon_dag_builds/send_me_a_joke/lambda.zip
All the necessary dependencies: typhoon_dag_builds/send_me_a_joke/layer.zip

Deploying infrastructure

For this part you will need to install and set up Terraform, an infrastructure as code tool.

Typhoon automatically creates some terraform files that describe all the necessary infrastructure to create in order to deploy our workflow to AWS Lambda. This greatly simplifies the creation of all the necessary resources that you would otherwise need to create manually. More importantly, it provides you a starting point while also giving you full control to change the terraform files until you have the desired configuration.

For this tutorial you just need to update the test variables file to include the S3 bucket name and some DAG info. We can get the info for all the dags by running typhoon dag info --json-output --indent 2, but in this case we will need to adapt it to include the necessary environment variables. This means that you will need to add the following to the file terraform/test.tfvars.

dag_info = {
    "send_me_a_joke": {
        "schedule_interval": "cron(0 10 * * ? *)",
        "environment": {
            "TYPHOON_VARIABLE_telegram_token": "MY_SECRET_TELEGRAM_TOKEN",
            "TYPHOON_VARIABLE_chat_id": "128332492187641"
        }
    }
}

Notice how the schedule interval is in a different format than the one we defined. This is because Terraform maps to AWS resources, and AWS uses its own flavor of cron expressions which is incompatible with the standard Unix cron expressions used by tools like cron, crontab, Airflow and many more. Typhoon aims to be a framework that can deploy to many platforms (currently supports AWS Lambda and Airflow) so we decided to follow the industry standard instead of AWS’s. Luckily, when we run typhoon dag info ... Typhoon converts it to AWS’s standard so you don’t need to do that yourself!
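To illustrate the difference between the two formats, here is a simplified sketch of such a conversion. It is not Typhoon's converter, and the to_aws_cron name is ours. It only covers the two differences visible above: AWS expects a ? in either the day-of-month or the day-of-week field, and it has an extra year field at the end. Other differences, such as day-of-week numbering, are deliberately left out.

```python
def to_aws_cron(unix_cron: str) -> str:
    """Convert a simple 5-field Unix cron expression to an AWS cron(...) expression.

    Simplified sketch: only adds the '?' placeholder and the year field.
    Day-of-week numbering differences are not translated.
    """
    fields = unix_cron.split()
    if len(fields) != 5:
        raise ValueError(f"Expected 5 fields, got {len(fields)}")
    minute, hour, day_of_month, month, day_of_week = fields
    # AWS requires '?' in one of the two day fields.
    if day_of_week == "*":
        day_of_week = "?"
    elif day_of_month == "*":
        day_of_month = "?"
    else:
        raise ValueError("AWS does not allow both day fields to be set")
    # The trailing '*' is the year field that Unix cron does not have.
    return f"cron({minute} {hour} {day_of_month} {month} {day_of_week} *)"


daily_at_ten = to_aws_cron("0 10 * * *")
```

With the schedule from our DAG this gives the same expression that appears in test.tfvars.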

Now we are ready to create the infrastructure with terraform.

export AWS_PROFILE=my-aws-profile
export AWS_DEFAULT_REGION=eu-west-1
cd terraform
terraform init
terraform plan -var-file=test.tfvars -out=tfplan
terraform apply tfplan

And voila! You can check all of the resources that have been created in AWS and take a moment to appreciate how much time we’ve saved.

Let’s take it for a spin

If everything worked correctly you will get a joke in your telegram chat at 10am, but we don’t want to wait that long, we want to hear one now! You could invoke the Lambda from the AWS console, but we will invoke it with Typhoon.

typhoon dag run --dag-name send_me_a_joke test

Hopefully you got a hilarious joke sent right to your group chat.

This is the same command we used earlier to run the workflow locally, but with test at the end specifying that we want to run it in the remote environment. This has invoked a Lambda and shown you the logs. Actually, to be more precise, it has invoked a Lambda that has then invoked another Lambda, which in turn invoked a third one. Why? Because Typhoon is asynchronous by default, which means that as soon as a function returns or yields a batch we invoke a new Lambda to process it. This is useful because you can have a lot of tasks performing work in parallel. For example, imagine you have a workflow that reads CSV files from an FTP server, zips them up and uploads them to S3. The first task could list all the CSV files on the FTP server and yield each path as a batch. The next task then compresses them, which can take a long time, but since a new Lambda instance is invoked for each batch they are all processed in parallel!
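As a local analogy for this fan-out, the sketch below hands every yielded batch to its own worker in a thread pool. The threads only stand in for Lambda invocations, the files are fake in-memory data, and the function names are hypothetical; it shows the shape of the idea rather than how Typhoon is implemented.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, Tuple


def list_files() -> Iterator[Tuple[str, bytes]]:
    """First task: yield one batch per file (fake in-memory files here)."""
    for i in range(3):
        yield f"/data/file_{i}.csv", f"id,value\n{i},{i * 10}\n".encode()


def compress(batch: Tuple[str, bytes]) -> Tuple[str, bytes]:
    """Second task: compress a single batch."""
    path, data = batch
    return path + ".gz", gzip.compress(data)


# Each yielded batch goes to its own worker, much like a new Lambda per batch.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(compress, list_files()))

paths = [path for path, _ in results]
```

The slow step (compression) never has to wait for the other batches, which is the benefit the asynchronous default is after.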

Notice how even though the workflow ran across three lambdas, you still got the full log in your terminal. Lambdas can be hard to monitor and debug, but Typhoon tries to make this process easier. This is why when you run a Typhoon DAG manually, it waits for a response so that it can print the logs. Every invocation will in turn also wait for the response of any Lambdas it invokes so you will end up with the full log no matter how many Lambda invocations the workflow ran on. It’s extremely useful to be able to see if the DAG is working correctly, but it does introduce synchronicity so the DAG will run slower. We believe it’s a worthwhile tradeoff for manual invocations. Rest assured that when the workflow is triggered on schedule it will run at full speed.

Why can’t I just run everything in one lambda?

Great question, and there’s no reason not to since our workflow is very light and doesn’t benefit from parallelism. You just need to make the first two tasks synchronous by setting asynchronous: false. This is the relevant part of the code:

tasks:
  get_joke:
    function: typhoon.http.get_raw
    asynchronous: false
    args:
      url: https://v2.jokeapi.dev/joke/Programming?blacklistFlags=nsfw,religious,political,racist,sexist,explicit&type=single

  select_joke_text:
    input: get_joke
    function: typhoon.json.search
    asynchronous: false
    args:
      data: !Py $BATCH.response.json()
      expression: joke

Let's build and deploy the code, this time without dependencies by using the flag --code.

typhoon dag push --dag-name send_me_a_joke test --code

Wow, that was much faster! You can see that once the workflow has been deployed one time with all the dependencies, making changes and deploying them is very fast and easy. Feel free to run the DAG again to check out how only one Lambda will be invoked now.

This is too good to be true, can I really build all my ETLs like this?

Yes and no... Depending on your use case Lambda can be a good fit, but there are currently some limitations to this approach:

Lambdas can only run for 15 minutes. If you have a long running task this will not work for you. In the future we intend to support Fargate to run heavier tasks and solve this issue.
Can we really do away with the scheduler? We have shown you a utopian vision of the future of ETLs. It still remains to be seen if we can fully avoid running a scheduler, and we may run into the harsh reality that if you want to be able to implement sensors, rate-limit tasks, etc. we may need a scheduler. Even if that turns out to be true, it would always be opt-in and much simpler than a traditional one.

Does that mean that Typhoon is not ready for prime time?

Absolutely not! We may have a long (albeit exciting) path ahead to realize our vision of a battle tested, fully serverless, asynchronous workflow orchestrator, but AWS is not the only target. Typhoon supports compilation to native Airflow code, the most popular orchestrator around today. This feature can bridge the gap between the simplicity of our vision and the complex reality we currently live in as Data Engineers.

Our hope is that you will use Typhoon and fall in love with the simplicity of our vision, and deploy to Airflow if the current state of AWS deployment can’t meet your needs.

Cleaning up

If you want to clean up all the resources that were created in this tutorial, run the following commands:

terraform plan -var-file=test.tfvars -out=tfplan -destroy
terraform apply -destroy tfplan

Thanks for following along!

If you enjoyed this tutorial we hope to see you soon at https://github.com/typhoon-data-org/typhoon-orchestrator. Check out the code, leave a star, open an issue or come say hi on our discord!

Modern data warehouse patterns: ELT with Snowflake variants

biellls — Sat, 29 Jan 2022 19:02:14 +0000

Leveraging semi-structured data for resilience against schema changes

As data warehouse technologies get cheaper and better, ELT is gaining momentum over ETL. In this article we will show you how to leverage Snowflake's semi-structured data to build integrations that are highly resistant to changes in schema while staying performant. Schema changes are one of the most common things that can break a data pipeline (adding and removing fields, changes in types or length of the data etc.) so it is extremely useful to protect yourself against them.

Real world example: Personal information

Let's assume we have a table with basic information about our clients. The goal is to load the information into snowflake unchanged.

name	surname	age
Anne	Houston	38
John	Doe	22
William	Williams	27

We would usually create the following table in Snowflake:

CREATE TABLE clients (name VARCHAR, surname VARCHAR, age NUMBER);

Notice how we don't specify the varchar's length or the number's precision and scale. This is preferable because snowflake will automatically use the minimum size needed to store the data efficiently, and if the source system changes the length of a varchar or the precision of a number, your flows won't break. The exception is a number with decimals, for which we need to specify a precision and scale.

But if we do that, our integration will fail if a field is removed, and if a field is added we won't notice. We are not resilient to schema changes. To solve that we will instead create a table with just one variant field where we will load all the data, no matter what fields it has.

CREATE TABLE clients_raw (src VARIANT);

In order to load the data we can dump it as JSON into a stage.

CREATE OR REPLACE FILE FORMAT json_format TYPE = JSON;
CREATE OR REPLACE STAGE mystage FILE_FORMAT = json_format;

Let's create a file with some JSON data to load into the table. Run the following in a shell:

echo '{"name": "Anne", "surname": "Houston", "age": 38}' > /tmp/data.json
echo '{"name": "John", "surname": "Doe", "age": 22}' >> /tmp/data.json
echo '{"name": "William", "surname": "Williams", "age": 27}' >> /tmp/data.json

Next we run this in snowflake to upload the data to a stage and then load the data into the table:

put file:///tmp/data.json @mystage;
COPY INTO clients_raw FROM @mystage/data.json FILE_FORMAT = json_format;

We can now query the data as:

SELECT min(src:age) as age from clients_raw;

Creating a view

It is easy to query the data, but it can be verbose and a little confusing to analysts who have never worked with semi-structured data. In order to make it transparent to the end user we can create a view that turns it into structured data.

CREATE VIEW clients AS
SELECT
    src:name::VARCHAR AS name,
    src:surname::VARCHAR AS surname,
    src:age::NUMERIC AS age
FROM clients_raw;

The same query from before would now be:

SELECT min(age) as age from clients;

And now it's indistinguishable from a structured table from the user’s point of view.

Removing a column, adding a column

Suppose that a database admin realizes that storing age in a column is not ideal, since it needs to be updated every time a client has a birthday. Instead they decide to drop the age column and store each client's date of birth. The new table is as follows:

name	surname	birthday
Anne	Houston	1984-03-12
John	Doe	2000-01-03
William	Williams	1995-02-04

Let's create the new data:

echo '{"name": "Anne", "surname": "Houston", "birthday": "1984-03-12"}' > /tmp/data.json
echo '{"name": "John", "surname": "Doe", "birthday": "2000-01-03"}' >> /tmp/data.json
echo '{"name": "William", "surname": "Williams", "birthday": "1995-02-04"}' >> /tmp/data.json

We would usually append the data, but to make this tutorial simple we will just replace the old data with the new one.

put file:///tmp/data.json @mystage OVERWRITE = TRUE;
TRUNCATE TABLE clients_raw;
COPY INTO clients_raw FROM @mystage/data.json FILE_FORMAT = json_format;

Since we store all available data as a variant our integration will not break. The view would not break either, but the age would show as null (try SELECT * FROM clients). The only thing we need to do to take advantage of the new field is to update the view. For backwards compatibility we will still include the age as a calculation.

CREATE OR REPLACE VIEW clients AS
SELECT
    src:name::VARCHAR as name,
    src:surname::VARCHAR as surname,
    src:birthday::DATE as birthday,
    DATEDIFF('years', src:birthday::DATE, CURRENT_DATE()) as age
FROM clients_raw;

That's it, our pipelines never broke and there’s no need to change our data flows or source table definitions!
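One caveat on the calculated age: as far as we understand Snowflake's DATEDIFF, it counts the number of year boundaries crossed rather than the full years elapsed, so the result can be one year too high before a client's birthday in the current year. If exact ages matter, the usual correction is to subtract one when this year's birthday has not happened yet. A minimal Python sketch of that logic follows; the age_on function and the chosen reference date are ours, for illustration only.

```python
from datetime import date


def age_on(birthday: date, today: date) -> int:
    """Full years elapsed between birthday and today."""
    years = today.year - birthday.year  # what a year-boundary count gives
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


anne_birthday = date(1984, 3, 12)
reference_day = date(2022, 1, 29)  # arbitrary reference date for the example

boundary_count = reference_day.year - anne_birthday.year
exact_age = age_on(anne_birthday, reference_day)
```

For a backwards-compatible column the simpler boundary count may well be good enough; the sketch is only meant to make the difference visible.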

If a new field gets added to the table and no one notices, it still gets staged into snowflake in the variant. The moment someone requests the field in the view we'll be able to see it, without needing to backfill the data.
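The same behaviour can be imitated in a few lines of Python, treating each raw row as a JSON document and the view as a projection over it. This is only an analogy under our own assumptions (the view function and VIEW_FIELDS are hypothetical), but it shows why a removed field turns into a null and a new field is available as soon as the view asks for it.

```python
import json

# Raw rows as they would sit in the variant column, one JSON document each.
old_rows = [json.loads('{"name": "Anne", "surname": "Houston", "age": 38}')]
new_rows = [json.loads('{"name": "Anne", "surname": "Houston", "birthday": "1984-03-12"}')]

VIEW_FIELDS = ["name", "surname", "age"]  # the first version of the view


def view(rows, fields):
    """Project each raw document onto the view's fields; missing keys become None."""
    return [{field: row.get(field) for field in fields} for row in rows]


before = view(old_rows, VIEW_FIELDS)
after = view(new_rows, VIEW_FIELDS)  # age is now None, but nothing fails
upgraded = view(new_rows, ["name", "surname", "birthday"])  # the new field was stored all along
```

Nothing about loading the rows had to change between the two schemas; only the projection did.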

Doesn't this take up more space than regular tables? Isn't it slower to query?

This excerpt from Snowflake's docs answers the question:

For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.
For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

So in terms of performance and storage it should be really similar albeit a little slower. An exception would be if we need to query the birthday because it's stored as a string, as we will see in the following section.

Improving performance

Because variants store dates as strings, they are not as efficient to filter by. This is only an issue if the table is large and you intend to query the table by that date. Let's see an example of how to improve performance in that case:

CREATE OR REPLACE TABLE clients_raw (src VARIANT, birthday DATE);
COPY INTO clients_raw FROM (
    select
        $1 as src,
        to_date($1:birthday)::DATE AS birthday 
    FROM @mystage/data.json
) FILE_FORMAT = json_format;

And modify the view:

CREATE OR REPLACE VIEW clients AS
SELECT
    src:name::VARCHAR as name,
    src:surname::VARCHAR as surname,
    birthday,     // <-- We changed this to get the column directly
    DATEDIFF('years', birthday, CURRENT_DATE()) as age  // <-- Here too
FROM clients_raw;

Now queries filtering by birthday (or getting MAX(birthday) for example) will be much faster.

What is the best way to load the data?

The most efficient way to load the data into a table is by using a COPY command since Snowflake can optimize a bulk load. It can't do that with insert statements. Here are some of the most popular ways to load the data into snowflake, each with their advantages and disadvantages:

CSV: A gzipped CSV is the fastest way to load structured data into snowflake. It takes more space than parquet. It cannot be loaded into a variant as easily as JSON, although it is possible with the right casting.
JSON: Can be easily loaded into a variant, but it takes a lot of space in your data lake.
Avro: Built-in schema, easily loaded into a variant or into a structured table. Takes more space than parquet.
Parquet: Columnar storage that has better compression than the other options and can easily be loaded into a structured table or a variant. It is slower than CSV to load into a structured table.

Bottom Line

Loading data into Snowflake using this method is a great way to save you a lot of headaches and minimise data pipeline failures. It is a good rule of thumb to always use this method unless you will be loading an extremely large amount of data and need the extra 20% performance that a structured table will give you. If you decide to create a structured table instead of using this method be aware that the pipelines will break on any schema change.

What are the best tools to load data like this?

Any ETL/ELT tool that is flexible enough can be adapted to use this method, Airflow for instance. You can also check out our ETL tool that encourages this pattern and other modern best practices for data engineering: https://github.com/typhoon-data-org/typhoon-orchestrator.

Sources

https://www.snowflake.com/wp-content/uploads/2015/06/Snowflake_Semistructured_Data_WP_1_0_062015.pdf

https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure

Standing on the shoulders of giants. Part one: Airflow

biellls — Fri, 21 Jan 2022 20:01:29 +0000

Airflow advanced the state of the art in ETL tools by providing an extremely flexible and reliable framework. It is easy to monitor your jobs and you can extend it with plugins to do anything that python can do. It also helped introduce the concept of functional batch data pipelines. By removing state from pipelines and enforcing strict boundaries on partitions of time you can more easily reprocess a partition of data without affecting the rest of it. If you're unfamiliar with that concept, this article by Airflow's creator Maxime Beauchemin is worth a read.

With Typhoon we aim to build on this concept to provide a framework with a stronger focus on software engineering principles. We will illustrate this in the following sections.

Built for developers

Typhoon was built from the ground up to provide a great experience for developers. Besides providing great Intellisense it helps you implement software best practices.

Testable

Airflow is notoriously hard to test. Operators force coupling between logic, execution context and framework. Let's look at a really simple example:

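from typing import List, Optional

from airflow.models import BaseOperator
from airflow.providers.http.hooks.http import HttpHook


class ExchangeRates(BaseOperator):
    def __init__(self, base: str, http_conn_id: str,
                 symbols: Optional[List[str]] = None, **kwargs):
        super().__init__(**kwargs)
        self.base = base
        self.symbols = symbols
        self.http_conn_id = http_conn_id

    def execute(self, context):
        # The date range comes from the Airflow execution context
        start_at = context['execution_date']
        end_at = context['next_execution_date']
        params = {'start_at': start_at, 'end_at': end_at}
        full_endpoint = f'{ENDPOINT}/history'  # ENDPOINT is defined elsewhere
        print(f'Calling endpoint {full_endpoint} for dates between {start_at}, {end_at}')
        if self.base:
            params['base'] = self.base
        if self.symbols:
            params['symbols'] = self.symbols
        # Creating the hook looks the connection up in the Airflow metadata database
        hook = HttpHook('get', http_conn_id=self.http_conn_id)
        # For GET requests HttpHook sends `data` as query parameters
        response = hook.run(full_endpoint, data=params)
        context['task_instance'].xcom_push('response', response.json())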

All your logic is in the execute function of your operator, so in order to run a test you need to import airflow and create an instance of the operator. Not only that, but you need to provide a context similar to the one that airflow would provide. Finally, you would need to mock xcom and check that it's called with the value you expect. This is only a simple example, but it can get much more complex once there is a source and a destination in the same component, magic macro rendering and more. Just in case this doesn't sound complex enough, notice we create a hook from its connection id. Yeah, you'll need to mock the airflow database too or spin up a temporary one. Good luck with that.

In contrast, the logic for typhoon tasks lives inside regular python functions. They don't make use of the framework unless they use a hook and even then it can be instantiated without a metadata database.

from datetime import datetime
from typing import List, Optional

import requests

# HTTPHook (a Typhoon hook) and ENDPOINT are assumed to be defined elsewhere


def get_exchange_rates(
        hook: HTTPHook,
        start_at: datetime,
        end_at: datetime,
        base: Optional[str] = None,
        symbols: Optional[List[str]] = None,
) -> dict:
    # Query parameters for the requested date range
    params = {
        'start_at': start_at,
        'end_at': end_at,
    }
    full_endpoint = f'{ENDPOINT}/history'
    print(f'Calling endpoint {full_endpoint} for dates between {start_at}, {end_at}')
    if base:
        params['base'] = base
    if symbols:
        params['symbols'] = symbols
    response = requests.get(full_endpoint, params=params)
    return response.json()

Testing this is as easy as:

def test_xr_get_history():
    symbols = ['EUR', 'PHP', 'HKD']
    start_at = date(2020, 1, 2)
    end_at = date(2020, 1, 3)
    hook = HTTPSHook(ConnParams(conn_type='https_hook', extra={'method': 'get'}))
    response = exchange_rates_api.get_history(
        hook=hook,
        start_at=start_at,
        end_at=end_at,
        base='USD',
        symbols=symbols,
    )
    print(response)
    assert set(response.keys()) == {'rates', 'start_at', 'end_at', 'base'}
    assert set(response['rates'].keys()) == {start_at.isoformat(), end_at.isoformat()}
    for k, v in response['rates'].items():
        assert set(v.keys()) == set(symbols)

We don't need to import the framework, mock anything or have a database running. We just give it some input and test the output. This takes the functional aspect in functional data pipelines even further.

Composable

Composability is one of the principles of good software engineering because it enables you to reuse existing functions or objects in order to achieve new behaviour. Airflow gets in the way of that by coupling context, as we explained in the previous section, but also by encouraging task isolation. Tasks can't pass data between them, only some metadata through XCom and even that is discouraged. That means that you can't have an FTPExtractOperator and an S3LoadOperator, you need an FTPToS3Operator and every other possible combination of sources and destinations. This does not compose well as you end up with a lot of repeated code across different operators just because you can't easily reuse the logic.

In Typhoon, tasks can pass any data between them without any performance penalty. You can have a function that extracts data from a source and another one that loads it into a destination. You can reuse those functions in any other DAG that uses that source or destination.

name: example
schedule_interval: rate(1 day)

tasks: 
    extract_files:
        component: typhoon.get_data_from_files
        args:
            hook: !Hook my_ftp
            pattern: /base/path/*.csv

    load_files:
        input: extract_files
        function: typhoon.filesystem.write_data
        args:
            hook: !Hook my_s3
            data: !Py $BATCH.data
            path: !MultiStep
                - !Py typhoon.files.name($BATCH.path)
                - !Py f'/some/path/{$1}'
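Stripped of the framework, the DAG above boils down to two plain functions with batches flowing between them. The sketch below mimics that with dictionaries standing in for the FTP server and S3; the function bodies are our own simplified stand-ins that borrow the names from the YAML, not Typhoon's real implementations.

```python
import posixpath
from typing import Dict, Iterator, NamedTuple


class FileBatch(NamedTuple):
    path: str
    data: bytes


def get_data_from_files(source: Dict[str, bytes], suffix: str) -> Iterator[FileBatch]:
    """Extract: yield one batch per matching file in the (fake) source."""
    for path, data in sorted(source.items()):
        if path.endswith(suffix):
            yield FileBatch(path, data)


def write_data(destination: Dict[str, bytes], data: bytes, path: str) -> str:
    """Load: write the data to the (fake) destination and return the path."""
    destination[path] = data
    return path


fake_ftp = {"/base/path/a.csv": b"1,2\n", "/base/path/notes.txt": b"skip me"}
fake_s3: Dict[str, bytes] = {}

for batch in get_data_from_files(fake_ftp, ".csv"):
    # Mirrors the !MultiStep argument: take the file name, then build the target path.
    name = posixpath.basename(batch.path)
    write_data(fake_s3, batch.data, f"/some/path/{name}")
```

Because extract and load never reference each other, either one can be paired with a different source or destination without writing a new combined operator.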

Extensible

There are several ways in which the framework facilitates extension.

Just python. One of typhoon's goals is to be easily extensible with regular python code. You can create python functions and call them in your DAGs.
Interfaces. Hooks are grouped into interfaces in a lot of cases where it makes sense to make them interchangeable. This means you can easily switch a hook that writes to files in your OS for local development into an S3 hook for the integration tests and production. More importantly, since a lot of functions take a hook of a specific interface, if you create a new hook that conforms to that interface it will automatically be compatible with all those functions.
Natively support additional connection types. When you create a new kind of hook and give it a conn_type, this will be used to discriminate the class when a hook instance is created from a connection defined in the metadata.
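The last two points can be pictured with a small sketch. Everything in it is hypothetical (the FileSystemHook interface, the two hook classes and the HOOKS registry are invented for illustration and are not Typhoon's classes), but it shows how a function written against an interface accepts any conforming hook, and how a conn_type can be used to pick the class.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class FileSystemHook(ABC):
    """Hypothetical interface: anything that can write bytes to a path."""
    conn_type: str

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryHook(FileSystemHook):
    conn_type = "memory"

    def __init__(self):
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data


class UppercaseHook(FileSystemHook):
    """A 'new' hook: it works with write_report without changing that function."""
    conn_type = "upper"

    def __init__(self):
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data.upper()


# conn_type picks the class when a hook is built from connection metadata.
HOOKS: Dict[str, Type[FileSystemHook]] = {
    cls.conn_type: cls for cls in (InMemoryHook, UppercaseHook)
}


def write_report(hook: FileSystemHook, path: str) -> None:
    """Depends only on the interface, never on a concrete hook."""
    hook.write(path, b"hello")


hook = HOOKS["upper"]()
write_report(hook, "/tmp/report.txt")
```

Swapping "upper" for "memory" changes where and how the data lands without touching write_report, which is the kind of interchangeability described above.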

Quick feedback

Typhoon aims to provide a lightning fast feedback loop on all steps of the DAG creation process: debug hooks that print whatever is passed to them, interchangeable hooks so you can easily develop, test and deploy against different targets, and the ability to run the whole DAG from the command line instead of needing to schedule it or run independent tasks.

Debugging

Typhoon is designed from the ground up to be easy to debug and it achieves this by compiling to regular python that can be executed locally and debugged from your favorite IDE.

Try it out!

If you're curious about what the future of data pipelines could look like, check out Typhoon at https://github.com/typhoon-data-org/typhoon-orchestrator.

Compression: Clearing the Confusion on ZIP, GZIP, Zlib and DEFLATE

biellls — Wed, 23 Oct 2019 21:14:13 +0000

Introduction

Roughly a month ago I found myself puzzled at work as to why 7-ZIP in Windows could not recognize GZIP files that we compressed in python. We used the zlib library, which claims to be "Compression compatible with GZIP." It seemed there was more than meets the eye.

After a few google searches that left me more confused about what was going on than I initially was, I decided to just open a subprocess and use the GNU implementation of GZIP. The final code wasn't too long and compressed in a way that 7-ZIP as well as Snowflake were able to detect automatically. The implementation was quite succinct, so I didn't give it much thought.



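import logging
import subprocess


def gnu_zip(data: bytes) -> bytes:
    # Pipe the data through the system gzip binary, capturing stdout and stderr
    p = subprocess.run(['gzip'], input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    if p.returncode != 0 or p.stderr:
        logging.error(p.stderr)
        raise Exception('Error compressing data')

    return p.stdout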

Still, something bugged me. How did zlib and GZIP relate? Why did zlib claim to be compatible with GZIP when it was clearly not the same format? We also noticed that even though Snowflake was unable to detect the zlib compressed file, if we told it that it's a GZIPped file it was able to load it without issues. Clearly the claim of compatibility wasn't completely outlandish. After a few weeks I decided to dive in and investigate in depth. The following is a summary of what I learned.

DEFLATE

I was surprised to find out that GZIP, zlib and even ZIP are not compression algorithms; they are file formats that can permit different compression algorithms. Even more surprising, virtually every implementation of those three uses the same lossless data compression algorithm. This algorithm is called DEFLATE.

So are GZIP, zlib and ZIP actually the same? Not quite, and the final size of the compressed file can vary significantly between ZIP and the other two for reasons we will see below, but under the hood the actual compression is done in exactly the same way.

How does DEFLATE work? In short, it takes some input data as a stream consisting of a series of blocks of data, then uses a combination of the LZ77 algorithm and Huffman coding on each block.

LZ77 identifies duplicate strings and replaces them with a back reference, which is a pointer to the place where it previously appeared, followed by the length of the string. This is done on the raw data blocks. For a more detailed explanation and example see Bonus section 1.

Huffman coding is a form of bit reduction: it identifies the most commonly used symbols and represents them with shorter bit sequences, while infrequently used symbols are represented with longer bit sequences. This is done on the LZ77 compressed blocks. For a more detailed explanation and example see Bonus section 2.

See the following link for a more in depth explanation of deflate: https://zlib.net/feldspar.html

ZIP

Released in 1989 and written by Phil Katz, ZIP is the oldest of the compression formats discussed in this article. It is also unique in that it's an archive file format, meaning it can compress multiple files and entire directory structures.

ZIP applies DEFLATE compression separately to each file it stores and then keeps a central directory structure at the end. This means that it can provide random access to each file which can be read separately, but the final size is larger since the compression does not take advantage of redundancy across files.
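The per-file layout described above can be seen with a minimal sketch using Python's standard zipfile module. The member names and contents here are made up for illustration.

```python
import io
import zipfile

# Build a small ZIP archive in memory with two members (hypothetical names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.txt", b"hello " * 1000)
    archive.writestr("docs/b.txt", b"world " * 1000)

# Reopen it: the central directory lists every member without decompressing any data.
with zipfile.ZipFile(buffer) as archive:
    members = archive.infolist()
    for info in members:
        # Each member carries its own sizes and its own CRC-32.
        print(info.filename, info.file_size, info.compress_size, hex(info.CRC))
    # Random access: only this one member is read and decompressed.
    second = archive.read("docs/b.txt")
```

Each member is compressed on its own, which is what makes reading a single file cheap and also why repeated content across members is not shared.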

Finally, it also includes a CRC-32 checksum for data integrity.

GZIP

After some patent disputes with the unix compress utility, the GZIP format was developed in 1992 by Jean-loup Gailly and Mark Adler using a new implementation of DEFLATE that did not infringe on any patents.

Unlike ZIP, it is not an archive format. This means it cannot compress several files or directories; it just compresses a single file or stream of data. That's why it's frequently combined with the tar utility, which creates an archive of files, directories and their attributes in a single file that is then compressed with GZIP. This popular format is called a tarball and its files end in .tar.gz. Tarballs do not provide random access to the files they contain; instead the whole file needs to be read and decompressed before the directory structure can be shown.

The GZIP format wraps the compressed data with a header containing the filename and other system information, and a CRC-32 checksum at the end to check the integrity of the data. On the other hand, the final size of a tarball is usually smaller than the equivalent ZIP since the compression does take advantage of redundancy across files.



import gzip
from io import BytesIO


def gzip_data(data: bytes) -> bytes:
    out = BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as f:
        f.write(data)
    return out.getvalue()
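The size difference between a tarball and a ZIP archive can be checked with a small experiment. This is only a sketch under artificial assumptions: twenty hypothetical files that all share the same pseudo-random payload, which exaggerates the effect of redundancy across files.

```python
import io
import random
import tarfile
import zipfile

# Twenty hypothetical files sharing the same hard-to-compress payload.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(1000))
files = {f"file_{i}.bin": payload for i in range(20)}


def make_zip(files: dict) -> bytes:
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)  # each member is deflated separately
    return out.getvalue()


def make_tar_gz(files: dict) -> bytes:
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))  # one stream, gzipped as a whole
    return out.getvalue()


zip_size = len(make_zip(files))
tar_gz_size = len(make_tar_gz(files))
print(f"zip: {zip_size} bytes, tar.gz: {tar_gz_size} bytes")
```

With real data the gap is usually far less dramatic, and it only appears when similar content sits close enough together in the tar stream for DEFLATE to reference it.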

Zlib

The authors of GZIP later extracted its DEFLATE implementation into a library named zlib so it could be reused by other formats, most notably PNG images. PNG replaced the GIF format, which was plagued with the same patent issues as unix compress. zlib is the most popular DEFLATE implementation and is used by many existing programs. Most HTTP servers use zlib to compress their data.

But that's not all: zlib has the option to use a GZIP wrapper on the compressed data or a lighter zlib wrapper. This means that apart from being a library, zlib can also be considered a compression format that has separate headers from other formats. This is the reason why our files were compatible with GZIP encoding but Snowflake couldn't auto detect them as GZIP, since they had zlib headers. This format is a light wrapper over raw deflate: a two-byte header, and an Adler-32 checksum at the end instead of the CRC-32 used by GZIP.



import zlib


def zlib_data(data: bytes) -> bytes:
    return zlib.compress(data)
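To see that the wrappers differ while the payload is still DEFLATE, here is a minimal sketch that uses the wbits parameter of Python's zlib module to select the wrapper. The helper name deflate and the sample data are illustrative only.

```python
import gzip
import zlib


def deflate(data: bytes, wbits: int) -> bytes:
    # wbits selects the wrapper: -15 raw DEFLATE, 15 zlib, 31 (16 + 15) gzip.
    compressor = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


data = b"the quick brown fox jumps over the lazy dog " * 50

raw = deflate(data, -15)           # no header, no checksum
zlib_wrapped = deflate(data, 15)   # 2-byte header, 4-byte Adler-32 trailer
gzip_wrapped = deflate(data, 31)   # 10-byte header, CRC-32 and length trailer

print(zlib_wrapped[:2].hex(), gzip_wrapped[:2].hex())

# Stripping either wrapper leaves a raw DEFLATE stream that inflates to the input.
assert zlib.decompress(zlib_wrapped[2:-4], -15) == data
assert zlib.decompress(gzip_wrapped[10:-8], -15) == data

# The trailers hold the checksum of the uncompressed data.
assert int.from_bytes(zlib_wrapped[-4:], "big") == zlib.adler32(data)
assert int.from_bytes(gzip_wrapped[-8:-4], "little") == zlib.crc32(data)

# Passing 32 + 15 asks zlib to auto-detect a zlib or gzip header when decompressing.
assert zlib.decompress(zlib_wrapped, 47) == zlib.decompress(gzip_wrapped, 47) == data
```

A reader that only sniffs for the gzip magic bytes will not recognize the zlib-wrapped output, even though the compressed payload inside is the same kind of stream.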

Comparison

	zlib	GZIP	ZIP
Headers	78 01 / 78 9C / 78 DA	1F 8B	50 4B 03 04
Compression algorithm	DEFLATE	DEFLATE	DEFLATE
Checksum	Adler-32	CRC-32	CRC-32
Lossless?	Yes	Yes	Yes
Data	Stream / single file	Stream / single file	Archive files and directories

You can check the file type with the following code:



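from typing import Optional

# Magic bytes (as hex) found at the start of each format
HEADERS = (
    ('zlib-no-compression', '7801'),
    ('zlib-default-compression', '789c'),
    ('zlib-best-compression', '78da'),
    ('gzip', '1f8b'),    # 1f8b08 if it's using deflate (almost always)
    ('zip', '504b0304'),
)


def compression_type(data: bytes) -> Optional[str]:
    # Only the first four bytes are needed to match the longest header
    prefix = data[:4].hex()
    for compression, header in HEADERS:
        if prefix.startswith(header):
            return compression
    return None  # unknown or uncompressed data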

Bonus 1: LZ77 Algorithm

LZ77 is a lossless compression algorithm that replaces a sequence of symbols which has already appeared with a pointer to the place it last appeared and a number indicating the length of the sequence. Such a back reference can be written as a pair (B, L), where B is the pointer that indicates how many symbols ago the sequence appeared, and L is the sequence length.

LZ77 does not keep a dictionary of sequences (in contrast to LZ78), but instead uses a sliding window to search for them. This means that it only looks back inside the data up to a fixed distance (the window); in DEFLATE that distance is at most 32 KB.
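The idea can be sketched with a toy encoder and decoder. This is not how zlib implements it (real implementations use hash chains and encode the tokens as bits); the function names, the tiny window and the minimum match length are assumptions chosen to keep the example readable. A token is either a literal byte or a back reference holding B and L.

```python
from typing import List, Tuple, Union

Token = Union[int, Tuple[int, int]]  # a literal byte, or a (B, L) back reference


def lz77_encode(data: bytes, window: int = 32, min_length: int = 3) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        best_length, best_distance = 0, 0
        # Search only inside the sliding window that ends at position i.
        for start in range(max(0, i - window), i):
            length = 0
            # A match may run past i and overlap the data being encoded.
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_length, best_distance = length, i - start
        if best_length >= min_length:
            tokens.append((best_distance, best_length))
            i += best_length
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens: List[Token]) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte by byte, so overlapping copies work
        else:
            out.append(token)
    return bytes(out)


tokens = lz77_encode(b"abcabcabcd")
print(tokens)  # [97, 98, 99, (3, 6), 100]
assert lz77_decode(tokens) == b"abcabcabcd"
```

The back reference (3, 6) means "go back 3 symbols and copy 6", which is longer than the distance itself: the copy reads bytes it has just written, and that is how short repeating patterns compress so well.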

Bonus 2: Huffman codes

Huffman coding is also a lossless compression algorithm. The main idea is that symbols that appear more often should be encoded with fewer bits than symbols that appear rarely, resulting in a shorter file overall.

A leaf node is created for each symbol. Starting with the two least frequent symbols, a tree is built by joining them together under a parent node whose value is the sum of their frequencies. This process is repeated, always joining the two least frequent symbols or subtrees, until we get a final tree whose leaves are the symbols. To encode a symbol we traverse the tree from the root, where each left branch represents a 0 and each right branch represents a 1, until we reach the leaf node representing the symbol.

Example:

(abridged from https://www.thecrazyprogrammer.com/2014/09/huffman-coding-algorithm-with-example.html)

Symbol	Frequency
F	12
D	15
C	7
E	4
A	5
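The construction can be sketched for the frequencies in the table with a priority queue. The function name huffman_codes is made up for this example, and the exact bit patterns depend on which child is placed on the left, so only the code lengths should be read as meaningful.

```python
import heapq
from itertools import count
from typing import Dict


def huffman_codes(frequencies: Dict[str, int]) -> Dict[str, str]:
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    # Repeatedly join the two least frequent nodes under a new parent.
    while len(heap) > 1:
        freq_a, _, left = heapq.heappop(heap)
        freq_b, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), (left, right)))

    codes: Dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")  # left branch adds a 0
            walk(node[1], prefix + "1")  # right branch adds a 1
        else:                            # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


frequencies = {"F": 12, "D": 15, "C": 7, "E": 4, "A": 5}
codes = huffman_codes(frequencies)
total_bits = sum(frequencies[s] * len(code) for s, code in codes.items())
print(codes, total_bits)
```

For this table the three most frequent symbols (F, D and C) end up with 2-bit codes and the two rarest (E and A) with 3-bit codes, for a total of 95 bits, compared with 129 bits if every one of the 43 symbols used a fixed 3-bit code.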

Other references

Some great Stack Overflow answers by Mark Adler, co-author of GZIP and zlib.

https://stackoverflow.com/questions/19120676/how-to-detect-type-of-compression-used-on-the-file-if-no-file-extension-is-spe/19127748#19127748