Forem: Nazli Ander

Realtime Analysis of Cryptocurrency Prices Using dbt, Materialize, Redpanda & Metabase

Nazli Ander — Mon, 07 Mar 2022 20:24:51 +0000

A while ago, Materialize organized an Online Hack Day and provided a structured streaming setup built on dbt (transformation framework), Redpanda (queue), Materialize (database) and Metabase (visualization).

The initial setup was using flight data with OpenSky API to aggregate flight information. I re-purposed the structured streaming setup to use cryptocurrency data in a real-time dashboard.

Without changing much of the setup, it was easy to create a new producer for the CoinRanking API. Then I created two financial queries in dbt to answer the following questions in real time:

What is the difference in market cap within the last 20 minutes per cryptocurrency?
What are the most deviating cryptocurrencies within the last 20 minutes?
What are the average prices per cryptocurrency within the last 20 minutes?

I connected the queries (materialized views in Materialize) to Metabase, to visualize my answers. Here is a screenshot from the resulting example dashboard in Metabase:

In this short write-up, I will summarize the pipeline behind this analysis. I aim to show how easy it is to create a real-time financial analysis with it.

Pipeline Summary

The pipeline that was created by Materialize had the following chronological order, and I re-purposed the same structure by changing the producer:

Ingest real-time data with a Python producer into RedPanda
Create a source from RedPanda in Materialize using dbt (raw data ingestion)
Create a staging view in Materialize to type-cast JSON fields (staging ingested data)
Create materialized view(s) in Materialize to produce real-time windowed aggregations
Use Metabase to visualize data

The technologies relate to each other as the following diagram suggests:

The setup is initialized with a very diligent docker-compose file, created by @morsapaes. After running docker-compose up, we have the following services running on our local machine:

IMAGE                                               COMMAND                  PORTS                                                                NAMES
sample_project_data-generator                       "python -u ./crypto.…"                                                                        data-generator
sample_project_dbt                                  "/bin/bash"              0.0.0.0:8002->8080/tcp                                               dbt
metabase/metabase                                   "/app/run_metabase.sh"   0.0.0.0:3030->3000/tcp                                               metabase
docker.vectorized.io/vectorized/redpanda:v21.11.3   "/entrypoint.sh 'red…"   0.0.0.0:8081-8082->8081-8082/tcp, 0.0.0.0:9092->9092/tcp, 9644/tcp   redpanda
materialize/materialized:v0.20.0                    "tini -- materialize…"   0.0.0.0:6875->6875/tcp                                               materialized

CoinRanking Producer in a Nutshell

To fetch real-time CoinRanking data into RedPanda, we need a small script. The script needs to read data from CoinRanking periodically and then flush it into the RedPanda instance. Since RedPanda is Kafka-compatible, this is quite straightforward with a Python Kafka client.

The producer script reads data from the CoinRanking API using a token. Then it picks the required fields from the API response. Lastly, it ingests the list of coin information into RedPanda every 10 seconds. Here is the compact script:

import logging
import os
import time
import uuid
from datetime import datetime
from json import dumps

import requests
import schedule
from kafka import KafkaProducer

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)-7s %(levelname)-1s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[
        logging.StreamHandler(),
    ],
)
COIN_PAGE = "https://api.coinranking.com/v2/coins"
CRYPTO_TOPIC = "crypto"

TOKEN = os.getenv(
    "TOKEN",
    ".",
)
HEADERS = {"Authorization": f"access_token {TOKEN}"}
FREQUENCY_INGESTION = 10


def produce_list_of_coin_dict_into_queue(list_of_dict: list) -> None:
    producer = KafkaProducer(bootstrap_servers="redpanda:9092")
    for coin_with_model in list_of_dict:
        try:
            producer.send(
                topic=CRYPTO_TOPIC,
                value=dumps(coin_with_model).encode("utf-8"),
                key=uuid.uuid4().hex.encode("utf-8"),
            )
        except Exception as e:
            logging.error(f"The problem is: {e}!")
    producer.flush()


def get_json_api(page: str) -> tuple:
    get_request = requests.get(page, headers=HEADERS)
    assert get_request.status_code == 200, "Request not successful"
    return get_request.json(), get_request.status_code


def get_coin_model(coin: dict) -> dict:
    try:
        return {
            "uuid": coin.get("uuid"),
            "name": coin.get("name"),
            "symbol": coin.get("symbol"),
            "btc_price": coin.get("btcPrice"),
            "last_24h_volume": coin.get("24hVolume"),
            "marketcap": coin.get("marketCap"),
            "price": coin.get("price"),
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        }
    except Exception as e:
        logging.error(f"Exception: {e}")
        return {}


def coins_producer() -> None:
    all_coins, _ = get_json_api(COIN_PAGE)
    coins_with_model = get_all_coins_with_model(all_coins)
    produce_list_of_coin_dict_into_queue(coins_with_model)


if __name__ == "__main__":
    coins_producer()
    schedule.every(FREQUENCY_INGESTION).seconds.do(coins_producer)
    while True:
        schedule.run_pending()
        time.sleep(1)
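
The compact script calls get_all_coins_with_model without showing it. A minimal sketch of what such a helper could look like follows. It assumes the API response nests the coin list under data.coins, and it uses a trimmed stand-in for get_coin_model so that the snippet runs on its own. Both are assumptions for illustration, not the code from the repository.

```python
def get_coin_model(coin: dict) -> dict:
    # Trimmed stand-in for the get_coin_model function of the producer script
    return {
        "uuid": coin.get("uuid"),
        "symbol": coin.get("symbol"),
        "price": coin.get("price"),
    }


def get_all_coins_with_model(all_coins: dict) -> list:
    # Assumed response shape: {"status": ..., "data": {"coins": [...]}}
    coins = (all_coins.get("data") or {}).get("coins") or []
    models = (get_coin_model(coin) for coin in coins)
    # get_coin_model returns {} on failure, so empty models are dropped
    return [model for model in models if model]


sample_response = {
    "status": "success",
    "data": {"coins": [{"uuid": "abc", "symbol": "BTC", "price": "42000.5"}]},
}
coins_with_model = get_all_coins_with_model(sample_response)
```

With a response of the assumed shape, coins_with_model is a list with one dictionary per coin, ready to be passed to produce_list_of_coin_dict_into_queue.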

By using this producer, we regularly ingest crypto data into a RedPanda topic called crypto. We expect to asynchronously consume the ingested data from the crypto topic and use it in our analyses.

RedPanda Source in Materialize

The project uses the dbt adapter for Materialize. The adapter (dbt-materialize) is a Python package available on PyPI. This package allows us to use SQL + Jinja statements to efficiently transform streaming data and continuously update our data.

In Materialize terminology, a connected data source is called a source. Creating a source in Materialize is possible by introducing a queue connection with a simple DDL statement.

Since we make use of dbt, this is as easy as using the following lines:

{{
    config(
        materialized='source',
        tags=['crypto']
    )
}}

{% set source_name %}
    {{ mz_generate_name('rc_coins') }}
{% endset %}

CREATE SOURCE {{ source_name }}
FROM KAFKA BROKER 'redpanda:9092' TOPIC 'crypto'
  KEY FORMAT BYTES
  VALUE FORMAT BYTES
ENVELOPE UPSERT;

After defining the source, the ingested data is available in the database as raw bytes. We still need to decode the bytes as UTF-8 text and cast the text to JSON (jsonb). This is required for column mapping, as our data is stored as JSON objects. In the end, we can map the column values by accessing and casting the object values:

{{
    config(
        materialized='view',
        tags=['crypto']
    )
}}

WITH converted_casted AS (
    SELECT 
        CAST(CONVERT_FROM(data, 'utf8') AS jsonb) AS data
    FROM {{ ref('rc_coins') }}
)
SELECT
    (data->>'uuid')::string as uuid,
    (data->>'name')::string as name,
    (data->>'symbol')::string as symbol,
    (data->>'btc_price')::numeric as btc_price,
    (data->>'last_24h_volume')::numeric as last_24h_volume,
    (data->>'marketcap')::numeric as marketcap,
    (data->>'price')::double as price,
    (data->>'timestamp')::timestamp as timestamp
FROM converted_casted
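
For readers more at home in Python than SQL, the same decode-then-cast steps can be sketched with the standard library. This is only an analogy to the staging view, not part of the pipeline. The sample message and the idea that numeric values arrive as strings are assumptions for illustration.

```python
import json
from datetime import datetime
from decimal import Decimal


def stage_record(raw: bytes) -> dict:
    """Decode one raw message value and cast its fields, like the staging view."""
    # Equivalent of CONVERT_FROM(data, 'utf8') followed by the jsonb cast
    data = json.loads(raw.decode("utf-8"))
    return {
        "uuid": str(data["uuid"]),
        "symbol": str(data["symbol"]),
        "marketcap": Decimal(str(data["marketcap"])),  # ::numeric
        "price": float(data["price"]),  # ::double
        "timestamp": datetime.strptime(data["timestamp"], "%Y-%m-%d %H:%M:%S"),
    }


# Hypothetical message value, shaped like the producer's output
raw_value = json.dumps(
    {
        "uuid": "abc",
        "symbol": "BTC",
        "marketcap": "800000000000",
        "price": "42000.5",
        "timestamp": "2022-03-07 20:24:51",
    }
).encode("utf-8")
staged = stage_record(raw_value)
```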

Materialized Views in Materialize

One of the great features of Materialize is its materialized views. A materialized view is the result of a SELECT query that Materialize computes in a streaming fashion and keeps incrementally updated as new data arrives.

As in every structured streaming framework, to provide a valid analysis we can benefit from defining a set of assumptions for windowing and lateness of data (grace periods). Currently, Materialize makes use of mz_logical_timestamp() function to define windowing and grace periods of data. The mz_logical_timestamp() function represents the current timestamp in milliseconds at the time of the query execution.

The questions require us to keep track of the last 20 minutes of data. Theoretically we need to analyze data in sliding windows. Following the nice explanation in Materialize documentation, using the mz_logical_timestamp() we can compute the answering queries of our crypto analysis for the past 20 minutes.

{{
    config(
        materialized ='materializedview',
        tags=['crypto']
    )
}}

{# 20 mins #}
{% set slide_threshold = '1200000' %}

WITH with_casted_insertion AS (

    SELECT *, extract(epoch from timestamp) * 1000 AS inserted_at
    FROM {{ ref('stg_crypto') }}

)

SELECT * FROM with_casted_insertion
WHERE TRUE
    AND mz_logical_timestamp() >= inserted_at
    AND mz_logical_timestamp() < inserted_at + {{ slide_threshold }}

In the producer, we did not provide an insertion time in milliseconds. With SQL, we can convert the timestamp field into milliseconds since the epoch and call it inserted_at. This new field serves as our reference point for windowing.
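
The window condition itself is small enough to replay in plain Python. The sketch below mirrors the two comparisons in the WHERE clause, with a now_ms argument standing in for mz_logical_timestamp(). It assumes the producer's timestamp strings are interpreted as UTC, and the function names are made up for illustration.

```python
from datetime import datetime, timezone

SLIDE_THRESHOLD_MS = 1_200_000  # 20 minutes, same value as slide_threshold


def to_epoch_ms(timestamp: str) -> int:
    # Mirrors extract(epoch from timestamp) * 1000, assuming UTC
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


def in_window(inserted_at_ms: int, now_ms: int,
              threshold_ms: int = SLIDE_THRESHOLD_MS) -> bool:
    # A row is visible from its insertion time until 20 minutes later
    return inserted_at_ms <= now_ms < inserted_at_ms + threshold_ms


inserted_at = to_epoch_ms("2022-03-07 20:00:00")
visible_after_19_min = in_window(inserted_at, inserted_at + 19 * 60 * 1000)
visible_after_20_min = in_window(inserted_at, inserted_at + 20 * 60 * 1000)
```

A row inserted at 20:00:00 is still in the window 19 minutes later and drops out exactly at the 20-minute mark, because the upper bound is exclusive.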

The mz_logical_timestamp() function and its usage are quite complicated. I suggest reading the nice documentation page from Materialize to see other use cases and their explanations.

On top of the materialized view created above we can write our analytics queries to expose in a dashboarding tool, such as Metabase. The analytics queries would use dbt functionalities as well, materialized as materializedview. You can find the resulting queries in the repository.
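
To give a feel for what such an analytics query computes, here is a rough Python equivalent of a per-coin aggregation over the rows that are currently inside the window. It is not one of the queries from the repository. The choice of average and population standard deviation, and the sample rows, are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise_prices(rows: list) -> dict:
    """Group windowed rows by symbol and aggregate their prices."""
    prices_by_symbol = defaultdict(list)
    for row in rows:
        prices_by_symbol[row["symbol"]].append(row["price"])
    return {
        symbol: {"avg_price": mean(prices), "price_stddev": pstdev(prices)}
        for symbol, prices in prices_by_symbol.items()
    }


# Hypothetical rows, as they might come out of the sliding window view
window_rows = [
    {"symbol": "BTC", "price": 1.0},
    {"symbol": "BTC", "price": 3.0},
    {"symbol": "ETH", "price": 2.0},
]
summary = summarise_prices(window_rows)
```

In Materialize the equivalent GROUP BY is maintained incrementally, so the dashboard always reads an up-to-date result instead of recomputing it.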

Resulting Lineage and Last Words

After writing the analytics queries, all we need to do is run the crypto-tagged models from dbt once:

dbt run -m tag:crypto

The output from dbt will be as:

19:01:19  Found 10 models, 0 tests, 0 snapshots, 0 analyses, 180 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
19:01:19
19:01:19  Concurrency: 1 threads (target='dev')
19:01:19
19:01:19  1 of 5 START source model public.rc_coins....................................... [RUN]
19:01:19  1 of 5 OK created source model public.rc_coins.................................. [CREATE SOURCE in 0.08s]
19:01:19  2 of 5 START view model public.stg_crypto....................................... [RUN]
19:01:19  2 of 5 OK created view model public.stg_crypto.................................. [CREATE VIEW in 0.05s]
19:01:19  3 of 5 START materializedview model public.fct_crypto_sliding_window............ [RUN]
19:01:19  3 of 5 OK created materializedview model public.fct_crypto_sliding_window....... [CREATE VIEW in 0.06s]
19:01:19  4 of 5 START materializedview model public.marketcap_changes.................... [RUN]
19:01:19  4 of 5 OK created materializedview model public.marketcap_changes............... [CREATE VIEW in 0.06s]
19:01:19  5 of 5 START materializedview model public.volatile_cryptos..................... [RUN]
19:01:20  5 of 5 OK created materializedview model public.volatile_cryptos................ [CREATE VIEW in 0.05s]
19:01:20
19:01:20  Finished running 1 source model, 1 view model, 3 materializedview models in 0.38s.
19:01:20
19:01:20  Completed successfully
19:01:20
19:01:20  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

The resulting dbt lineage diagram would be as follows:

Real-time analytics is used to get immediate insights from financial data, news sources, social media reactions and healthcare. In most cases, SQL is enough for obtaining basic insights in such systems. Thus, SQL and dbt in combination with Materialize improve the efficiency of developing real-time analytics pipelines. The setup provided at the Hack Day was a great example of this.

The biggest challenge after this point would be the maintenance of these pipelines. Luckily, there is sufficient information on integrating both RedPanda and Materialize with monitoring tools. As a next reading tip, I would nerdly suggest their documentation pages:

Using Google Cloud Platform Operators in Apache Airflow

Nazli Ander — Wed, 25 Aug 2021 15:34:32 +0000

Different Airflow operators create more possibilities while designing a scheduled workflow. Being aware of those enhances our way of dealing with real-world problems. *

There are many Airflow operators that keep impressing me during my daily job. Recently, I played quite a bit with the GCP operators within Airflow. In this write-up, I would like to first provide an overview of the Airflow Cloud Providers (called contributor operators before Airflow 2.0). Then I would like to share one of the example Directed Acyclic Graph (DAG) workflow models, created for this toy project.

Summary of the Airflow Cloud Providers

The Airflow Google Cloud providers package has four main sub-groups of functions, and the same grouping is also valid for Amazon Web Services (AWS):

1. Operators: They are full-fledged operations, enabling us to execute read/create/delete/update/trigger tasks. Operators do not need to be running CRUD operations on datasets. They can also invoke a GCP function. Some examples and their example use cases:

BigQueryDeleteTableOperator: In the end of a BigQuery to Google Cloud Storage (GCS) operation, we might need to delete the existing table.
GCSCreateBucketOperator: To store a set of dataset partitions for specific dates, we might need to create a bucket in the beginning of a DAG.
CloudFunctionInvokeFunctionOperator: In case we update a certain point of a Cloud Function, we might need to test it in the end of a DAG.

2. Hooks: They are flexible clients, enabling us to interact with cloud providers. Hooks are the underlying clients in the operators, and we can use them to create customised functions (Python callables) or operators. Some examples are listed below. All include the methods for CRUD operations in the services they mention in their name:

3. Sensors: They are simply checkers. They check if data exists in a certain location. They can be used mostly for checking if a certain operation is completed. Some examples and their example use cases:

BigQueryTablePartitionExistenceSensor: You might need to run a sequential task to process the latest partition right after you check that it has been created by another task.
GCSObjectExistenceSensor: You might need to run a sequential task to process an object within a certain bucket right after you check that it has been created by another task.

4. Transfers: They are simply data transport operations. They enable us to move data from one bucket to another one. They can also move data from one service to another, even if the service belongs to another cloud provider. Some examples are listed below:

GCP operators in Airflow can be summarised as in the following chart:

We need a GCP connection (id) on Airflow to have a functioning setup for Google Cloud operations. The GCP connection can be set via configurations (some DevOps effort), or it can be set through the Airflow Web UI. It is explained here. To enable authorisation, each GCP task that we create needs to refer to the GCP connection id.

The Example GCP DAG

The example DAG is shown in the following chart. With this DAG, I aim to load a partitioned table into the Google Cloud Storage (GCS), then compose the multiple files generated by the previous process.

Each task within this DAG can be considered as a kind of GCP configuration. They are explained as follows:

1. BigQuery to GCS:

The task retrieves a partitioned table from BigQuery and exports into a GCS bucket. For each partition we create a separate gzipped text file (compression, export_format). Just to create a set of assumptions, I do not want to have the header row (print_header), and I use comma (,) as text file delimiter (field_delimiter). The partitions are created with the wildcard (*) character in the end of the file name (destination_cloud_storage_uris).



from airflow.models import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import (
    BigQueryToGCSOperator)
from airflow.utils.dates import days_ago

# Define configuration global variables

with DAG(
    'gcp_dag',
    schedule_interval=None,
    start_date=days_ago(1),
    tags=['example'],
) as dag:

    bigquery_to_gcs = BigQueryToGCSOperator(
        gcp_conn_id='gcp_connection_id',
        task_id='bigquery_to_gcs',
        compression='GZIP',
        export_format='CSV',
        field_delimiter=',',
        print_header=False,
        source_project_dataset_table=f'{DATASET_NAME}.{TABLE}',
        destination_cloud_storage_uris=[
            f'gs://{DATA_EXPORT_BUCKET_NAME}/{EXPECTED_FILE_NAME}-*.csv.gz',
        ],
    )

    # Define other operations

2. GCS Compose:

GCP operators do not provide a full-fledged solution when it comes to the problem of composing multiple text files into one gzip file. Thus we need to create our own customised functions (Python callable) or operators (see this documentation).

The Python Operator, in my example DAG, calls a mysterious function (compose_files_into_one) with bucket_name, source_object_prefix, destination_object, gcp_conn_id parameters. Those parameters are provided as keys in the Python Operator's op_kwargs argument.



from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Define configuration global variables

with DAG(
    'gcp_dag',
    schedule_interval=None,
    start_date=days_ago(1),
    tags=['example'],
) as dag:

    # Previous operations

    compose_files = PythonOperator(
        task_id='gcs_compose',
        python_callable=compose_files_into_one,
        op_kwargs={
            'bucket_name': DATA_EXPORT_BUCKET_NAME,
            'source_object_prefix': EXPECTED_FILE_NAME,
            'destination_object': f'{EXPECTED_FILE_NAME}.csv.gz',
            'gcp_conn_id': 'gcp_connection_id'
            },
    )

    # Any other operations

As the Python Operator calls compose_files_into_one, all the magic happens there. compose_files_into_one is a function that contains all the hook logic. It uses the GCSHook as a client to list all the objects with the given prefix. Then it composes the partition files into one gzip file.



from airflow.providers.google.cloud.hooks.gcs import GCSHook


def compose_files_into_one(bucket_name: str,
                           source_object_prefix: str,
                           destination_object: str,
                           gcp_conn_id: str) -> None:
    '''Composes wildcarded files into one in the given destination'''
    gcs_hook = GCSHook(
        gcp_conn_id=gcp_conn_id
    )
    list_of_objects = gcs_hook.list(
        bucket_name,
        prefix=source_object_prefix
    )
    gcs_hook.compose(
        bucket_name,
        source_objects=list_of_objects,
        destination_object=destination_object
    )
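
It may not be obvious why composing already-gzipped partitions yields a valid gzip file. A compose concatenates the source objects in order, and concatenated gzip members form a valid gzip stream. The following local sketch shows that property with the standard library only. Joining bytes in memory stands in for the GCS compose call, and the sample rows are made up.

```python
import gzip

# Two separately compressed "partitions", as the export step would produce
partition_one = gzip.compress(b"1,alice\n")
partition_two = gzip.compress(b"2,bob\n")

# Plain byte concatenation stands in for the compose operation
composed = partition_one + partition_two

# Decompressing the concatenation returns the rows of both partitions
restored = gzip.decompress(composed)
```

Here restored holds the rows of both partitions in their original order, which is why the composed object can be read like a single gzipped CSV file.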

The compose operation can be summarised with the following chart:

3. Delete Redundant Objects:

At the end of the compose task, I have the partitioned files as well as the composed gzip file in the GCS bucket, so the same data exists twice: once as multiple files and once as a compact gzip file.

As a conscientious developer I want to delete the remaining partitions from the bucket and keep only the compact gzip file. Thus, I use GCS Delete Objects Operator to create the last task of my example DAG.

GCS Delete Objects Operator does not require much of a configuration. It needs a bucket_name, a gcp_conn_id, and a prefix parameter. With the prefix parameter, we can filter out the objects to be deleted. So, if the partitions start with <defined_text_file_name>-<partition_number>, then we can use <defined_text_file_name>- to filter out all the partitions to be deleted.
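
The trailing hyphen in the prefix is what protects the composed file from deletion. The plain-Python sketch below imitates prefix matching with str.startswith. The object names are hypothetical, and filter_by_prefix is only an illustration, not a call into the GCS client.

```python
def filter_by_prefix(object_names: list, prefix: str) -> list:
    # Keep only the names that start with the given prefix
    return [name for name in object_names if name.startswith(prefix)]


# Hypothetical bucket content after the compose task
objects = [
    "export-000000000000.csv.gz",
    "export-000000000001.csv.gz",
    "export.csv.gz",
]

# With the trailing hyphen only the partitions match
partitions = filter_by_prefix(objects, "export-")

# Without it the composed file matches as well
everything = filter_by_prefix(objects, "export")
```

The same reasoning applies to the compose step: listing with the bare file name as prefix would also pick up a composed file left over from an earlier run.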



from airflow.models import DAG
from airflow.providers.google.cloud.operators.gcs import (
    GCSDeleteObjectsOperator)
from airflow.utils.dates import days_ago

with DAG(
    'gcp_dag',
    schedule_interval=None,
    start_date=days_ago(1),
    tags=['example'],
) as dag:

    # Previous operations

    delete_combined_objects = GCSDeleteObjectsOperator(
        task_id='gcs_combined_files_delete',
        gcp_conn_id='gcp_connection_id',
        bucket_name=DATA_EXPORT_BUCKET_NAME,
        prefix=f'{EXPECTED_FILE_NAME}-'
    )

    # DAG definition (dependency)

Last Words

GCP operators in Airflow are quite extendable and lightweight, and they require a small amount of configuration. Most of the operators are well-fitting for the use cases that I am able to think of. This write-up is a result of my appreciation of this nicely evolved providers package of Airflow. With every release of the providers' packages, yet another shortcut is added in the lives of data programmers. This is another good example of how being aware of those shortcuts can make our workflows more reliable and maintainable.

To have a look at the whole example DAG, with its Docker Compose setup, you can refer to its GitHub repository.

(*) This reminds me a lot of the godly boons I got when playing Hades. Boons are limited upgrades that the gods randomly grant to the protagonist, Zagreus, on his journeys. Those upgrades make Zagreus' life easier for some time, exactly like the Google Cloud Platform (GCP) operators in Airflow being boons that make our lives easier for a while.

Using Pydantic as a Parsing and Data Validation Tool

Nazli Ander — Wed, 16 Dec 2020 20:17:42 +0000

Pydantic provides a BaseModel, which can be subclassed with typed fields and collections for data modeling. It has support for Enum types, JSON conversion configurations, and even HTTP string parsing.

Of course, we need reasons to use all those nice functionalities. Hence, as every home-made-project has its storylines, I created a database use case related to NASA's APIs. First I will briefly explain the use case, followed by the concepts that I experimented with in my toy-project.

Introduction

NASA has a bunch of cool and publicly available APIs. All we need to do as developers to use them is to request an API KEY. Then we can request data related to the latest innovations that NASA has, or pictures of the day, or the weather notifications.

For this application, I chose the weather notifications (DONKI), because they have some text and datetime fields that I could make use of in this parsing case.

With the requested weather notification data, I wanted to parse them nicely with Pydantic, then insert them into a document database (MongoDB).

A raw response looks like the following:

{
    "messageType":"RBE",
    "messageID":"20201007-AL-001",
    "messageURL":"https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "messageIssueTime":"2020-10-07T17:18Z",
    "messageBody":"## NASA Goddard Space Flight Center, Space Weather Research Center ( SWRC )\n## Message Type: Space Weather Notification - Radiation Belt Enhancement\n##\n## Message Issue Date: 2020-10-07T17:18:51Z\n## Message ID: 20201007-AL-001\n##\n## Disclaimer: NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.\n\n\n## Summary:\n\nSignificantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. \n\nThe elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001.\n\nNASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted.\n\nActivity ID: 2020-10-07T14:05:00-RBE-001."
}

The main goal of the toy-project is to take this raw data and insert it into MongoDB while ensuring that all the mandatory fields are present with correct data formats. An example record is expected to look like below:

{ 
    "_id" : ObjectId("5fd917b83ecc12560ee43ef1"),
    "insertion_date" : ISODate("2020-12-15T20:08:23.091Z"),
    "message_type_abbreviation" : "RBE",
    "message_url" : "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body" : 
        {
            "message_type" : "Space Weather Notification - Radiation Belt Enhancement",
            "message_issue_date" : ISODate("2020-10-07T17:18:51Z"),
            "message_id" : "20201007-AL-001",
            "disclaimer" : "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
            "summary" : "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
            "notes" : null
        }
}

Some Remarks About the Dataset:

Message Type can be only one of the following categories: FLR, SEP, CME, IPS, MPC, GST, RBE, and Report.
Message Body has text fields separated with the following characters: \n##.
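
Given the second remark, the message body can be broken into fields before it reaches Pydantic. The function below is a minimal sketch of that cleaning step, assuming each field has the form "Name: value" and that the header line without a colon can be skipped. The name parse_message_body and the shortened sample body are made up for illustration and are not taken from the project.

```python
def parse_message_body(message_body: str) -> dict:
    """Split a DONKI message body on '\\n##' into snake_case keys and values."""
    fields = {}
    for chunk in message_body.split("\n##"):
        key, separator, value = chunk.strip().partition(":")
        if not separator:
            continue  # header line or empty separator line
        # Collapse line breaks and repeated spaces inside the value
        fields[key.strip().lower().replace(" ", "_")] = " ".join(value.split())
    return fields


# Shortened, hypothetical body in the same layout as the raw response
sample_body = (
    "## NASA Goddard Space Flight Center\n"
    "## Message Type: Space Weather Notification\n"
    "##\n"
    "## Message Issue Date: 2020-10-07T17:18:51Z\n"
    "## Message ID: 20201007-AL-001\n"
    "##\n"
    "## Summary:\n\nElevated flux.\n\nActivity ID: 2020-10-07T14:05:00-RBE-001."
)
parsed_body = parse_message_body(sample_body)
```

Only the first colon splits key from value, so timestamps and URLs inside a value stay intact.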

Some Nice Concepts From Pydantic

Pydantic comes with a BaseModel. With the BaseModel, we can parse some dictionaries with the correct typing, or change the behavior of a certain type while transforming the BaseModel into JSON.

Perhaps while dealing with nested forms of data, thinking from inside towards outside makes our lives easier. Hence, starting with the Message Body might be useful for this use case. The following example can be used for modeling the Message Body. For typing we can get help from the typing module, as an example to have optional fields:

from typing import Optional
from datetime import datetime
from pydantic import BaseModel


class NotificationMessageBody(BaseModel):
    message_type: str
    message_issue_date: datetime
    message_id: str
    disclaimer: Optional[str]
    summary: Optional[str]
    notes: Optional[str]

We can transform a NotificationMessageBody type into a JSON with the following code:

parsed_and_cleaned_message_body_dict = {
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51Z",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001."
}

print(
    NotificationMessageBody(
        **parsed_and_cleaned_message_body_dict
    ).json()
)

This will print all the fields in a default format. The datetime field will first be transformed from a string to a Python datetime, then the default JSON transformation will output the ISO 8601 datetime string format. The notes field has no input in the example. For that reason, the JSON transformation outputs it as null:

{
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51+00:00",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
    "notes": null
}

Lastly, we can model the whole NotificationMessage. We may choose to use an Enum to check if one of those Enum values within the MessageTypeAbbreviationEnum exists. Also, we might choose to use HttpUrl type from Pydantic. This assures that the URL string contains HTTP or HTTPS protocol. Besides, it breaks the URL into the pieces of scheme, host, tld, host_type, and path fields:

from datetime import datetime
from pydantic import BaseModel, HttpUrl
from enum import Enum


class MessageTypeAbbreviationEnum(str, Enum):
    FLR = "FLR"
    SEP = "SEP"
    CME = "CME"
    IPS = "IPS"
    MPC = "MPC"
    GST = "GST"
    RBE = "RBE"
    Report = "Report"


class NotificationMessage(BaseModel):
    insertion_date: datetime
    message_type_abbreviation: MessageTypeAbbreviationEnum
    message_url: HttpUrl
    message_body: NotificationMessageBody

    class Config:
        json_encoders = {
            datetime: lambda v: v.strftime("%Y-%m-%d %H:%M:%S")
        }

The example contains an additional configuration for the JSON encoders. This time different from the NotificationMessageBody example, the json_encoders ensures that the datetime is formatted as %Y-%m-%d %H:%M:%S as the JSON transformation is being done.
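
The idea behind a per-type JSON encoder can be illustrated with the standard library alone. The sketch below passes a fallback function to json.dumps that formats datetime values in the same pattern. It is an analogy to what json_encoders achieves, not a description of how Pydantic implements it, and encode_datetime is a made-up name.

```python
import json
from datetime import datetime, timezone


def encode_datetime(value):
    # Called by json.dumps for objects it cannot serialise on its own
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    raise TypeError(f"Cannot serialise {type(value).__name__}")


record = {
    "insertion_date": datetime(2020, 12, 15, 20, 8, 23, 91000, tzinfo=timezone.utc)
}
encoded = json.dumps(record, default=encode_datetime)
```

The formatted string drops both the milliseconds and the timezone suffix, which is the same effect the configured encoder has on the model's datetime fields.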

An example input:

parsed_and_cleaned_message_body_dict = {
    "message_type": "Space Weather Notification - Radiation Belt Enhancement",
    "message_issue_date": "2020-10-07T17:18:51Z",
    "message_id": "20201007-AL-001",
    "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
    "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001."
}

parsed_and_cleaned_notification_message_dict = {
    "insertion_date": "2020-12-15T20:08:23.091Z",
    "message_type_abbreviation": "RBE",
    "message_url": "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body": parsed_and_cleaned_message_body_dict
}

print(
    NotificationMessage(
        **parsed_and_cleaned_notification_message_dict
    ).json()
)

The example gives the following output. The datetime fields (insertion_date and message_issue_date) contain the given datetime format (%Y-%m-%d %H:%M:%S) in the configuration class (2020-12-15T20:08:23.091Z becomes 2020-12-15 20:08:23). And all the message_body fields are parsed as the NotificationMessageBody model suggests:

{
    "insertion_date": "2020-12-15 20:08:23",
    "message_type_abbreviation": "RBE",
    "message_url": "https://kauai.ccmc.gsfc.nasa.gov/DONKI/view/Alert/15920/1",
    "message_body": 
        {
            "message_type": "Space Weather Notification - Radiation Belt Enhancement",
            "message_issue_date": "2020-10-07 17:18:51",
            "message_id": "20201007-AL-001",
            "disclaimer": "NOAA's Space Weather Prediction Center (http://swpc.noaa.gov) is the United States Government official source for space weather forecasts. This \"Experimental Research Information\" consists of preliminary NASA research products and should be interpreted and used accordingly.",
            "summary": "Significantly elevated energetic electron fluxes in the Earth's outer radiation belt. GOES \"greater than 2.0 MeV\" integral electron flux is above 1000 pfu starting at 2020-10-07T14:05Z. The elevated energetic electron flux levels are caused by an S-type CME with ID 2020-09-30T12:09:00-CME-001. NASA spacecraft at GEO, MEO and other orbits passing through or in the vicinity of the Earth's outer radiation belt can be impacted. Activity ID: 2020-10-07T14:05:00-RBE-001.",
            "notes": null
        }
}
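To make the datetime conversion above concrete, here is a minimal standard-library sketch of the same formatting step. It is not how Pydantic does it internally; the helper name reformat_timestamp and the two accepted input formats are assumptions chosen to match the example values.

```python
from datetime import datetime

OUTPUT_FORMAT = "%Y-%m-%d %H:%M:%S"
# Assumed input shapes: with and without fractional seconds
INPUT_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def reformat_timestamp(raw: str) -> str:
    """Parse an ISO-like UTC timestamp and render it in OUTPUT_FORMAT."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime(OUTPUT_FORMAT)
        except ValueError:
            continue  # try the next known input format
    raise ValueError(f"Unsupported timestamp: {raw}")


print(reformat_timestamp("2020-12-15T20:08:23.091Z"))
print(reformat_timestamp("2020-10-07T17:18:51Z"))
```

The fractional seconds are dropped by the output format, which is why the milliseconds disappear in the serialized JSON.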

Insertion into the DB

After parsing and validating the nested dictionaries against the NotificationMessage model, one can insert the resulting notifications into a database. A document database is a natural fit, as its records have a structure similar to JSON. The project uses MongoDB.

The popular package for MongoDB, pymongo, has a handy method for inserting many records in one batch. Unsurprisingly, it is called insert_many, and it expects a list of dictionaries. For this purpose, I used the dictionary transformation method (dict) from Pydantic:

notifications = donki_parser.create_message_dictionary()
notifications_as_dict = list(map(lambda n: n.dict(), notifications))

notifications_repository = NotificationsRepository(
    host=MONGO_HOST,
    port=MONGO_PORT
)

notifications_repository.insert_many(notifications_as_dict)

Last Words

It is enjoyable to learn more about data parsing and validating in Python. Pydantic is a handy tool for this purpose. The examples that I gave require quite a bit of preprocessing. Perhaps that is because of the example data API that I have chosen. A large number of text fields in the message body make the data hard to parse. After solving this puzzle, Pydantic made sure that all fields are correct and ready to insert into a document database.

You can check more in the Pydantic documentation, about the functionalities that they provide.

For the whole project, please refer to the Github repository.

Creating a Development Environment for Spark Structured Streaming, Kafka, and Prometheus

Nazli Ander — Sun, 04 Oct 2020 19:53:22 +0000

Docker-compose allows us to simulate pretty complex programming setups in our local environments. It is very fun to test some hard-to-maintain technologies such as Kafka and Spark using Docker-compose.

A few months ago, I created a demo application while using Spark Structured Streaming, Kafka, and Prometheus within the same Docker-compose file. One can extend this list with an additional Grafana service. The codebase was in Python and I was ingesting live Crypto-currency prices into Kafka and consuming those through Spark Structured Streaming. In this write-up instead of talking about the Watermarks and Sinking types in Spark Structured Streaming, I will be only talking about the Docker-compose and how I set up my development environment using Spark, Kafka, Prometheus, and a Zookeeper. To have the whole codebase for my demo project, please refer to the Github repository.

Service Blocks

In the Docker-compose file, I needed the following services to keep my streaming data producer and consumer alive and, at the same time, monitor the ingestion into Kafka:

Spark standalone cluster: Consisting of one master and one worker node
- Spark-master
- Spark-worker
Zookeeper: A requirement for Kafka (soon it will not be a requirement) to maintain the brokers and topics. For instance, if a broker joins or dies, Zookeeper informs the cluster.
Kafka: A Message-oriented Middleware (MoM) for dealing with large streams of data. In this case, we have streams of crypto-currency prices.
Prometheus-JMX-Exporter: An exporter that connects to Java Management Extensions (JMX) and translates the metrics into a format that Prometheus can understand. Since Kafka is a Java application, this is the magic service that enables us to scrape Kafka metrics automatically.
Prometheus: Time-series database logging and modern alerting tool.

Spark Services

In the most basic setup for the standalone Spark cluster, we need one master and one worker node. You can use Docker-compose volumes for mounting folders. For Spark, perhaps the most common mounting reason is sharing the connectors (.jar files) or scripts.

To retrieve a Spark image from Docker Hub, I preferred the Big Data Europe images in my demo project, as they maintain a very stable and extensive set of Spark Hadoop images. This also saved some redundant work, like creating a separate Dockerfile per Spark node.

I needed to take care of the networking within the Docker-compose settings. Hence, I created a bridge network with the custom name "crypto-network". The bridge network enables our standalone containers to communicate with each other. For more information about the different network drivers in Docker, please refer to the Docker documentation, which is very fun to read. While setting up, I forwarded the Spark Web UIs to host ports other than 8080 to prevent conflicts with the JMX-Exporter. Besides, I made the worker node depend on the master node to fix the order of container creation.

Lastly, following the BDE example, I set SPARK_MASTER through an environment variable. Here I am sharing the Spark component of the demo application.

---
version: "3.2"
services:

  spark-master:
    image: bde2020/spark-master:2.2.2-hadoop2.7
    container_name: spark-master
    networks:
      - crypto-network
    volumes:
      - ./connectors:/connectors
      - ./:/scripts/
    ports:
      - 8082:8080
      - 7077:7077
    environment:
      - INIT_DAEMON_STEP=false

  spark-worker-1:
    image: bde2020/spark-worker:2.2.2-hadoop2.7
    container_name: spark-worker-1
    networks:
      - crypto-network
    depends_on:
      - spark-master
    ports:
      - 8083:8081
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"


networks:
  crypto-network:
    driver: "bridge"

You can start the services with:

docker-compose up

Then you can open a shell in the Spark master container to inspect its setup with:

docker exec -it spark-master bash

Kafka Services

To run Kafka in a standalone mode, I needed Zookeeper and Kafka itself with some fancy environment variables. Basically, Kafka needs to find the Zookeeper client port and it needs to advertise the correct ports to Spark applications.

To run this setting I used the Confluent images. Here, I am sharing the Kafka related services block. A Confluent image already allows us to set up:

Kafka topics by using the environment variables:
- KAFKA_CREATE_TOPICS: Topic names to be created
- KAFKA_AUTO_CREATE_TOPICS_ENABLE: Whether topics are created automatically on first use
- KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: Replication factor of the internal offsets topic (1 here, as there is a single broker)
Connection to Zookeeper using the environment variable KAFKA_ZOOKEEPER_CONNECT
A custom broker id for a particular node using KAFKA_BROKER_ID
Advertising the correct ports for the docker network internal services or external connections:
- KAFKA_INTER_BROKER_LISTENER_NAME: Listener name for the setup
- KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: Listener setup with mapping
- KAFKA_ADVERTISED_LISTENERS: Listener setup for internal and external networking. This is a bit tricky: if I consume or produce messages inside the Docker network, with the example below I need to connect to kafka:29092. From outside of Docker, I can connect a consumer or producer via localhost:9092. For more information, here is an awesome explanation.

---
version: "3.2"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper
    container_name: zookeeper
    networks:
      - crypto-network
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka
    container_name: kafka
    depends_on:
      - zookeeper
    networks:
      - crypto-network
    ports:
      - 9092:9092
      - 30001:30001
    environment:
      KAFKA_CREATE_TOPICS: crypto_raw,crypto_latest_trends,crypto_moving_average
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 100

networks:
  crypto-network:
    driver: "bridge"
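The two advertised listeners are the part that usually causes confusion, so here is a small Python sketch that parses the KAFKA_ADVERTISED_LISTENERS value from the file above and picks the address a client should use. The helper names (parse_advertised_listeners, bootstrap_server) are hypothetical and only illustrate the idea; they are not part of any Kafka client.

```python
# Value copied from the Docker-compose file above
ADVERTISED_LISTENERS = "PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092"


def parse_advertised_listeners(value: str) -> dict:
    """Split 'NAME://host:port,...' into a {listener name: address} mapping."""
    listeners = {}
    for entry in value.split(","):
        name, address = entry.split("://", 1)
        listeners[name.strip()] = address.strip()
    return listeners


def bootstrap_server(inside_docker: bool) -> str:
    """Return the address for clients inside or outside the Docker network."""
    listeners = parse_advertised_listeners(ADVERTISED_LISTENERS)
    # PLAINTEXT is the internal listener, PLAINTEXT_HOST the one for the host
    return listeners["PLAINTEXT" if inside_docker else "PLAINTEXT_HOST"]


print(bootstrap_server(inside_docker=True))   # for the Spark and producer containers
print(bootstrap_server(inside_docker=False))  # for a script running on the host
```

A producer running as a container in crypto-network would use the first address, while a local test script would use the second.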

Prometheus Services

In this project, I wanted to scrape Kafka's metrics automatically. Hence, apart from the Prometheus service itself, I also needed to use the JMX-Exporter. And I realized that it is the coolest kid in a Docker-compose.

For both Prometheus and the JMX-Exporter, I needed custom Dockerfiles, as they require some templates to be aware of each other. I used a separate ./tools/ folder to keep my monitoring-related settings. Within ./tools/prometheus-jmx-exporter, I had a confd folder, which is used to configure the Docker container at run-time. The file structure is as follows:

.
├── prometheus
│   ├── Dockerfile
│   └── prometheus.yml
└── prometheus-jmx-exporter
    ├── Dockerfile
    ├── confd
    │   ├── conf.d
    │   │   ├── kafka.yml.toml
    │   │   └── start-jmx-scraper.sh.toml
    │   └── templates
    │       ├── kafka.yml.tmpl
    │       └── start-jmx-scraper.sh.tmpl
    └── entrypoint.sh

Let's start with the Prometheus image as it is more straightforward. We need to use a custom Dockerfile to get the config with custom scraper settings.

The Dockerfile will be:

FROM prom/prometheus:v2.8.1

ADD ./prometheus.yml /etc/prometheus/prometheus.yml

CMD [ "--config.file=/etc/prometheus/prometheus.yml","--web.enable-admin-api" ]

The prometheus.yml looks as follows, with a scrape interval of 5 seconds. In prometheus.yml, Prometheus targets a service called kafka-jmx-exporter on port 8080. Hence, in the Docker-compose file, the JMX-Exporter service must carry the same name as the targeted service.

global:
  scrape_interval:     5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'kafka'
    scrape_interval: 5s
    static_configs:
      - targets: ['kafka-jmx-exporter:8080']

To create the JMX-Exporter image, I needed more tweaks. Let's start with the Dockerfile. The image for the JMX-Exporter uses a Java base image. It then downloads the JMX Prometheus .jar from the Maven repository and writes it to /opt/jmx_prometheus_httpserver/jmx_prometheus_httpserver.jar. Next, it downloads Confd, stores it in /usr/local/bin/confd, and gives it execute permissions. Lastly, it copies the entrypoint into /opt/entrypoint.sh.

FROM java:8

RUN mkdir /opt/jmx_prometheus_httpserver && wget 'https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.11.0/jmx_prometheus_httpserver-0.11.0-jar-with-dependencies.jar' -O /opt/jmx_prometheus_httpserver/jmx_prometheus_httpserver.jar

ADD https://github.com/kelseyhightower/confd/releases/download/v0.16.0/confd-0.16.0-linux-amd64 /usr/local/bin/confd
COPY confd /etc/confd
RUN chmod +x /usr/local/bin/confd

COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]

In the entrypoint.sh, I had only the execution of Confd, followed by running start-jmx-scraper.sh. Hence, after Confd renders the templates into their destination files for both the Kafka config and the JMX scraper script, as defined in the .toml files, we run the downloaded jmx_prometheus_httpserver.jar file. The entrypoint.sh looks like this:

#!/bin/bash
/usr/local/bin/confd -onetime -backend env
/opt/start-jmx-scraper.sh

And start-jmx-scraper.sh is as follows. The environment variables in the Docker-compose file define each of the keys (JMX_PORT, JMX_HOST, HTTP_PORT, JMX_EXPORTER_CONFIG_FILE) mentioned in the command:

#!/bin/bash
java \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Djava.rmi.server.hostname={{ getv "/jmx/host" }} \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.port={{ getv "/jmx/port" }} \
    -jar /opt/jmx_prometheus_httpserver/jmx_prometheus_httpserver.jar \
    {{ getv "/http/port" }} \
    /opt/jmx_prometheus_httpserver/{{ getv "/jmx/exporter/config/file" }}
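To see how a key such as /jmx/port ends up reading the JMX_PORT environment variable, here is a rough Python imitation of what Confd does with its env backend: slashes become underscores and the key is upper-cased. This is only a sketch of the mapping and of the getv substitution, not Confd's actual implementation, and the function names are made up for illustration.

```python
import re

# Matches template expressions such as {{ getv "/jmx/port" }}
GETV_PATTERN = re.compile(r'\{\{\s*getv\s+"([^"]+)"\s*\}\}')


def key_to_env_name(key: str) -> str:
    """Map a Confd key like /jmx/port to the variable name JMX_PORT."""
    return key.strip("/").replace("/", "_").upper()


def render(template: str, env: dict) -> str:
    """Replace every getv expression with the matching environment value."""
    return GETV_PATTERN.sub(lambda m: env[key_to_env_name(m.group(1))], template)


env = {"JMX_PORT": "30001", "JMX_HOST": "kafka"}
line = '-Dcom.sun.management.jmxremote.port={{ getv "/jmx/port" }}'
print(render(line, env))
```

With the environment from the Docker-compose file, the rendered script therefore contains the concrete host, ports, and config file name.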

With the given custom Docker images for Prometheus automatically scraping Kafka, the full Docker-compose file for the demo project is as follows:

---
version: "3.2"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper
    container_name: zookeeper
    networks:
      - crypto-network
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka
    container_name: kafka
    depends_on:
      - zookeeper
    networks:
      - crypto-network
    ports:
      - 9092:9092
      - 30001:30001
    environment:
      KAFKA_CREATE_TOPICS: crypto_raw,crypto_latest_trends,crypto_moving_average
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 100
      KAFKA_JMX_PORT: 30001
      KAFKA_JMX_HOSTNAME: kafka

  kafka-jmx-exporter:
    build: ./tools/prometheus-jmx-exporter
    container_name: jmx-exporter
    ports:
      - 8080:8080
    links:
      - kafka
    networks:
      - crypto-network
    environment:
      JMX_PORT: 30001
      JMX_HOST: kafka
      HTTP_PORT: 8080
      JMX_EXPORTER_CONFIG_FILE: kafka.yml

  prometheus:
    build: ./tools/prometheus
    container_name: prometheus
    networks:
      - crypto-network
    ports:
      - 9090:9090

  spark-master:
    image: bde2020/spark-master:2.2.2-hadoop2.7
    container_name: spark-master
    networks:
      - crypto-network
    volumes:
      - ./connectors:/connectors
      - ./:/scripts/
    ports:
      - 8082:8080
      - 7077:7077
    environment:
      - INIT_DAEMON_STEP=setup_spark

  spark-worker-1:
    image: bde2020/spark-worker:2.2.2-hadoop2.7
    container_name: spark-worker-1
    networks:
      - crypto-network
    depends_on:
      - spark-master
    ports:
      - 8083:8081
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

  producer:
    build:
      context: .
      dockerfile: ./Dockerfile.producer
    container_name: producer
    depends_on:
      - kafka
    networks:
      - crypto-network

networks:
  crypto-network:
    driver: "bridge"

As the Docker-compose file contains an additional producer service, when we run the following, we can check the Kafka topic messages per minute in the Prometheus UI at <IP_LOCAL>:9090:

docker-compose up

Here the output of the Prometheus UI will be as follows:

Last Words

This was a demo project that I made for studying Watermarks and Windowing functions in Streaming Data Processing. Therefore I needed to create a custom producer for Kafka, and consume those using Spark Structured Streaming. Although the development phase of the project was super fun, I also enjoyed creating this pretty long Docker-compose example.

In case more detail is needed, I am sharing the Github repository.

Writing Custom Cross-Validation Methods For Grid Search in Scikit-learn

Nazli Ander — Sat, 03 Oct 2020 11:29:21 +0000

Recently I was interested in applying a Blocking Time Series Split, following this lovely post, in a Grid Search hyper-parameter tuning setting using the scikit-learn library, to maintain the time order and prevent information leakage. In this post, I will try to document some knowledge that I built while reading through the articles, documentation, and blog posts about custom cross-validation generators in Python.

It is great that scikit-learn provides a class called TimeSeriesSplit, which generates training and test sets over fixed intervals. Here is a basic example using scikit-learn data generators. I generate a regression dataset with 5 features and 30 samples. Then I generate 3 splits. For those 3 splits, we obtain at most 10 training examples (max_train_size) and n_samples // (n_splits + 1) = 7 test examples:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import TimeSeriesSplit

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

tscv = TimeSeriesSplit(max_train_size=10, n_splits=3)

for idx, (x, y) in enumerate(tscv.split(X_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

The output follows a Walk Forward Cross Validation pattern:

Split number: 0
Training indices: [0 1 2 3 4 5 6 7 8]
Test indices: [ 9 10 11 12 13 14 15]

Split number: 1
Training indices: [ 6  7  8  9 10 11 12 13 14 15]
Test indices: [16 17 18 19 20 21 22]

Split number: 2
Training indices: [13 14 15 16 17 18 19 20 21 22]
Test indices: [23 24 25 26 27 28 29]
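To check the arithmetic behind those indices, here is a plain-Python sketch that reproduces them without scikit-learn. It follows the documented sizing rule (test size of n_samples // (n_splits + 1), training capped by max_train_size) and is an illustration under those assumptions, not a copy of the library's code; the function name walk_forward_indices is my own.

```python
def walk_forward_indices(n_samples, n_splits, max_train_size=None):
    """Yield (training, test) index lists in walk-forward order."""
    test_size = n_samples // (n_splits + 1)
    # The test blocks cover the last n_splits * test_size samples
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_start = 0
        if max_train_size is not None and test_start > max_train_size:
            # Keep only the most recent max_train_size training samples
            train_start = test_start - max_train_size
        training = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield training, test


for idx, (training, test) in enumerate(walk_forward_indices(30, 3, 10)):
    print(idx, len(training), len(test))
```

The first split has only 9 training samples because just 9 samples precede the first test block; the later splits hit the cap of 10.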

However, the setting that I found was using dates instead of timestamps. This led to discrete numeric values as anchor points for the cross-validation splits, instead of continuous ones. Hence, I was not able to leverage the TimeSeriesSplit from scikit-learn. Instead, I wrote a simple generator object with groupings for date splits to use in Grid Search.

import numpy as np
import pandas as pd


class CustomCrossValidation:

    @classmethod
    def split(cls,
              X: pd.DataFrame,
              y: np.ndarray = None,
              groups: np.ndarray = None):
        """Yields (training, test) indices of a grouped time series split."""
        assert groups is not None and len(X) == len(groups), (
            "Length of the predictors is not "
            "matching with the groups.")
        # Groups must be integers that follow the order of time
        for group_idx in range(groups.min(), groups.max()):

            training_group = group_idx
            # Gets the next group right after
            # the training as test
            test_group = group_idx + 1
            training_indices = np.where(
                groups == training_group)[0]
            test_indices = np.where(groups == test_group)[0]
            if len(test_indices) > 0:
                # Yielding training and testing indices
                # for the cross-validation generator
                yield training_indices, test_indices

CustomCrossValidation is a simple class with one method (split) that uses X (predictors), y (target values), and groups corresponding to the date groups. Those can be months or quarters in your dataset; I assumed that they can be mapped to integers that keep the order of time. Hence, if I have 3 quarters in the dataset, I can first have Q1, Q2, and Q3 as date values, and then simply map those to 0, 1, 2 to keep the order and use them in my validation generator class method.
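The mapping from date labels to ordered integers can be sketched in a few lines of plain Python. The helper labels_to_groups and the label format ("2020Q1" and so on) are assumptions for illustration; the important condition is that the labels sort chronologically as strings.

```python
def labels_to_groups(labels):
    """Map date labels to integers that preserve their sorted order."""
    ordered = sorted(set(labels))  # assumes string order equals time order
    mapping = {label: idx for idx, label in enumerate(ordered)}
    return [mapping[label] for label in labels]


quarters = ["2020Q1", "2020Q1", "2020Q2", "2020Q3", "2020Q3"]
print(labels_to_groups(quarters))
```

Wrapping the result in np.array(...) gives the groups array that the split method expects.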

The split method follows the naming of the scikit-learn cross-validation splitters. Here, I iterated over a range of integers (groups) to keep the order of dates, assigning the indices of the first group (t) as training indices and those of the next group (t + 1) as validation indices. In the end, the method yields the training and testing indices, as the cv parameter of GridSearchCV accepts an iterable that yields (training, testing) index pairs.

The example below displays how the custom split works with the groups. To have different sizes of date groups, I created 4 groups with 5 instances of 0s, 10 instances of 1s, 10 instances of 2s, and 5 instances of 3s:

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

groups_experiment = np.concatenate([np.zeros(5),  # 5 0s
                                    np.ones(10),  # 10 1s
                                    2 * np.ones(10),  # 10 2s
                                    3 * np.ones(5)  # 5 3s
                                    ]).astype(int)

for idx, (x, y) in enumerate(
    CustomCrossValidation.split(X_experiment,
                                y_experiment,
                                groups_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

With the groupings, the example dataset will look like this:

# The first 5 predictor values...
          0         1         2         3         4
0 -0.566298  0.099651  2.190456 -0.503476 -0.990536
1  0.174578  0.257550  0.404051 -0.074446  1.886186
2  0.314247 -0.908024 -0.562288 -1.412304 -1.012831
3 -1.106335 -1.196207 -0.479174  0.812526 -0.185659
4 -0.013497 -1.057711 -0.601707  0.822545  1.852278

# The first 5 target values...
            0
0   73.398681
1  195.221637
2 -139.402678
3 -124.863423
4   94.753517

# Groupings for the example dataset...
# The 0s are older date anchor values, whereas 3s the newest...
[0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3]

The groups will be used for having an order in the validation flow. Hence first the 0s are going to be used as the training set, and 1s as validation. Then the 1s are going to be used as training, the 2s as validation... The output of the example generated indices will be:

Split number: 0
Training indices: [0 1 2 3 4]
Test indices: [ 5  6  7  8  9 10 11 12 13 14]

Split number: 1
Training indices: [ 5  6  7  8  9 10 11 12 13 14]
Test indices: [15 16 17 18 19 20 21 22 23 24]

Split number: 2
Training indices: [15 16 17 18 19 20 21 22 23 24]
Test indices: [25 26 27 28 29]

To have an example setup, I will be using the Lasso Regression and try to optimize the alpha with Grid Search. In Lasso, when we have a larger alpha, this forces more coefficients to be 0. It is very common to search for the optimum values of alpha in a Lasso Regression.

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Instantiating the Lasso estimator
reg_estimator = linear_model.Lasso()
# Parameters
parameters_to_search = {"alpha": [0.1, 1, 10]}
# Splitter
custom_splitter = CustomCrossValidation.split(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Search setup
reg_search = GridSearchCV(
    estimator=reg_estimator,
    param_grid=parameters_to_search,
    scoring="neg_root_mean_squared_error",
    cv=custom_splitter)
# Fitting
best_model = reg_search.fit(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Best estimator and the number of cross-validation splits
print(best_model.best_estimator_)
print(best_model.n_splits_)

This will output the best estimator as follows, using the custom cross-validation. There will be 3 splits as we used 4 groups.

# Best model:
Lasso(alpha=0.1)

# Number of splits:
3

Voila, having a simple generator helped me to have a custom validation flow in a Grid Search optimization. I enjoy reading scikit-learn documentation. Besides the fact that reading is fun, it helps me to understand some statistical implementations better and tweak whenever it is necessary.

To have a complete set of examples, please refer to the Github repository. Happy reading the documentation!

Windowing in Streaming Data: Theory and a Scikit-Multiflow Example

Nazli Ander — Fri, 14 Aug 2020 06:37:30 +0000

With streaming data, it is almost impossible to store sufficient statistics over all the examples because of memory constraints. Moreover, the analysts in the information systems departments are usually more interested in the recent changes in the databases. With time, the properties of the datasets might change and the older data may not give the most important information for a statistical learning algorithm. The change in the properties of data over time is called concept drift. To increase memory efficiency and overcome concept drift, windowing mechanisms were developed [1].

In this short review, I want to present four commonly known examples of windowing algorithms on a theoretical level. Those include from simplest to the most complex: Landmark, Sliding, Time-Fading, and Adaptive Sliding.

For the applied level, scikit-multiflow [2] provides a nice complement to scikit-learn with streaming analytics algorithms. Out of my enthusiasm for the library and the algorithm, I applied ADWIN for detecting concept drift, which I will share at the end.

Landmark

Landmark contains the whole historical data, starting from a landmark point. In the example chart, the alpha point is chosen as a landmark. A system, following the landmark mechanism, is going to include all the data from the alpha point. Landmark mechanism is used when periodic results are needed for the statistical analysis (yearly, monthly, quarterly).
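As a minimal sketch of the landmark idea, the class below keeps a running mean over everything that arrives at or after the landmark point. The class name and the choice of the mean as the tracked statistic are assumptions for illustration.

```python
class LandmarkWindow:
    """Running mean over all values observed since a landmark timestamp."""

    def __init__(self, landmark):
        self.landmark = landmark
        self.count = 0
        self.total = 0.0

    def add(self, timestamp, value):
        # Everything before the landmark is ignored, nothing is ever evicted
        if timestamp >= self.landmark:
            self.count += 1
            self.total += value

    def mean(self):
        return self.total / self.count


window = LandmarkWindow(landmark=2)
for t in range(6):
    window.add(t, float(t))
print(window.mean())
```

For periodic results, one would create a fresh window at each new landmark, for example at the start of every month.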

Sliding Window

Sliding Window contains the data belonging to the time interval with fixed recency (ꞵ) and binary weighting. ꞵ defines the range of the window length. Hence, the starting point (timestamp) is calculated by subtracting ꞵ from the current timestamp (t). This requires the older data to be deleted in each iteration. A sliding window is used when the exact number of data points is critical for the statistical analysis, e.g., traffic monitoring, topic extraction on a news portal [3].
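The eviction step can be sketched with a deque: every new element pushes the start of the window to t - ꞵ, and anything at or before that point is dropped. The class name and the half-open boundary (the element exactly at t - ꞵ is evicted) are assumptions of this sketch.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the values newer than (current timestamp - beta)."""

    def __init__(self, beta):
        self.beta = beta
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Binary weighting: an element is either fully in or fully out
        while self.items and self.items[0][0] <= timestamp - self.beta:
            self.items.popleft()

    def mean(self):
        return sum(value for _, value in self.items) / len(self.items)


window = SlidingWindow(beta=3)
for t in range(6):
    window.add(t, float(t))
print(len(window.items), window.mean())
```

Memory stays bounded by the number of elements that fit into the last ꞵ time units, regardless of how long the stream runs.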

Time-Fading Window

A time-fading window is also called a damped window. It contains the data belonging to the time frame with fixed recency (ꞵ) and a linear or exponential decaying factor (f). The continuous decaying logic takes the recency into account: it puts more emphasis on the recent data and less on the older data, down to a certain threshold.

The time-fading window method can be considered as a special case of a sliding window or landmark method, depending on how the recency is defined.
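An exponentially decaying mean is one simple way to sketch the fading idea: on every new element, the previous sum and weight are multiplied by the decay factor f before the new value is added. The class name and the restriction to exponential decay are assumptions; a thresholded or linear decay would need extra bookkeeping.

```python
class FadingWindow:
    """Exponentially weighted mean with decay factor f (0 < f <= 1)."""

    def __init__(self, f):
        self.f = f
        self.weighted_sum = 0.0
        self.weight = 0.0

    def add(self, value):
        # Older contributions fade by f at each step, the newest has weight 1
        self.weighted_sum = self.f * self.weighted_sum + value
        self.weight = self.f * self.weight + 1.0

    def mean(self):
        return self.weighted_sum / self.weight


faded = FadingWindow(f=0.5)
plain = FadingWindow(f=1.0)
for value in [0.0, 0.0, 1.0]:
    faded.add(value)
    plain.add(value)
print(faded.mean(), plain.mean())
```

With f = 1 nothing fades and the sketch reduces to a plain running mean over everything seen so far, while a smaller f pulls the estimate towards the most recent values.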

Adaptive Sliding Window (ADWIN)

ADWIN keeps a window of variable size and compares the mean values of its older and newer parts against a threshold level (epsilon). If the mean values deviate by more than the threshold, it deletes the corresponding old part. It is thus adaptive to changing data: while a change is taking place, the window shrinks automatically, and while the data is stationary, the window grows to improve the accuracy.
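The shrinking step can be sketched naively in plain Python. The sketch below checks every split of the window and drops the oldest element while some split shows a mean difference above the Hoeffding-style cut threshold from the original ADWIN description. It assumes values in [0, 1], uses an exhaustive check instead of the bucket-based compression of the real algorithm, and is not the scikit-multiflow implementation; the function names are my own.

```python
import math


def cut_threshold(n_old, n_new, delta):
    """Epsilon for a split into n_old older and n_new newer elements."""
    n = n_old + n_new
    m = 1.0 / (1.0 / n_old + 1.0 / n_new)  # harmonic-mean style size term
    delta_prime = delta / n
    return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))


def shrink_window(window, delta=0.002):
    """Drop the oldest elements while any split deviates significantly."""
    window = list(window)
    changed = True
    while changed and len(window) > 1:
        changed = False
        for split in range(1, len(window)):
            old, new = window[:split], window[split:]
            diff = abs(sum(old) / len(old) - sum(new) / len(new))
            if diff > cut_threshold(len(old), len(new), delta):
                window = window[1:]  # forget the oldest element, then recheck
                changed = True
                break
    return window


drifting = [0.0] * 50 + [1.0] * 50
stationary = [0.5] * 100
print(len(shrink_window(drifting)), len(shrink_window(stationary)))
```

The drifting window loses part of its old, outdated prefix, while the stationary window is left untouched and would simply keep growing as more data arrives.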

A better visualization is presented on the personal website of Albert Bifet (author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams).

The intuition behind using ADWIN is to keep statistics from a window of variable size while detecting concept drift. By using the scikit-multiflow library I simulated a distorted data stream with a normal distribution.

The code below is used for catching the concept drift in the normal distribution (with a mean of 0 and a standard deviation of 0.25). I changed the stream values with the indices between 1000 and 2000 with a different normal distribution (with a mean of 1 and a standard deviation of 0.5). Hence, I expected a width change (decrease) between the stream values 1000 till 2000 and an increase in width till the end of the stream.

import numpy as np

from skmultiflow.drift_detection.adwin import ADWIN

adwin = ADWIN(delta=0.0002)
np.random.seed(42)  # Fixing the seed for reproducibility

# Simulating a data stream drawn from a normal distribution
mu, sigma = 0, 0.25  # mean and standard deviation
data_stream = np.random.normal(mu, sigma, 4000)

# Changing the data concept from index 1000 to 2000
mu_broken, sigma_broken = 1, 0.5
data_stream[1000:2000] = np.random.normal(mu_broken, sigma_broken, 1000)

width_vs_variance = []

# Adding stream elements to ADWIN and verifying if drift occurred
for idx in range(4000):

    adwin.add_element(data_stream[idx])

    if adwin.detected_change():
        print(f"Change in index {idx} for stream value {data_stream[idx]}")

    width_vs_variance.append((adwin.width, adwin.variance, idx))

The output of this small test for observing the width changes with detected concept drift in our streaming data is displayed in the chart below. As expected, we can observe a zigzag (return back) between the indices (t) 1000 and 2000, then the width grows till the end of the stream.

References

[1] Gama, J., Sebastião, R. & Rodrigues, P.P. On evaluating stream learning algorithms. Mach Learn 90, 317–346 (2013).

[2] Montiel, J., Read, J., Bifet, A., & Abdessalem, T. (2018). Scikit-multiflow: A multi-output streaming framework. The Journal of Machine Learning Research, 19(72):1−5.

[3] Youn, J., Choi, J., Shim, J., & Lee, S. (2017). Partition-Based Clustering with Sliding Windows for Data Streams. Lecture Notes in Computer Science, 289–303.

Scraping and Storing Crypto-currency Prices with Scala and PostgreSQL

Nazli Ander — Sun, 02 Aug 2020 22:11:35 +0000

Web scraping mostly involves text-intensive tasks such as product review scraping, gathering real-estate listings, or even tracking online reputation and presence. When one application scrapes only String data types for qualitative analysis, it may not need type safety. However, in case the end goal of the web scraping is to do quantitative analysis with prices or weather forecasts, using a type-safe language might be quite handy.

In this article, we aim to give a small and interesting example of price scraping for crypto-currencies by using Scala and storing the prices in a PostgreSQL database. To scrape the prices, we chose the CoinMarketCap homepage. It is a crypto-currency knowledge website, which also gives information on market capitalizations (relative market sizes), circulating supply, and trading volumes. Even though it is fascinating to see all that information together, to keep it simple we will be only scraping the prices.

This article might be considered as a tutorial, and it requires a basic level of knowledge of docker-compose and Scala.

A summary of the pipeline that this tutorial uses

Tools and Steps

While web scraping in Scala, we will be using an HTML parsing library called scala-scraper with JSoup. Following that, we will be inserting the scraped prices to the PostgreSQL database by using a functional JDBC tool called doobie.

Although we mentioned some fancy library and tool names, the real magic happens in case classes. For each call to the CoinMarketCap homepage, we aim to retrieve the long crypto-currency table with type safety. To do that we created CoinCreate and CoinInsert case classes and companion objects.

We will start explaining first the case classes together with their companion objects, as we aimed to model the data while creating those. Then we will explain the simple functions for retrieving the updated price table from the homepage. Lastly, we will explain how we inserted the table records into the PostgreSQL database running locally. We can power the database with this simple docker-compose file. In the docker-compose file, we initialized a PostgreSQL database with a name dev, username admin, and a password as admin.

The steps that are explained in this tutorial are displayed in the pipeline above.

Case Classes and Companion Objects

Although there might be different approaches to model the data, we can start by creating two case classes as CoinCreate and CoinInsert. Those will help us to keep the data types safe while scraping the price table and inserting into a database.

A view from the CoinMarketCap homepage prices table

CoinCreate aims to safely type a pair of a crypto-currency code and its current price. Thus, it has two parameters: code (referring to the currency code) and price (current price in USD). However, while thinking about its companion object, we need to consider the shape of the price records in each row. For instance, if we only use coin names and prices in our case class, their indices in an array of records will be 1 and 3. This is quite similar to column indices in tables.

By observing the price table (above you can find a screenshot from the homepage), we decide to use a companion object to have an apply method for functionally transforming an input of String List to CoinCreate. Although this transformation is not that straightforward, we can use helper functions to get only the coin code (getCoinCode) and transform the dollar price string into a double (numberStringToDouble).

case class CoinCreate(code: String, price: Double)

object CoinCreate {
  def apply(listOfElements: List[String]): CoinCreate = {
    CoinCreate(
      code = getCoinCode(listOfElements(1)),
      price = numberStringToDouble(listOfElements(3))
    )
  }

  def dollarToNumber(dlr: String): Option[String] = {
    val p = "[0-9.]+".r
    // Drop thousands separators first, otherwise "$9,500.12" would match only "9".
    p.findFirstIn(dlr.replace(",", ""))
  }

  def numberStringToDouble(strDlr: String): Double = {
    val numberStr = dollarToNumber(strDlr)
    numberStr.getOrElse("0").toDouble
  }

  def getCoinCode(strCoin: String): String = {
    strCoin.split(" ")(0)
  }
}

CoinInsert aims to safely type a triple consisting of a crypto-currency code, its current price, and a log timestamp that records the insertion time. We can use this case class while inserting a Vector of CoinCreate into PostgreSQL. As its parameters are so similar to those of CoinCreate, we can create a simple companion object to transform a CoinCreate into a CoinInsert. This object’s apply method takes a timestamp in milliseconds and adds it to a CoinCreate to obtain a CoinInsert.

Hence the only difference between a CoinCreate case class and CoinInsert case class will be the current Timestamp, notated as a logTimestamp parameter.

case class CoinInsert(code: String, price: Double, logTimestamp: Timestamp)

object CoinInsert {
  def apply(coin: CoinCreate, logTimestamp: Long): CoinInsert = {
    CoinInsert(
      code = coin.code,
      price = coin.price,
      logTimestamp = new Timestamp(logTimestamp)
    )
  }
}

Scraping Functions

Scraping with scala-scraper and JSoup is quite easy. First, we need to send a GET request to the homepage by creating a new JSoup browser, which enables us to fetch HTML from the web. Since we only need HTML parsing, JSoup was enough in this case; for pages that rely on JavaScript, other browser options could be used.

def siteConnect(url: String, browser: JsoupBrowser): browser.DocumentType = {
  browser.get(url) // Sends the GET request and parses the response into a document.
}

Using the document returned by the GET request, we need to find the main table and store it as a Vector of rows, where each row is itself a Vector of HTML elements. Luckily, when we specify that we are looking for a table element, scala-scraper’s table method does all the work for us.

def getCoinUpdatedTable(webPage: String,
                        tableNameInHTML: String): Vector[CoinCreate] = {
  val site = siteConnect(webPage, new JsoupBrowser()) // Connects to the webpage.

  val tab = site >> table(s"#${tableNameInHTML}") // Gets the table with the given name.

  val body = tab.slice(1, tab.length) // First index belongs to the header of the table.

  body.map(x => CoinCreate(x.map(_.text).toList)) // Table rows are mapped to CoinCreate case class.
}

Inside this function, we slice the Vector from the second element to the last, as the first row contains the column names (headers). The sliced Vector still holds the table rows as HTML elements. So we can benefit from functional programming to map over all the table rows (Vector elements), extract the text of each element, and then transform each row to CoinCreate (the comfort of having a tailor-made apply function).

Insertion Functions

doobie is an amazing functional JDBC layer for Scala. It provides a functional way to write any JDBC program. In this tutorial, we will keep it quite simple by writing only a connection Transactor to connect to the local PostgreSQL database and an insertion function to make the transactions with type-safety.

To connect to the database, we need to use a Transactor stating the type of the driver (in our case it is a PostgreSQL driver), the URL for the connection, the user name, the password, and an Execution Context (EC). The Transactor also needs an implicit ContextShift, which doobie uses for non-blocking operations. For testing our code, the doobie documentation recommends using a synchronous EC.

implicit val cs = IO.contextShift(ExecutionContexts.synchronous)

val xa = Transactor.fromDriverManager[IO](
  "org.postgresql.Driver", // driver classname
  "jdbc:postgresql://localhost:5432/dev", // connect URL (driver-specific)
  "admin", // user
  "admin", // password
  ExecutionContexts.synchronous
)

For writing a row by row insertion function we can use SQL interpolation. The function has an input of CoinInsert and an output of Update0 (representing a parameterized statement where the arguments are known).

def insertCoin(coinInsert: CoinInsert): Update0 =
  sql"""
     INSERT INTO coins (code, price, logTimestamp)
     VALUES (${coinInsert.code}, ${coinInsert.price}, ${coinInsert.logTimestamp})
     """.update

Lastly, we can send the GET request to the homepage by using the getCoinUpdatedTable function. This will return a Vector of CoinCreate.

Consequently, we can use this Vector (coinTable) to transform CoinCreate to CoinInsert case class and execute the insert statement we prepared in the previous step.

val coinTable =
    getCoinUpdatedTable("https://coinmarketcap.com", "currencies")

coinTable
    .foreach { coinCreate =>
        val logTimestamp = Calendar.getInstance.getTimeInMillis
        val coinInsert = CoinInsert(coinCreate, logTimestamp)
        insertCoin(coinInsert).run.transact(xa).unsafeRunSync
    }

Last Words

Thanks to scala-scraper and doobie, with only a few lines we were able to scrape the crypto-currency prices from CoinMarketCap and insert them into a local PostgreSQL database. Although the code does its job for now, the source code can be extended with exception handling and monitoring. You can find the whole project in this GitHub repository.

This article was originally published in the following link.

Using Selenium With Python in a Docker Container

Nazli Ander — Sun, 02 Aug 2020 21:15:59 +0000

While web scraping, I came across many useful applications, such as listing old prices of financial assets or finding current news topics. Although those examples are quite interesting to apply, there was frequently one main goal to reach at the end: creating a database with the scraped information.

Whenever I went a bit further with scraping, I ended up on websites that use JavaScript to display the data I needed. Hence, I bumped into Selenium, which is a web testing and automation tool. In this small write-up, I aim to list some steps that I find quite useful while setting up Selenium within a Docker container.

Introduction to Selenium WebDriver

Selenium WebDriver is a web automation or testing tool. It was created by Simon Stewart in 2006, as the first cross-platform testing framework that could control the browser from the OS level.

So with Selenium, I can run automated actions on browsers (clicks, hovers, and form fills) by communicating with them directly. Java, C#, PHP, Python, Perl, Go and Ruby are the supported languages for the bindings. Since I am more familiar with Python, I will be using it here.

To work on a browser, I need to choose among a set of browser options like Firefox, Chrome (Chromium), Edge, and Safari. In my experience, Chrome with the headless option (not generating a user interface) is the most performant one, hence I will be sticking to that.

Pulling the Image and Setting Up Google Chrome

To start with my custom Selenium-Python image, I need a Python image. In this write-up I picked version 3.8.

Then I can install Google Chrome on top of it. Remember, without Google Chrome itself, Selenium has no browser to drive for our tasks. There are a few steps for setting up Google Chrome in Linux:

Adding Google Chrome trusting keys to apt
Adding Google Chrome stable version to the repositories
Updating the repositories to see the stable version in apt
Installing google-chrome-stable

FROM python:3.8

# Adding trusting keys to apt for repositories
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -

# Adding Google Chrome to the repositories
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'

# Updating apt to see and install Google Chrome
RUN apt-get -y update

# Installing the stable version of Google Chrome
RUN apt-get install -y google-chrome-stable

Installing Chrome Driver

Selenium requires a driver interface to work with the defined browser. Hence, I need to find a way to install Chrome Driver in our Linux image. Here are the steps to follow for doing this:

Installing unzip, as we will need it for the zipped Chrome Driver
Downloading the Chrome Driver into a file called /tmp/chromedriver.zip (this name can be changed)
Unzipping /tmp/chromedriver.zip into a directory on the Linux executable path

After those steps, I need to set the DISPLAY environment variable to :99, as Selenium uses this display. It helps to avoid some crashes.

# Installing Unzip
RUN apt-get install -yqq unzip

# Download the Chrome Driver
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip

# Unzip the Chrome Driver into /usr/local/bin directory
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port as an environment variable
ENV DISPLAY=:99
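
Since Chrome and the Chrome Driver are downloaded separately in the steps above, it can be worth checking that their major versions agree once the image is built. The following is a minimal sketch using only the Python standard library. It assumes the binaries are called google-chrome and chromedriver and are on the PATH; the helper names (parse_major_version, installed_major_version, versions_match) are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_major_version(version_output: str) -> Optional[int]:
    """Extract the major version from text such as 'Google Chrome 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def installed_major_version(binary: str) -> Optional[int]:
    """Run `<binary> --version`; return None if the binary is not on the PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return parse_major_version(result.stdout)


def versions_match(chrome: Optional[int], driver: Optional[int]) -> bool:
    """Both versions must be known and equal."""
    return chrome is not None and chrome == driver


if __name__ == "__main__":
    # Binary names are assumptions; adjust them to the image at hand.
    chrome = installed_major_version("google-chrome")
    driver = installed_major_version("chromedriver")
    print(f"chrome={chrome} chromedriver={driver} match={versions_match(chrome, driver)}")
```

Running this script inside the container gives a quick sanity check before any Selenium code is started.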

Preparing the Docker for a Run

All the steps above were only for setting up Chrome in our Dockerfile. To run my Python application (app.py) using Docker, I also need the following lines in our Dockerfile.

COPY . /app
WORKDIR /app

RUN pip install --upgrade pip

RUN pip install -r requirements.txt

CMD ["python", "./app.py"]

Apart from those Docker settings, I would like to briefly mention some Docker-specific Chrome options for setting up the Chrome Driver via Python. I want to show those few options explicitly in one function called set_chrome_options, given in the example code below. I need four specific settings to run our Chrome Driver inside Docker:

Explicitly saying that this is a headless application with --headless
Explicitly bypassing the security level in Docker with --no-sandbox. There is a nice Stack Overflow thread about this: apparently, as the Docker daemon always runs as the root user, Chrome crashes.
Explicitly disabling the usage of /dev/shm/. The /dev/shm partition is too small in certain VM environments, causing Chrome to fail or crash.
Disabling the images with chrome_prefs["profile.default_content_settings"] = {"images": 2}.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def set_chrome_options() -> Options:
    """Sets chrome options for Selenium.
    Chrome options for headless browser is enabled.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_prefs = {}
    chrome_options.experimental_options["prefs"] = chrome_prefs
    chrome_prefs["profile.default_content_settings"] = {"images": 2}
    return chrome_options

if __name__ == "__main__":
    # Build the options with the function above and pass them to the driver
    driver = webdriver.Chrome(options=set_chrome_options())
    # Do stuff with your driver
    driver.quit()  # Ends the session and stops the Chrome Driver process
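
To see whether the /dev/shm concern applies to a given container, the size of that partition can be inspected from Python. This is only a sketch with the standard library: the function names are hypothetical and the 256 MB threshold is an arbitrary assumption, not a documented Chrome requirement. For reference, Docker gives a container 64 MB of shared memory by default unless --shm-size is set.

```python
import os
import shutil
from typing import Optional

BYTES_PER_MB = 1024 * 1024


def shm_total_mb(path: str = "/dev/shm") -> Optional[float]:
    """Return the total size of the shared memory mount in MB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / BYTES_PER_MB


def needs_shm_workaround(total_mb: Optional[float], min_mb: float = 256.0) -> bool:
    """True when shared memory is missing or below min_mb (an assumed threshold)."""
    return total_mb is None or total_mb < min_mb


if __name__ == "__main__":
    size = shm_total_mb()
    print(f"/dev/shm size in MB: {size}")
    print(f"--disable-dev-shm-usage advisable: {needs_shm_workaround(size)}")
```

If the check reports a small partition, keeping --disable-dev-shm-usage in the options is the safer choice.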

Last Words

Here is the Dockerfile that I used as an example. While creating it, I used the links that I shared to solve the problems that I faced. There might be other solutions to those problems, and I am curious to hear about them.

Until now, I have used it to scrape web archives for asset prices, books, yellow pages, and judgment texts. Although Selenium is not designed for web scraping, I leveraged this nice tool for taming websites that rely on JavaScript. But I should admit that, if the information I was looking for had not been hidden in JavaScript, I would definitely have been a lot happier using only Requests, BeautifulSoup4 and/or Scrapy for Python, because all of those are simpler to set up and more performant.
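
For comparison, when the data is already present in the served HTML, no browser is needed at all. The sketch below uses Python's standard-library html.parser as a stand-in for BeautifulSoup4, so that it runs without extra dependencies. The sample HTML and the TableTextParser class are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class TableTextParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # A new row starts.
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")  # A new, still empty cell.

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Static sample page (hypothetical data, no JavaScript involved).
SAMPLE_HTML = """
<table>
  <tr><th>Asset</th><th>Price</th></tr>
  <tr><td>BTC</td><td>$9,500.12</td></tr>
  <tr><td>ETH</td><td>$230.40</td></tr>
</table>
"""

parser = TableTextParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows  # The first row holds the column names.
print(header)
print(body)
```

With Requests and BeautifulSoup4 the same idea takes even fewer lines, which is part of why they are simpler to set up than Selenium.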

Happy Scraping!