<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: James</title>
    <description>The latest articles on Forem by James (@24mwangi).</description>
    <link>https://forem.com/24mwangi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F825841%2F0ecea710-2c51-4488-afe5-52e0ee82a404.png</url>
      <title>Forem: James</title>
      <link>https://forem.com/24mwangi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/24mwangi"/>
    <language>en</language>
    <item>
      <title>What is the Modern Data Stack?</title>
      <dc:creator>James</dc:creator>
      <pubDate>Fri, 29 Aug 2025 19:49:59 +0000</pubDate>
      <link>https://forem.com/24mwangi/modern-data-stack-3p6l</link>
      <guid>https://forem.com/24mwangi/modern-data-stack-3p6l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When I started working on personal data projects, I kept running into the same obstacles. Governance was missing, so I often questioned which version of the data was correct. Scaling pipelines meant starting over instead of building on what I had. Reproducibility was frustrating. Running the same process twice sometimes gave different outcomes. Even small updates could break the flow and leave me backtracking.&lt;/p&gt;

&lt;p&gt;These challenges are why hybrid architectures matter. Modern data work rarely lives in one place. Combining local systems with cloud platforms creates balance between control and scalability. Governance can be built into transformations, security can be enforced at multiple layers, and scaling does not depend on the limits of a single machine. A hybrid design makes the workflow more flexible while keeping it structured.&lt;/p&gt;

&lt;p&gt;I have pieced together some of these requirements in a project to demonstrate why they matter. The &lt;a href="https://github.com/24jmwangi/modern-datastack" rel="noopener noreferrer"&gt;&lt;strong&gt;Modern Data Stack&lt;/strong&gt;&lt;/a&gt; applies DataOps practices and analytics engineering principles to show what it looks like when governance, reproducibility, and scalability are considered from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why These Practices Matter
&lt;/h2&gt;

&lt;p&gt;DataOps applies the discipline of DevOps to data. It focuses on automating ingestion, testing transformations, and deploying changes with confidence. Analytics engineering builds on this foundation, shaping raw data into well-modeled tables that are easier to query and analyze.&lt;/p&gt;

&lt;p&gt;Together they provide answers to the problems I faced in my projects:&lt;br&gt;
• Silent pipeline failures are replaced with automated checks and alerts&lt;br&gt;
• Business logic lives in code rather than scattered spreadsheets&lt;br&gt;
• Environments can be recreated consistently with infrastructure as code&lt;/p&gt;

&lt;p&gt;The aim is not to collect more tools. It is to make the workflow reliable, transparent, and scalable.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsbb4f2h3omf3g6xfids.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsbb4f2h3omf3g6xfids.webp" alt=" " width="612" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project follows a layered approach to data.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze (Raw): Google Sheets data lands in PostgreSQL via Python scripts.&lt;/li&gt;
&lt;li&gt;Silver (Curated): New records are loaded incrementally into BigQuery.&lt;/li&gt;
&lt;li&gt;Gold (Analytics-ready): dbt Cloud transforms and tests the data, making it usable for analysis.&lt;/li&gt;
&lt;li&gt;Automation layer: Terraform provisions infrastructure, and GitHub Actions handle orchestration.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ingestion with Python and PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I started with Google Sheets as a source. Python scripts pull ticker data and load it into PostgreSQL. This creates a single entry point for raw data, instead of juggling multiple queries across tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;last_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_last_timestamp_from_bigquery&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;new_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_postgres_for_new_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;append_to_bigquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This incremental pattern ensures that only new records are processed, making ingestion both efficient and scalable.&lt;/p&gt;
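
&lt;p&gt;For context, here is a minimal sketch of what those three helpers could look like, assuming the &lt;code&gt;google-cloud-bigquery&lt;/code&gt; and &lt;code&gt;psycopg2&lt;/code&gt; client libraries plus hypothetical names (a &lt;code&gt;tickers&lt;/code&gt; table with an &lt;code&gt;ingested_at&lt;/code&gt; column in PostgreSQL, and a &lt;code&gt;dataset.tickers&lt;/code&gt; table in BigQuery); the actual implementation in the repository may differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery
import psycopg2

bq = bigquery.Client()

def get_last_timestamp_from_bigquery():
    # High-water mark: the latest timestamp already loaded into BigQuery.
    # Table and column names are hypothetical.
    query = "SELECT MAX(ingested_at) AS last_ts FROM `dataset.tickers`"
    row = list(bq.query(query).result())[0]
    return row.last_ts  # None on the very first run

def query_postgres_for_new_data(last_timestamp):
    # Pull only the rows PostgreSQL received after the high-water mark.
    conn = psycopg2.connect("dbname=raw user=loader")
    with conn, conn.cursor() as cur:
        if last_timestamp is None:
            cur.execute("SELECT * FROM tickers")
        else:
            cur.execute(
                "SELECT * FROM tickers WHERE ingested_at &gt; %s",
                (last_timestamp,),
            )
        return cur.fetchall()

def append_to_bigquery(new_data):
    # Streaming insert appends the new batch without touching existing rows.
    if new_data:
        bq.insert_rows(bq.get_table("dataset.tickers"), new_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the high-water mark in BigQuery itself means the pipeline can resume safely after a failure without reprocessing old rows.&lt;/p&gt;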

&lt;p&gt;&lt;strong&gt;Transformations with dbt Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o0w8a351cjw7l81gq7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o0w8a351cjw7l81gq7o.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;dbt Cloud handles the transformation logic. Models capture how raw data should be reshaped, while tests validate the assumptions. By codifying transformations, the workflow becomes both transparent and reproducible. Instead of second-guessing results, I can trust the outputs because the checks are built into the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure with Terraform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Provisioning is defined in Terraform. From databases to permissions, the setup can be recreated without manual steps. Version control captures every change, so the infrastructure evolves in the same structured way as the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD with GitHub Actions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub Actions orchestrate the workflow. Each commit can trigger ingestion, transformations, and tests. Deployments run automatically, so the pipeline is not dependent on manual execution. This brings consistency and speed to the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Small design choices have a big impact. A simple incremental load pattern can save hours when working with larger datasets.&lt;br&gt;
dbt is more than a transformation tool. It acts as a shared framework where logic, documentation, and testing converge.&lt;br&gt;
Infrastructure as code removes uncertainty. Rebuilding an environment becomes predictable instead of experimental.&lt;/p&gt;

&lt;p&gt;Data workflows are only as strong as the discipline behind them. Without governance, scaling, and reproducibility, even small projects become fragile. By weaving DataOps and analytics engineering principles into the design, the workflow stops being a collection of scripts and turns into a system that can grow, adapt, and be trusted.&lt;/p&gt;

&lt;p&gt;This is not a finished product, but a working demonstration of what modern data practices can look like in action. &lt;/p&gt;

&lt;p&gt;You can explore the full implementation here: &lt;strong&gt;&lt;a href="https://github.com/24jmwangi/modern-datastack" rel="noopener noreferrer"&gt;Modern Data Stack&lt;/a&gt;&lt;/strong&gt; on GitHub.&lt;/p&gt;

</description>
      <category>dataops</category>
      <category>dataengineering</category>
      <category>sql</category>
    </item>
    <item>
      <title>SQL Window Functions in 2 Minutes</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sat, 01 Feb 2025 08:29:45 +0000</pubDate>
      <link>https://forem.com/24mwangi/sql-window-functions-in-2-minutes-275c</link>
      <guid>https://forem.com/24mwangi/sql-window-functions-in-2-minutes-275c</guid>
      <description>&lt;h2&gt;
  
  
  What Are Window Functions?
&lt;/h2&gt;

&lt;p&gt;Unlike standard aggregate functions like SUM() and AVG(), window functions don’t merge rows into a single result. Instead, they compute values across a specific "window" of rows using the &lt;code&gt;OVER()&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Ranking employees by salary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary,  
RANK() OVER (ORDER BY salary DESC) AS salary_rank  
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Window Functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ranking Functions: &lt;code&gt;RANK(), DENSE_RANK(), ROW_NUMBER()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Aggregation Functions: &lt;code&gt;SUM(), AVG(), COUNT()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Offset Functions: &lt;code&gt;LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using PARTITION BY for Grouping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To calculate values within specific categories, use &lt;code&gt;PARTITION BY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT department, name,  
AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary  
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This computes the average salary within each department while keeping individual row details intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing Rows with &lt;code&gt;LAG() &amp;amp; LEAD()&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;LAG()&lt;/code&gt; and &lt;code&gt;LEAD()&lt;/code&gt; functions allow you to access previous or next row values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary,  
LAG(salary) OVER (ORDER BY salary) AS prev_salary,  
LEAD(salary) OVER (ORDER BY salary) AS next_salary  
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Perfect for tracking salary changes over time!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Moving Averages &amp;amp; Running Totals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use window frames (&lt;code&gt;ROWS BETWEEN&lt;/code&gt;) to calculate rolling averages or running totals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary,  
AVG(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg  
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use window functions to:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ Maintain row-level details while computing aggregates&lt;br&gt;
✅ Efficiently perform ranking, running totals, and comparisons&lt;br&gt;
✅ Eliminate the need for complex self-joins and subqueries&lt;/p&gt;
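
&lt;p&gt;If you want to try these patterns without setting up a database server, here is a small runnable sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (window functions need SQLite 3.25+, which is bundled with most modern Python builds); the table and values are made up for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Eng", 120), ("Bob", "Eng", 95),
     ("Carol", "Sales", 110), ("Dan", "Sales", 80)],
)

# Rank everyone, average per department, and keep a running total,
# all while preserving one output row per employee.
rows = conn.execute("""
    SELECT name, salary,
           RANK() OVER (ORDER BY salary DESC) AS salary_rank,
           AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary,
           SUM(salary) OVER (ORDER BY salary DESC) AS running_total
    FROM employees
""").fetchall()

for row in rows:
    print(row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;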

&lt;h2&gt;
  
  
  Platform-Specific Considerations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. General SQL Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most modern relational databases support window functions, but syntax and performance optimizations may differ:&lt;br&gt;
✅ Supported: PostgreSQL, MySQL (8.0+), SQL Server (2012+), Oracle (11g+), IBM Db2&lt;br&gt;
⚠️ Limited Support: MySQL (&amp;lt;8.0), SQLite (Partial)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Platform-Specific Optimizations &amp;amp; Differences&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PostgreSQL&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;✅ Fully supports window functions with &lt;code&gt;PARTITION BY, ORDER BY&lt;/code&gt;, and frame clauses.&lt;br&gt;
✅ Strong optimizer for &lt;code&gt;LAG(), LEAD()&lt;/code&gt;, and &lt;code&gt;RANK()&lt;/code&gt;.&lt;br&gt;
⚠️ No parallelism for window functions (performance can degrade with large datasets).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MySQL (8.0+)&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;✅ First MySQL version to introduce full window function support.&lt;br&gt;
✅ Supports ranking, offsets, and aggregates within &lt;code&gt;OVER()&lt;/code&gt;.&lt;br&gt;
⚠️ Limited &lt;code&gt;RANGE&lt;/code&gt;-based window frames (expression offsets require a numeric or temporal &lt;code&gt;ORDER BY&lt;/code&gt;).&lt;br&gt;
⚠️ No &lt;code&gt;DISTINCT&lt;/code&gt; inside window aggregates (e.g., &lt;code&gt;COUNT(DISTINCT col) OVER (...)&lt;/code&gt; fails).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SQL Server (2012+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;✅ Full support for window functions and frame clauses.&lt;br&gt;
✅ Optimized for parallel query execution with indexed partitions.&lt;br&gt;
⚠️ Pre-2012 versions require workarounds (e.g., using &lt;code&gt;CROSS APPLY&lt;/code&gt; for ranking).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Oracle (11g+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;✅ Supports advanced window functions like &lt;code&gt;FIRST_VALUE()&lt;/code&gt;, &lt;code&gt;LAST_VALUE()&lt;/code&gt;.&lt;br&gt;
✅ &lt;code&gt;MODEL&lt;/code&gt; clause allows complex analytics beyond window functions.&lt;br&gt;
⚠️ Requires explicit indexing for performance tuning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;IBM Db2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;✅ Offers robust support for window functions.&lt;br&gt;
✅ Can perform parallel processing of window functions.&lt;br&gt;
⚠️ Some advanced analytics require custom extensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Performance Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Indexing Matters: Ensure that columns used in &lt;code&gt;PARTITION BY&lt;/code&gt; or &lt;code&gt;ORDER BY&lt;/code&gt; are indexed.&lt;/p&gt;

&lt;p&gt;Query Execution Plans: Use &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; (PostgreSQL), &lt;code&gt;EXPLAIN FORMAT=JSON&lt;/code&gt; (MySQL), or &lt;code&gt;SET SHOWPLAN_ALL ON&lt;/code&gt; (SQL Server) to check performance.&lt;/p&gt;
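
&lt;p&gt;As a runnable stand-in for those commands, SQLite exposes the same idea through &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt;; a quick sketch continuing the &lt;code&gt;sqlite3&lt;/code&gt; example from earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.execute("CREATE INDEX idx_dept ON employees(department)")

# Ask the engine how it plans to execute the windowed query
# before running it against real data.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT name, AVG(salary) OVER (PARTITION BY department) FROM employees
""").fetchall()
print(plan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;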

</description>
      <category>sql</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Running Transformations on BigQuery using dbt Cloud: step by step</title>
      <dc:creator>James</dc:creator>
      <pubDate>Thu, 10 Aug 2023 12:37:06 +0000</pubDate>
      <link>https://forem.com/24mwangi/running-transformations-on-bigquery-using-dbt-cloud-step-by-step-11bo</link>
      <guid>https://forem.com/24mwangi/running-transformations-on-bigquery-using-dbt-cloud-step-by-step-11bo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In today's data-driven world, transforming raw data into valuable insights is crucial. This process, however, often involves complex tasks that demand efficiency, scalability, and reliability. Enter &lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;dbt Cloud&lt;/a&gt;—a powerful tool that simplifies data transformations on &lt;a href="https://cloud.google.com/bigquery/" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt;. In this article, I'll take you through a step-by-step guide on how to run BigQuery transformations using dbt Cloud. Let's dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
A dbt Cloud account&lt;br&gt;
A GCP account&lt;br&gt;
For BigQuery, ensure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A service account with elevated roles and a keys.json file&lt;/li&gt;
&lt;li&gt;The BigQuery API enabled&lt;/li&gt;
&lt;li&gt;Two BigQuery datasets: staging and production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setting Up Your dbt Cloud Project&lt;/strong&gt;&lt;br&gt;
Start by signing in to dbt Cloud and creating a new project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a project and define the connection to BigQuery. dbt requires you to upload a &lt;code&gt;keys.json&lt;/code&gt; file that contains your BigQuery credentials. Define your target dataset; this will be the destination for your transformed data during &lt;code&gt;development&lt;/code&gt;, in this case mentalhealth_staging. You can choose a managed repository or clone a repository from GitHub.&lt;/li&gt;
&lt;li&gt;Initialize your project. dbt will create starter files and folders for you.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zwnqa5e98r6r3wf6csr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zwnqa5e98r6r3wf6csr.PNG" alt="structure" width="367" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default, dbt only allows you to work on a branch, so create one or more branches to use for development and deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Creating Transformations&lt;/strong&gt;&lt;br&gt;
In dbt Cloud, transformations are defined as dbt models. A dbt model is a SQL file containing the transformation logic. Write your SQL queries to transform and reshape your data as needed. These models can join tables, aggregate data, and create calculated fields.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the &lt;code&gt;dbt_project.yml&lt;/code&gt;. The &lt;code&gt;dbt_project.yml&lt;/code&gt; file is the central configuration hub for a dbt (data build tool) project. It's a YAML (YAML Ain't Markup Language) file that defines project settings in a structured format and customizes how dbt behaves when executing your data transformations.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bigquery_proj&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# This setting configures which "profile" dbt uses for this project.
&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seeds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;macro&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;macros&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# directory which will store compiled SQL files
&lt;/span&gt;&lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="c1"&gt;# directories to be removed by `dbt clean`
&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbt_packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
&lt;/span&gt;
&lt;span class="c1"&gt;# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;bigquery_proj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;
    &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Next we define the dbt models for our transformations. We create staging models for development and core models for deployment, then add a macro to enhance our models.
&lt;code&gt;macros&lt;/code&gt; are reusable SQL code snippets that allow you to encapsulate and parameterize common SQL operations or patterns. Macros in dbt provide a way to abstract and simplify complex SQL logic, making your dbt models more modular, maintainable, and efficient. A sketch of how this expansion works follows the macro definition below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5717c5ab3a8m45ckvw24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5717c5ab3a8m45ckvw24.PNG" alt="models" width="359" height="167"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="c1"&gt;#
&lt;/span&gt;    &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="n"&gt;into&lt;/span&gt; &lt;span class="n"&gt;three&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt; 
&lt;span class="c1"&gt;#}
&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="nf"&gt;get_gender_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Gender&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;Gender&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="n"&gt;when&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;when&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;when&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;when&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;others&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%-&lt;/span&gt; &lt;span class="n"&gt;endmacro&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
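
&lt;p&gt;To see what this macro compiles to, you can render it with plain Jinja, the templating engine dbt uses under the hood. This is only an illustration with the &lt;code&gt;jinja2&lt;/code&gt; package; dbt's own compilation adds more context, but the macro expansion works the same way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from jinja2 import Template

# The macro body as it would live in macros/get_gender_properties.sql,
# followed by a model that calls it.
sql = Template("""
{%- macro get_gender_properties(gender) -%}
case {{ gender }}
    when 'male' then 'male'
    when 'female' then 'female'
    when 'f' then 'female'
    when 'm' then 'male'
    else 'others'
end
{%- endmacro -%}
select {{ get_gender_properties("Gender") }} as gendertype
from mental_health_table_1
""").render()

print(sql)  # the compiled SQL, with the case expression inlined
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;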



&lt;p&gt;&lt;code&gt;1. staging models&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;extract_mentalhealthdata.sql&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="n"&gt;select&lt;/span&gt; 
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt;
&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;macro&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coverting&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nf"&gt;get_gender_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gendertype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nf"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mentalhealth_staging&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mental_health_table_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above model extracts data from the BigQuery dataset &lt;em&gt;mentalhealth_staging&lt;/em&gt;. It also uses the macro defined earlier.&lt;br&gt;
Below is the schema for the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;


&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mentalhealth_staging&lt;/span&gt;
    &lt;span class="c1"&gt;#database: 
&lt;/span&gt;    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mentalhealth_staging&lt;/span&gt;
    &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mental_health_table_1&lt;/span&gt;



&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extract_mentalhealthdata&lt;/span&gt;
      &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract mental health and load to a staging table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unique_id&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
            &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt;
                &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;not_null&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Timestamp&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Time&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;survey&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;submitted&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Respondent&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Gender&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Respondent&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Country&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Respondent&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;live&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;United&lt;/span&gt; &lt;span class="n"&gt;States&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;territory&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;live&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self_employed&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Are&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;employed&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;family_history&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;illness&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;treatment&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Have&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;sought&lt;/span&gt; &lt;span class="n"&gt;treatment&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;work_interfere&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;feel&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="n"&gt;interferes&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;no_employees&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;does&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;remote_work&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;does&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tech_company&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;primarily&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;tech&lt;/span&gt; &lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;benefits&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Does&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;benefits&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;care_options&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;know&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;care&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;provides&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;wellness_program&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Has&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;ever&lt;/span&gt; &lt;span class="n"&gt;discussed&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt; &lt;span class="n"&gt;wellness&lt;/span&gt; &lt;span class="n"&gt;program&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;seek_help&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Does&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;learn&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;seek&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;anonymity&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Is&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;anonymity&lt;/span&gt; &lt;span class="n"&gt;protected&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;choose&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;take&lt;/span&gt; &lt;span class="n"&gt;advantage&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;substance&lt;/span&gt; &lt;span class="n"&gt;abuse&lt;/span&gt; &lt;span class="n"&gt;treatment&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;leave&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;How&lt;/span&gt; &lt;span class="n"&gt;easy&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;take&lt;/span&gt; &lt;span class="n"&gt;medical&lt;/span&gt; &lt;span class="n"&gt;leave&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mental_health_consequence&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;think&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;discussing&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;would&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="n"&gt;consequences&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;phy_health_consequence&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;think&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;discussing&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;physical&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;would&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="n"&gt;consequences&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;coworkers&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Would&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;willing&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;discuss&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;coworkers&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;supervisors&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Would&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;willing&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;discuss&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;direct&lt;/span&gt; &lt;span class="nf"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mental_health_interview&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Would&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;bring&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;potential&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;interview&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;phys_health_interview&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Would&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;bring&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;physical&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;potential&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;interview&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mental_vs_physical&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;feel&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;employer&lt;/span&gt; &lt;span class="n"&gt;takes&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;seriously&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;physical&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;obs_consequence&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Have&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;heard&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="n"&gt;consequences&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coworkers&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mental&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt; &lt;span class="n"&gt;additional&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;2. core models&lt;/code&gt;&lt;br&gt;
The core models are used to load data from the staging tables and persist it to production tables in the &lt;code&gt;mentalhealth_prod&lt;/code&gt; dataset&lt;/p&gt;

&lt;p&gt;&lt;code&gt;employee_dim.sql&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="n"&gt;select&lt;/span&gt; 
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt;
&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;
&lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;gendertype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;remote_work&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;tech_company&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nf"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mentalhealth_staging&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_mentalhealthdata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the schema defined for the core models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;


&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mentalhealth_staging&lt;/span&gt;
    &lt;span class="c1"&gt;#database: 
&lt;/span&gt;    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mentalhealth_staging&lt;/span&gt;
    &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extract_mentalhealthdata&lt;/span&gt;



&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;employee_dim&lt;/span&gt;
      &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create employee dim tablele&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unique_id&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
            &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt;
                &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;not_null&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gender_type&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Country&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;remote_work&lt;/span&gt;
          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tech_company&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure you configure the schemas and data sources correctly.&lt;/p&gt;

&lt;p&gt;Now run the staging model using &lt;code&gt;dbt build --select extract_mentalhealthdata.sql&lt;/code&gt;. This creates a table with the transformed data based on the model, and also runs the tests defined in &lt;code&gt;schema.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqfo2hpnqllysi7bnye0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqfo2hpnqllysi7bnye0.png" alt="staging model run" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Creating a deployment environment and running jobs in the development environment&lt;/strong&gt;&lt;br&gt;
In dbt Cloud, you can create environments such as "production" to manage the production run of your data transformation process. Configure the target dataset and other settings for your environment.&lt;br&gt;
Before deploying your transformations, it's wise to test them in the development environment. Trigger the job and monitor the logs to identify any issues. If everything runs smoothly, you're ready to move to the deployment phase.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a deployment environment in dbt Cloud and define the target dataset, in this case &lt;code&gt;mentalhealth_prod&lt;/code&gt;. Note that this deployment environment runs against the production branch created earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pfzifsqdxrjz6q0xtkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pfzifsqdxrjz6q0xtkt.png" alt="production-env" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next, create jobs to run your models in the production environment. Here you can define other parameters such as the job schedule, the commands to run (for example &lt;code&gt;dbt build&lt;/code&gt; and &lt;code&gt;dbt docs generate&lt;/code&gt;), and whether to generate documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf2qtmtrrkua3zshir5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf2qtmtrrkua3zshir5d.png" alt="create-job" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
Setting other parameters for the job runs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcbr2rimlg5qozmiyw5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcbr2rimlg5qozmiyw5y.png" alt="other parameters" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Monitoring and Maintenance&lt;/strong&gt;&lt;br&gt;
dbt Cloud provides a dashboard to monitor the status of your jobs. Keep an eye on the execution logs and any potential errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez4ey13xyen4sglp2tdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez4ey13xyen4sglp2tdh.png" alt="job runs" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
Regularly update your transformations as your data and business requirements evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Running BigQuery transformations with dbt Cloud streamlines the process of turning raw data into actionable insights. With a clear step-by-step approach, you can set up, develop, test, and deploy your transformations to production, ensuring your organization benefits from accurate and timely data-driven decisions.&lt;/p&gt;

&lt;p&gt;In just a few minutes, you've learned how to leverage dbt Cloud's capabilities to transform data in BigQuery, all while simplifying complex processes and increasing efficiency. So, what are you waiting for? It's time to unlock the potential of your data!&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>dbt</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Debugging Python Data Pipelines</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sat, 01 Jul 2023 13:52:31 +0000</pubDate>
      <link>https://forem.com/24mwangi/debugging-python-data-pipelines-a-step-by-step-guide-11g7</link>
      <guid>https://forem.com/24mwangi/debugging-python-data-pipelines-a-step-by-step-guide-11g7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we'll explore the process of debugging a Python data pipeline that fetches and stars GitHub repositories related to data engineering. Our pipeline will utilize the GitHub API to fetch repository information, process the data, and star the repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Setting Up Logging and Debugging Messages
&lt;/h2&gt;

&lt;p&gt;To begin, let's set up the logging module in Python to get valuable insights into our data pipeline's execution. We'll create a data_pipeline.py file and include the necessary imports and basic configuration for logging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# data_pipeline.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Your GitHub API credentials 
&lt;/span&gt;&lt;span class="n"&gt;GITHUB_API_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_GITHUB_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Fetching Data from GitHub API
&lt;/h2&gt;

&lt;p&gt;Next, we'll implement the function to fetch GitHub repositories related to data engineering. We'll use the popular requests library to make API calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_data_from_github&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/search/repositories&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataengineering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stars&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;desc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GITHUB_API_TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to fetch data from GitHub API: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Unit Testing for GitHub API
&lt;/h2&gt;

&lt;p&gt;To ensure the GitHub API function behaves correctly, let's write some unit tests using pytest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_data_pipeline.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;data_pipeline&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fetch_data_from_github&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Mock the API response for testing
&lt;/span&gt;    &lt;span class="n"&gt;data_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GITHUB_API_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TEST_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;data_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;MockApiResponse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;repositories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_data_from_github&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MockApiResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;around-dataengineering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://github.com/around-dataengineering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataengineering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://github.com/dataengineering&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Star the GitHub Repositories
&lt;/h2&gt;

&lt;p&gt;Now, we'll implement the function to star the fetched GitHub repositories. We'll use the pygithub library, which simplifies working with the GitHub API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;github&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Github&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;star_repositories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;github_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GITHUB_API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;repo_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_to_starred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starred repository: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to star repositories: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Debugging with Interactive Debugger (pdb)
&lt;/h2&gt;

&lt;p&gt;Now that we have our main functions implemented, let's use the interactive debugger pdb to trace and inspect the pipeline's execution. We'll add a breakpoint in the star_repositories function and run the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pdb&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;star_repositories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;github_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GITHUB_API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;repo_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;pdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_trace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Set a breakpoint here
&lt;/span&gt;            &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_to_starred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starred repository: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to star repositories: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Running the Pipeline and Debugging
&lt;/h2&gt;

&lt;p&gt;Finally, let's run the pipeline and debug it using the pdb interactive debugger. We'll execute the fetch_data_from_github and star_repositories functions in sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;repositories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_data_from_github&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;star_repositories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the pdb breakpoint is hit, you can inspect variable values, step through the code, and identify any issues. Use commands like next (n), step (s), and continue (c) to navigate through the code.&lt;/p&gt;
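
&lt;p&gt;For illustration, a brief session at that breakpoint might look like this, output abridged: &lt;code&gt;p&lt;/code&gt; prints the record about to be starred, &lt;code&gt;n&lt;/code&gt; executes the starring call, and &lt;code&gt;c&lt;/code&gt; continues to the next loop iteration. The printed value depends on the data you actually fetched.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Pdb) p repo['name']
'around-dataengineering'
(Pdb) n
(Pdb) c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;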

&lt;h2&gt;
  
  
  A few considerations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Handling API Rate Limits:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;GitHub limits how many requests you can make in a given time. We can handle this by checking the rate limit before making a request and pausing if we hit the limit, preventing errors in the process.&lt;/p&gt;
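
&lt;p&gt;As a minimal sketch, a helper could read GitHub's standard rate-limit headers and sleep until the quota resets. The helper name and its wiring are illustrative (not part of the pipeline above), and it reuses the &lt;code&gt;logger&lt;/code&gt; from Step 1:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def wait_if_rate_limited(response):
    # GitHub reports the remaining quota on every API response.
    remaining = int(response.headers.get('X-RateLimit-Remaining', '1'))
    if remaining == 0:
        reset_at = int(response.headers.get('X-RateLimit-Reset', '0'))
        sleep_for = max(reset_at - time.time(), 0) + 1
        logger.info(f"Rate limit reached; sleeping {sleep_for:.0f}s until reset")
        time.sleep(sleep_for)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You would call it with the response object after each &lt;code&gt;requests.get&lt;/code&gt;, before making the next request.&lt;/p&gt;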

&lt;ol start="2"&gt;
&lt;li&gt;Validating Data:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sometimes, the data returned by the API might be incomplete or incorrect. By validating the data—ensuring that each repository has the necessary information (like a name and URL)—we can avoid processing bad data and ensure our pipeline runs smoothly.&lt;/p&gt;
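
&lt;p&gt;A small, hypothetical helper along these lines could filter out bad records before we try to star them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def validate_repositories(repositories):
    # Keep only records that carry the fields the pipeline relies on.
    valid = []
    for repo in repositories:
        if repo.get('name') and repo.get('html_url'):
            valid.append(repo)
        else:
            logger.warning(f"Skipping incomplete repository record: {repo}")
    return valid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;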

&lt;ol start="3"&gt;
&lt;li&gt;Retrying Failed Requests:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Network issues can sometimes cause API requests to fail. Instead of just failing the pipeline, we can add a retry mechanism that automatically tries the request again a few times before giving up. This makes the pipeline more resilient to temporary glitches.&lt;/p&gt;
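
&lt;p&gt;One way to sketch this is a small wrapper around the &lt;code&gt;fetch_data_from_github()&lt;/code&gt; function defined earlier; the attempt count and backoff here are illustrative defaults. Because the fetch function returns an empty list on failure, an empty result is treated as retryable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def fetch_with_retries(max_attempts=3, backoff_seconds=2):
    for attempt in range(1, max_attempts + 1):
        repositories = fetch_data_from_github()
        if repositories:
            return repositories
        logger.warning(f"Attempt {attempt} returned no data; retrying...")
        time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;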

</description>
      <category>python</category>
      <category>data</category>
      <category>api</category>
    </item>
    <item>
      <title>Using pyspark to stream data from coingecko API and visualise using dash</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sun, 18 Jun 2023 15:29:54 +0000</pubDate>
      <link>https://forem.com/24mwangi/using-pyspark-to-stream-data-from-coingecko-api-and-visualise-using-dash-5g43</link>
      <guid>https://forem.com/24mwangi/using-pyspark-to-stream-data-from-coingecko-api-and-visualise-using-dash-5g43</guid>
      <description>&lt;h2&gt;
  
  
  Spark Streaming
&lt;/h2&gt;

&lt;p&gt;Spark Streaming is a fantastic tool that allows you to process and analyze continuous data streams in real-time. It's built on top of Apache Spark, a popular distributed computing framework. With Spark Streaming, you can handle data as it arrives, enabling near-instantaneous data processing.&lt;/p&gt;

&lt;p&gt;Imagine you have a river of data flowing continuously, like tweets from Twitter or sensor readings from IoT devices. Spark Streaming lets you break this data into small, manageable chunks called micro-batches. Each micro-batch represents a short period of time, such as 1 second or 1 minute. These micro-batches are then processed using the powerful Spark engine, enabling you to perform calculations, extract insights, and make informed decisions on the fly.&lt;/p&gt;

&lt;p&gt;Here are some key features of Spark Streaming that make it so valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Throughput: Spark Streaming can handle a massive amount of data coming in at high speeds. It achieves this by leveraging the parallel processing capabilities of Spark, allowing you to process data in parallel across multiple machines.&lt;/li&gt;
&lt;li&gt;Fault Tolerance: Spark Streaming is designed to be resilient to failures. If a node in the cluster crashes or a network issue occurs, Spark Streaming automatically handles it by redistributing the work to other available nodes. This ensures that your data processing pipeline remains reliable and uninterrupted.&lt;/li&gt;
&lt;li&gt;Scalability: As your data volume grows, Spark Streaming can easily scale to handle the increased load. You can add more machines to your Spark cluster, enabling you to process larger data streams without sacrificing performance.&lt;/li&gt;
&lt;li&gt;Windowed Operations: Spark Streaming allows you to perform computations over sliding windows of data. This means you can analyze data over a specific time frame, such as the last 5 seconds or the last 1 hour. It's useful for tasks like calculating moving averages or identifying trends in real-time data (see the short sketch after this list).&lt;/li&gt;
&lt;li&gt;Integration with Spark Ecosystem: Spark Streaming seamlessly integrates with other components of the Spark ecosystem. This means you can combine streaming data processing with Spark SQL for structured data queries, MLlib for machine learning, and GraphX for graph analytics. The integration opens up endless possibilities for real-time analytics.&lt;/li&gt;
&lt;/ul&gt;
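
&lt;p&gt;As a quick, minimal sketch of those windowed operations using the DStream API (the socket source, durations, and checkpoint path are placeholders, not part of the project below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WindowedCounts")
ssc = StreamingContext(sc, 10)     # 10-second micro-batches
ssc.checkpoint("/tmp/checkpoint")  # checkpointing is required for windowed state

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,  # add new batches
                                     lambda a, b: a - b,  # subtract expired ones
                                     60,                  # window: the last 60 seconds
                                     10))                 # slide: recompute every 10 seconds
counts.pprint()

ssc.start()
ssc.awaitTermination()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;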

&lt;p&gt;To start with, PySpark is a powerful framework that allows you to process large-scale data in a distributed and parallel manner. PySpark Streaming is an extension of PySpark that enables you to process real-time data streams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PySpark Streaming&lt;/strong&gt; is the Python API for Spark Streaming, which is a component of Apache Spark. Spark Streaming is a scalable and fault-tolerant stream processing system that allows you to process real-time data streams. It provides high-level APIs for programming in various languages, including Python through PySpark.&lt;/p&gt;

&lt;p&gt;PySpark Streaming offers a similar programming model and functionality as Spark Streaming, but with the convenience of using Python as the programming language. It allows you to build real-time data processing applications using the rich ecosystem of Python libraries and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fetching data from coingecko API and persisting in postgreSQL database
&lt;/h2&gt;

&lt;p&gt;The following is a streaming application that periodically fetches data from the CoinGecko API and writes it to a PostgreSQL database. It utilizes PySpark and psycopg2 libraries for data processing and database interaction respectively.&lt;/p&gt;

&lt;p&gt;First, it creates a Spark session and defines the schema for the streaming data. The CoinGecko API endpoint is set as &lt;code&gt;api_url&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.streaming&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unix_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DecimalType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LongType&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoinGeckoStreamingApp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Create a StreamingContext with a batch interval of 1 second
&lt;/span&gt;&lt;span class="n"&gt;ssc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamingContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_updated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;api_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/api/v3/coins/markets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, there are two important functions defined. &lt;code&gt;fetch_coingecko_data()&lt;/code&gt; fetches data from the CoinGecko API using the requests library and returns the JSON response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_coingecko_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vs_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;write_to_postgres(df)&lt;/code&gt; function converts the Spark DataFrame to a Pandas DataFrame, establishes a connection to the PostgreSQL database, and writes the data to the "coingecko_data" table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pandas_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Connection and cursor setup
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="c1"&gt;# Table existence check and creation
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="c1"&gt;# DataFrame column type conversion
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="c1"&gt;# Writing data to PostgreSQL table
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="c1"&gt;# Closing cursor and connection
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
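
&lt;p&gt;Since the body above is elided, here is one possible implementation as a sketch. It assumes a local PostgreSQL instance with the same credentials used by the Dash app later in this post, and uses plain row-by-row inserts, which are fine at this data volume:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def write_to_postgres(df):
    pandas_df = df.toPandas()

    conn = psycopg2.connect(host="localhost", port="5432", database="coin",
                            user="data_eng", password="data_eng")
    cur = conn.cursor()

    # Create the target table on first run.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS coingecko_data (
            id TEXT, name TEXT, symbol TEXT,
            current_price NUMERIC, last_updated TIMESTAMP
        )
    """)

    # Write each row to the coingecko_data table.
    for _, row in pandas_df.iterrows():
        cur.execute(
            "INSERT INTO coingecko_data (id, name, symbol, current_price, last_updated) "
            "VALUES (%s, %s, %s, %s, %s)",
            (row["id"], row["name"], row["symbol"],
             row["current_price"], row["last_updated"]),
        )

    conn.commit()
    cur.close()
    conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;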



&lt;p&gt;The code then fetches data from the CoinGecko API, creates a Spark DataFrame from the fetched data, performs any required transformations or computations, and writes the DataFrame to PostgreSQL using the &lt;code&gt;write_to_postgres(df)&lt;/code&gt; function, then sleeps for 60 seconds before repeating the process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_coingecko_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Perform transformations/computations on the DataFrame
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="nf"&gt;write_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Fetch data every 60 seconds
&lt;/span&gt;
&lt;span class="c1"&gt;# Start the streaming context
&lt;/span&gt;&lt;span class="n"&gt;ssc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ssc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Visualizing the data using Dash
&lt;/h2&gt;

&lt;p&gt;Below is a Python script that utilizes the Dash framework to create a dashboard for visualizing the cryptocurrency data. Let's go through some of the important snippets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;external_stylesheets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;themes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BOOTSTRAP&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;initializing the Dash application, setting its name as &lt;code&gt;__name__&lt;/code&gt; and applying a Bootstrap theme to the dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Div&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;dcc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price-chart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... More graph layouts ...
&lt;/span&gt;        &lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Div&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last-fetched&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;dcc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Interval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;interval-component&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;n_intervals&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;defining the layout of the dashboard using &lt;code&gt;html.Div&lt;/code&gt;, &lt;code&gt;dbc.Container&lt;/code&gt;, &lt;code&gt;dbc.Row&lt;/code&gt;, and &lt;code&gt;dbc.Col&lt;/code&gt; components. Each graph is placed within a &lt;code&gt;dcc.Graph&lt;/code&gt; component, and other components like the last fetched text and the interval component are added to the layout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price-chart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;figure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last-fetched&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;children&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;currency-filter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;interval-component&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_intervals&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_price_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currencies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... Fetching and processing data ...
&lt;/span&gt;    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_updated&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price_change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lines&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Price Change - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;currencies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Price Change Over Time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xaxis_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yaxis_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Price Change&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;last_fetched_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last fetched: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_fetched_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;defining a callback function &lt;code&gt;update_price_chart&lt;/code&gt;, which is triggered when the values of the &lt;code&gt;currency-filter&lt;/code&gt; dropdown or the &lt;code&gt;interval-component&lt;/code&gt; change. Inside the function, a connection is established with a PostgreSQL database using &lt;code&gt;psycopg2&lt;/code&gt;. Data is fetched based on the selected currencies, processed, and stored in DataFrames. The data is then used to generate a line plot using &lt;code&gt;plotly.graph_objects&lt;/code&gt;. The function also retrieves the current time and returns the figure and last fetched time text.&lt;/p&gt;

&lt;p&gt;Similarly, there are other callback functions defined (&lt;code&gt;update_volume_chart&lt;/code&gt;, &lt;code&gt;update_scatter_plot&lt;/code&gt;, &lt;code&gt;update_bar_chart&lt;/code&gt;, &lt;code&gt;update_pie_chart&lt;/code&gt;) that update the respective graphs based on the selected currencies and interval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli5cqqonz3sptjlgfhoi.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli5cqqonz3sptjlgfhoi.PNG" alt="price-change" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps: Deploying the Streaming App
&lt;/h2&gt;

&lt;p&gt;When it comes to deploying a PySpark streaming application, there are several options available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apache Spark Standalone Cluster&lt;/strong&gt;: You can deploy your PySpark streaming app on a standalone cluster using Spark's built-in cluster manager. It allows you to submit your application and distribute the workload across multiple machines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apache Hadoop YARN&lt;/strong&gt;: If you have a Hadoop cluster with YARN, you can leverage it to deploy your PySpark streaming app. YARN manages cluster resources, and you can submit your application to YARN for resource allocation and scheduling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apache Mesos&lt;/strong&gt;: Mesos is a cluster manager that simplifies the deployment of distributed applications, including PySpark streaming apps. It handles resource management and scheduling across a cluster of machines (note that Spark's Mesos support has been deprecated in recent releases).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Platforms&lt;/strong&gt;: Cloud platforms such as AWS, Google Cloud, and Azure offer managed Spark services that make it easy to deploy PySpark streaming applications. These services abstract away infrastructure management and provide scalable Spark clusters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Containerization&lt;/strong&gt;: Docker and containerization technologies allow you to package your PySpark streaming app with its dependencies into a container. Containers can be deployed on various platforms, providing consistency and portability across different environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider factors like ease of setup, scalability, and infrastructure management preferences when choosing a deployment option. Beginners are advised to start with a standalone cluster or local deployment (for example, via &lt;code&gt;spark-submit&lt;/code&gt;, as shown below), gradually exploring cloud platforms or containerization as they gain experience.&lt;/p&gt;
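
&lt;p&gt;For reference, submitting to a standalone cluster is a single command; the master host, port, and script name below are placeholders:&lt;br&gt;
&lt;code&gt;spark-submit --master spark://&amp;lt;master-host&amp;gt;:7077 streaming_app.py&lt;/code&gt;&lt;/p&gt;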

</description>
      <category>spark</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>kafka: event driven microservices</title>
      <dc:creator>James</dc:creator>
      <pubDate>Thu, 18 May 2023 18:46:43 +0000</pubDate>
      <link>https://forem.com/24mwangi/kafka-event-driven-microservices-1mei</link>
      <guid>https://forem.com/24mwangi/kafka-event-driven-microservices-1mei</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction:&lt;/strong&gt;&lt;br&gt;
Microservices architecture has gained significant popularity due to its ability to build scalable and maintainable systems. In this article, we'll explore the concept of event-driven microservices using Apache Kafka as the central event bus. This approach enables the construction of highly scalable, loosely coupled, and real-time systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  What are Event-Driven Microservices in Kafka?
&lt;/h2&gt;

&lt;p&gt;Event-driven microservices in Kafka refer to a software architecture pattern where individual microservices communicate asynchronously through events using Apache Kafka as the central event bus. In this pattern, services are decoupled and interact with each other by producing and consuming events.&lt;/p&gt;
&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Event Production:&lt;br&gt;
Each microservice produces events when certain actions or state changes occur within its domain. These events represent meaningful occurrences or updates within the service. Microservices publish these events to Kafka topics, specifying the topic that corresponds to the type of event being produced.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event Consumption:&lt;br&gt;
Other microservices interested in specific types of events subscribe to the relevant Kafka topics and consume the events. They receive events in the order they were produced and process them independently. Consuming microservices can perform various actions, such as updating their internal state, triggering further business logic, or producing new events in response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event Schema and Serialization:&lt;br&gt;
Kafka events are typically serialized in a specific format like JSON or Avro. Microservices need to agree on the schema and serialization format to effectively produce and consume events. Using a schema registry or versioning strategies helps maintain backward compatibility when evolving the event structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event Sourcing and Replay:&lt;br&gt;
Kafka's durability and retention capabilities make it suitable for event sourcing. Event sourcing involves persisting the state of an application as a sequence of events in Kafka. This enables auditing, rebuilding state, and maintaining a historical record of changes. Microservices can replay events to reconstruct their state at any point in time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability and Fault Tolerance:&lt;br&gt;
Kafka's distributed nature allows for high scalability and fault tolerance. Multiple instances of microservices can consume events from Kafka topics in parallel, enabling horizontal scaling. Kafka's replication ensures data durability and fault tolerance, ensuring events are not lost even in the event of failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event-Driven Processing and Analytics:&lt;br&gt;
Event-driven microservices architecture allows for real-time processing and analytics on the event stream. Microservices can analyze patterns, generate insights, and trigger actions based on events they consume. For example, services might detect anomalies, generate alerts, update real-time dashboards, or feed data into machine learning models for predictions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Integrating with Webhooks and Databases
&lt;/h3&gt;

&lt;p&gt;Webhooks and databases can be integrated into event-driven microservices architecture to enhance the functionality and enable seamless communication and data persistence. Here's how they can be used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhooks:&lt;/strong&gt;&lt;br&gt;
Webhooks are a way for applications to receive real-time notifications or callbacks when specific events occur. They can be integrated into event-driven microservices as follows:&lt;/p&gt;

&lt;p&gt;Event Notification: Instead of directly consuming events from Kafka topics, a microservice can register a webhook callback URL with another microservice or third-party service. When a relevant event occurs, the producing microservice publishes the event to Kafka and also triggers a webhook notification to the registered URL. The consuming microservice can then process the event by handling the webhook request.&lt;/p&gt;

&lt;p&gt;External Service Integration: Webhooks can be used to integrate with external services that don't natively support Kafka. For example, when an event occurs in your microservice, you can publish the event to Kafka and simultaneously send a webhook notification to an external service to keep them updated in real-time.&lt;/p&gt;

&lt;p&gt;Decoupled Communication: Webhooks provide a loosely coupled communication mechanism between microservices. Instead of direct service-to-service communication, one microservice can notify another through webhooks, allowing the services to evolve independently and reducing tight coupling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Integration:&lt;/strong&gt;&lt;br&gt;
Databases play a crucial role in event-driven microservices architecture for data persistence and maintaining application state. Here's how databases can be used:&lt;/p&gt;

&lt;p&gt;Stateful Microservices: Some microservices may require maintaining their state for various reasons. Databases can be used to store and retrieve the state information of these microservices. When events are consumed, the microservice can update its state in the database accordingly.&lt;/p&gt;

&lt;p&gt;Event Sourcing: Databases are often used for event sourcing, where events are stored in an event log or event store. Instead of relying solely on Kafka, events can be persisted in a database to support event replay, auditing, and rebuilding the state of microservices.&lt;/p&gt;

&lt;p&gt;Data Enrichment: Microservices may need to enrich the consumed events with additional data from external sources or reference data. Databases can be used to store and retrieve this additional data, allowing microservices to enrich the event data during event processing.&lt;/p&gt;

&lt;p&gt;Caching: Databases can be used as a caching layer to improve performance and reduce the load on microservices. Microservices can cache frequently accessed data from events in the database, avoiding repeated processing of the same events.&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating an Event-Driven Microservice with Webhooks and a Database
&lt;/h3&gt;

&lt;p&gt;You can find the entire code here &lt;a href="https://github.com/James-Wachuka/event-driven-microservices" rel="noopener noreferrer"&gt;github.com/James-Wachuka/event-driven-microservices&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example: Python code that publishes Kafka events and sends webhook notifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka producer configuration
&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Webhook URLs
&lt;/span&gt;&lt;span class="n"&gt;user_created_webhook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5000/webhook/user_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;order_placed_webhook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5000/webhook/order_placed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Create Kafka producer
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Publish user created event
&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;King&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Send webhook notification for user created event
&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_created_webhook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Publish order placed event
&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sofa&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_placed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Send webhook notification for order placed event
&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_placed_webhook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Close the producer connection
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Flask app with two webhook endpoints to handle user-created and order-placed events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/webhook/user_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_created_webhook&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Perform necessary actions or trigger other processes based on the user created event
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New user created:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Webhook received and processed successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/webhook/order_placed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_order_placed_webhook&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Perform necessary actions or trigger other processes based on the order placed event
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New order placed:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Webhook received and processed successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database example: The code below establishes a connection to a PostgreSQL database and creates tables for users and orders. It uses Kafka consumers to consume messages from the 'user_created' and 'order_placed' topics. The consumed messages are inserted into the corresponding database tables, and the changes are committed to the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Consume user_created and order_placed events
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_consumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;order_consumer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message_1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message_2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Insert user data into the database
&lt;/span&gt;    &lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO users (id, name) VALUES (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;order_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO orders (id, product, amount) VALUES (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New user created:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New order placed:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Close the database connection
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a Flask app to perform CRUD operations on the shared database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# API endpoint to update user information
&lt;/span&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/users/&amp;lt;user_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PUT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract updated user information from the request
&lt;/span&gt;        &lt;span class="n"&gt;user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Update user information in the database
&lt;/span&gt;        &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE users SET name = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Publish user_updated event to Kafka
&lt;/span&gt;        &lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_updated&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User updated successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}),&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
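
&lt;p&gt;One caution: the f-string queries in the two snippets above interpolate values directly into SQL, which is vulnerable to SQL injection. psycopg2 supports parameterized queries; a minimal sketch, reusing the &lt;code&gt;cursor&lt;/code&gt; and variables from the examples above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Let psycopg2 handle quoting and escaping instead of f-strings
cursor.execute("INSERT INTO users (id, name) VALUES (%s, %s)",
               (user['id'], user['name']))
cursor.execute("UPDATE users SET name = %s WHERE id = %s",
               (name, user_id))
conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;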



&lt;p&gt;Event-driven microservices with Kafka offer several benefits that make them significant in real-world scenarios:&lt;/p&gt;

&lt;p&gt;Scalability and Flexibility:&lt;br&gt;
By decoupling services through events, the architecture becomes more scalable and flexible. Each service can be developed, deployed, and scaled independently, allowing teams to work on different services concurrently. Services can be added, modified, or removed without impacting the entire system. Kafka's distributed nature ensures that events are reliably delivered to all interested services, even in high-traffic scenarios.&lt;/p&gt;

&lt;p&gt;Loose Coupling and Resilience:&lt;br&gt;
The event-driven approach promotes loose coupling between services. Services only need to know the structure of the events they produce and consume, not the specific implementation details of other services. This loose coupling makes the system more resilient to changes, as services can evolve independently without disrupting others. If a service is temporarily unavailable, events can be stored in Kafka until the service recovers.&lt;/p&gt;

&lt;p&gt;Real-Time Processing and Analytics:&lt;br&gt;
With an event-driven architecture powered by Kafka, it becomes easier to perform real-time processing and analytics on the event stream. Services can consume events, analyze patterns, generate insights, and trigger actions in real-time. For example, a service might detect anomalies, generate alerts, update real-time dashboards, or feed data into machine learning models for predictions.&lt;/p&gt;

&lt;p&gt;Integration and Ecosystem:&lt;br&gt;
Kafka has a rich ecosystem and supports a wide range of connectors, frameworks, and tools. This makes it easier to integrate with other systems and services, such as databases, data warehouses, stream processing frameworks, and monitoring tools. Kafka Connect enables seamless integration with external systems, while Kafka Streams and other stream processing frameworks provide powerful capabilities for data processing and transformations.&lt;/p&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
Event-driven microservices with Kafka provide a powerful approach to building scalable, resilient, and loosely coupled systems. This pattern is widely adopted in various domains, including e-commerce, finance, telecommunications, logistics, and IoT, where responsiveness, scalability, and adaptability are crucial for success.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kafka</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>kafka: distributed task queue</title>
      <dc:creator>James</dc:creator>
      <pubDate>Wed, 17 May 2023 12:35:59 +0000</pubDate>
      <link>https://forem.com/24mwangi/kafkadistributed-task-queue-1175</link>
      <guid>https://forem.com/24mwangi/kafkadistributed-task-queue-1175</guid>
      <description>&lt;p&gt;Apache Kafka is an open-source distributed streaming platform developed by the Apache Software Foundation. It is designed to handle high-throughput, fault-tolerant, and real-time data streaming scenarios. Kafka is built to handle large volumes of data streams, making it suitable for use cases such as real-time data pipelines, event sourcing, messaging systems, log aggregation, and more.&lt;/p&gt;

&lt;p&gt;Key concepts in Kafka:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt;: Topics are the core abstraction in Kafka and represent a specific stream of records. Producers publish messages to topics, and consumers subscribe to topics to consume those messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt;: Producers are responsible for publishing messages to Kafka topics. They write messages to specific topics, which are then stored and made available for consumption by consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumers&lt;/strong&gt;: Consumers subscribe to Kafka topics and consume messages from those topics in real-time. Multiple consumers can be part of a consumer group, where each consumer in the group reads from a different subset of partitions for parallel processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partitions&lt;/strong&gt;: Topics can be divided into multiple partitions, allowing for parallelism and scalability. Each partition is an ordered, immutable sequence of messages. Kafka distributes the partitions across different brokers in a Kafka cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Brokers&lt;/strong&gt;: Brokers form the Kafka cluster and are responsible for receiving messages from producers, storing them on disk, and serving them to consumers. Kafka brokers ensure fault tolerance by replicating partitions across multiple brokers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ZooKeeper&lt;/strong&gt;: Kafka has traditionally relied on Apache ZooKeeper for cluster coordination, managing metadata, and maintaining broker and consumer group information (newer releases can run without ZooKeeper using KRaft mode).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connect&lt;/strong&gt;: Kafka Connect is a framework for scalable and reliable integration of Kafka with external systems. It simplifies the development and management of connectors for data import/export to/from Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Distributed task queue system&lt;/strong&gt;: using Kafka to distribute tasks across multiple workers. Kafka is used to enqueue tasks, distribute them among workers, and track the status and results of task execution. This can be useful for implementing parallel processing or load balancing in data processing workflows. The entire code is contained here &lt;a href="https://github.com/James-Wachuka/python-kafka_distributed_task_queue" rel="noopener noreferrer"&gt;github.com/James-Wachuka/python-kafka_distributed_task_queue&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka configuration
&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;task_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;result_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create Kafka consumer and producer
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process tasks
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Received task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform task processing logic here
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Example: Uppercase the task
&lt;/span&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processed task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt; Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Send the result to the result topic
&lt;/span&gt;    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Result sent to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result_topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code shows a Kafka consumer that listens to the &lt;code&gt;task_topic&lt;/code&gt; and receives incoming tasks. Each task is processed by applying some logic (in this case, converting the task to uppercase), and the result is then sent to the &lt;code&gt;result_topic&lt;/code&gt; using a Kafka producer.&lt;/p&gt;

&lt;p&gt;Producer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka producer configuration
&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;task_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Create Kafka producer
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Enqueue tasks
&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Example tasks
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Enqueue the task to the task topic
&lt;/span&gt;    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task enqueued: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Close the producer connection
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we create a Kafka producer to enqueue tasks. We define a list of tasks, and each task is sent to the task_topic using the producer.&lt;/p&gt;

&lt;p&gt;To run this example, start multiple instances of the worker code in separate terminals. Then, execute the enqueuing code to send tasks to the Kafka topic. The workers will consume the tasks, process them, and send the results to the result topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download and extract Kafka&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build the Kafka project. Inside the Kafka folder, run:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;./gradlew jar -PscalaVersion=2.13.10&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start ZooKeeper:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start Kafka brokers:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;bin/kafka-server-start.sh config/server.properties&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install kafka-python: &lt;code&gt;pip install kafka-python&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create Kafka topics:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;bin/kafka-topics.sh --create --topic task_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1&lt;br&gt;
bin/kafka-topics.sh --create --topic result_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;run &lt;code&gt;consumer.py&lt;/code&gt; and &lt;code&gt;producer.py&lt;/code&gt; in separate terminals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify Output: The consumer will process the tasks produced by the producer and print the results to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Customization: adding error handling and task acknowledgement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To add error handling and task acknowledgement to the Kafka implementation, you can use Kafka's message acknowledgement feature and handle any exceptions that may occur during processing. Set &lt;code&gt;enable_auto_commit=False&lt;/code&gt; when creating the Kafka consumer to disable automatic offset commits; this allows the offset to be committed manually only after a task has been processed successfully. After processing a task, use the &lt;code&gt;add_callback()&lt;/code&gt; method on the future returned by the Kafka producer's &lt;code&gt;send()&lt;/code&gt; call to add a callback that runs when the result is successfully sent to the result topic, printing a success message. Likewise, use &lt;code&gt;add_errback()&lt;/code&gt; to add an error callback that runs if sending the result fails, printing an error message. After sending the result, explicitly call &lt;code&gt;consumer.commit()&lt;/code&gt; to commit the offset manually, marking the task as processed; see the sketch below.&lt;/p&gt;
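
&lt;p&gt;A minimal sketch of this pattern with kafka-python; the topic names mirror the earlier examples, and the callback bodies are illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka import KafkaConsumer, KafkaProducer

# Disable auto-commit so offsets are committed only after successful processing
consumer = KafkaConsumer('task_topic',
                         bootstrap_servers='localhost:9092',
                         enable_auto_commit=False)
producer = KafkaProducer(bootstrap_servers='localhost:9092')

def on_send_success(record_metadata):
    print(f"Result sent to {record_metadata.topic}")

def on_send_error(exc):
    print(f"Failed to send result: {exc}")

for message in consumer:
    try:
        task = message.value.decode('utf-8')
        result = task.upper()
        # Attach success and error callbacks to the send future
        future = producer.send('result_topic', result.encode('utf-8'))
        future.add_callback(on_send_success)
        future.add_errback(on_send_error)
        producer.flush()
        consumer.commit()  # acknowledge: mark the task as processed
    except Exception as exc:
        # Offset is not committed, so the task is re-read after a restart
        print(f"Error processing task: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;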

&lt;p&gt;&lt;strong&gt;Customization: enhancing the logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enhanced consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka configuration
&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;task_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;result_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Create Kafka consumer
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start consuming messages
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Simulate time-consuming task processing
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Delay of 5 seconds
&lt;/span&gt;
    &lt;span class="c1"&gt;# Perform task processing logic here
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Example: Uppercase the task
&lt;/span&gt;
    &lt;span class="c1"&gt;# Send the result to the result topic
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt; Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this updated code, we've added a &lt;code&gt;time.sleep(5)&lt;/code&gt; statement to simulate a time-consuming task that takes 5 seconds to process. You can modify the sleep duration to match your desired processing time. With this enhancement, each task received by the worker undergoes a 5-second delay before processing, which can simulate scenarios where tasks require significant computation or external resource access.&lt;/p&gt;

&lt;p&gt;The Kafka implementation described above, which is a distributed task queue, has several significant applications and benefits in the real world:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalable and Fault-tolerant Task Processing:&lt;/strong&gt; Kafka's distributed nature allows the implementation to handle high-volume task processing with scalability and fault tolerance. By running multiple worker consumers, you can distribute the workload across multiple machines or processes, achieving parallel processing and load balancing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Data Processing:&lt;/strong&gt; Kafka provides real-time data streaming capabilities. By using Kafka as the underlying messaging system for task distribution, you can process tasks as they arrive, enabling real-time data processing and reducing latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microservices Architecture:&lt;/strong&gt; Kafka is commonly used as a communication bus in microservices architectures. In this context, the distributed task queue can be used to coordinate and distribute tasks among various microservices. Each microservice can subscribe to the task topic, process the tasks independently, and publish the results to other topics or services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-driven Architectures:&lt;/strong&gt; Kafka enables event-driven architectures where systems react to events asynchronously. The task queue can be utilized to handle event-driven tasks, allowing systems to react to events in real-time and trigger corresponding actions or workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Big Data Processing:&lt;/strong&gt; Kafka is commonly used in big data processing pipelines. By incorporating the distributed task queue in such pipelines, you can distribute data processing tasks across multiple workers, enabling efficient and parallel processing of large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow Orchestration:&lt;/strong&gt; The task queue can be integrated into workflow orchestration systems, where tasks represent individual steps or actions in a larger workflow. The distributed task queue allows for efficient coordination, tracking, and monitoring of tasks in complex workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Analytics and Monitoring:&lt;/strong&gt; By processing tasks in real-time and generating results or metrics, the distributed task queue can facilitate real-time analytics and monitoring. This can be useful for applications such as fraud detection, anomaly detection, real-time analytics dashboards, and system monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
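
&lt;p&gt;To make the first point concrete, here is a minimal sketch of scaling out the worker. It assumes the kafka-python client and the topic and broker address used in the snippets above (adjust the names if yours differ); the group id is an arbitrary name. Every worker process started with the same &lt;code&gt;group_id&lt;/code&gt; joins one consumer group, and Kafka assigns each a disjoint subset of the topic's partitions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from kafka import KafkaConsumer

# Start several copies of this process: all workers sharing a group_id
# form one consumer group, and Kafka load-balances partitions among them.
consumer = KafkaConsumer(
    'tasks',                            # task topic (name assumed from above)
    bootstrap_servers='localhost:9092',
    group_id='task-workers',            # same group id on every worker
    auto_offset_reset='earliest',
)

for message in consumer:
    task = message.value.decode('utf-8')
    print(f"Worker got: {task}")
&lt;/code&gt;&lt;/pre&gt;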

&lt;p&gt;Overall, the Kafka implementation of a distributed task queue provides a flexible and scalable solution for handling tasks and processing data in real-time, making it a valuable component in various real-world scenarios, including microservices, big data processing, event-driven architectures, and real-time analytics.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kafka</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Building a Weather Data Pipeline with PySpark, Prefect, and Google Cloud</title>
      <dc:creator>James</dc:creator>
      <pubDate>Mon, 01 May 2023 15:02:55 +0000</pubDate>
      <link>https://forem.com/24mwangi/building-a-weather-data-pipeline-with-pyspark-prefect-and-google-cloud-19k8</link>
      <guid>https://forem.com/24mwangi/building-a-weather-data-pipeline-with-pyspark-prefect-and-google-cloud-19k8</guid>
      <description>&lt;p&gt;In this article, I'll walk you through how to build a data pipeline that fetches weather data for multiple cities, processes the data using PySpark, and stores the output in Google Cloud.&lt;/p&gt;

&lt;p&gt;We'll be using PySpark for distributed data processing, Prefect for workflow management, and Google Cloud Storage and BigQuery for data storage and processing. The code is available on &lt;a href="https://github.com/James-Wachuka/weather_data_pipeline" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, and the results are visualized in a &lt;a href="https://lookerstudio.google.com/embed/reporting/c2119762-c552-4cda-a5b6-2e8de78bfa7b/page/wojOD" rel="noopener noreferrer"&gt;Looker Studio dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pipeline fetches weather data for multiple cities from the OpenWeatherMap API using requests library. The data is then processed and filtered using PySpark, and the output is stored as a CSV file in Google Cloud Storage. Finally, the data is loaded into a BigQuery table for further analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vxgv06vxfsygpwmiwlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vxgv06vxfsygpwmiwlj.png" alt="data pipeline" width="794" height="296"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotFound&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;


&lt;span class="c1"&gt;# Define the OpenWeatherMap API key and base URL
&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;cities_100&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tokyo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sydney&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Moscow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Beijing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rio de Janeiro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mumbai&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cairo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rome&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Berlin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Toronto&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lagos&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bangkok&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Melbourne&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Johannesburg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cities_100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize a Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WeatherData&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define a function to fetch weather data for a city and return a Spark dataframe
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Send a request to the OpenWeatherMap API for the city's weather data
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the relevant weather data from the API response
&lt;/span&gt;    &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;wind_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a Spark dataframe with the weather data for the city
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_speed&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                               &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="c1"&gt;# Use the fetch_weather_data function to fetch weather data for all cities and merge them into a single dataframe
&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;city_weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city_weather_data&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city_weather_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Perform some basic processing and transformation on the weather data using PySpark
&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature &amp;gt; 10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg(Humidity)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AverageHumidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max(WindSpeed)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MaxWindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the final processed and transformed weather data
&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Write the weather data as a CSV file to a Google Cloud Storage bucket
&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_app_dez&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new BigQuery table and load the data from the CSV file
&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dez-dtc-23-384116.weather_app.weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;bigquery_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AverageHumidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLOAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MaxWindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLOAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;
&lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LoadJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SourceFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_leading_rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autodetect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows into BigQuery table &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding Prefect Functionality
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotFound&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prefect&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="c1"&gt;# Define the OpenWeatherMap API key and base URL
&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WeatherData&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define the cities list
&lt;/span&gt;&lt;span class="n"&gt;cities_100&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tokyo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sydney&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Moscow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Beijing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rio de Janeiro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mumbai&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cairo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cities_100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define a function to fetch weather data for a city and return a Spark dataframe
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Send a request to the OpenWeatherMap API for the city's weather data
&lt;/span&gt;        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract the relevant weather data from the API response
&lt;/span&gt;        &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;wind_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Create a Spark dataframe with the weather data for the city
&lt;/span&gt;        &lt;span class="n"&gt;city_weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind_speed&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city_weather_data&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city_weather_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;


&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
# Use the fetch_weather_data function to fetch weather data for all cities and merge them into a single dataframe
@task
def merge_weather_data(cities):
    weather_data = None
    for city in cities:
        city_weather_data = fetch_weather_data(city)
        if weather_data is None:
            weather_data = city_weather_data
        else:
            weather_data = weather_data.union(city_weather_data)
    return weather_data
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
&lt;span class="c1"&gt;# Perform some basic processing and transformation on the weather data using PySpark
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processed_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature &amp;gt; 10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg(Humidity)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AverageHumidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max(WindSpeed)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MaxWindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;processed_data&lt;/span&gt;

&lt;span class="c1"&gt;# Write the weather data as a CSV file to a Google Cloud Storage bucket
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_weather_data_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_app_dez&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new BigQuery table and load the data from the CSV file
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_weather_data_to_bigquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dez-dtc-23-384116.weather_app.weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;bigquery_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AverageHumidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLOAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SchemaField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MaxWindSpeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLOAT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;
    &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LoadJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SourceFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_leading_rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autodetect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows into BigQuery table &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weather Data Pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cities_100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weather_dat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;processed_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_dat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_weather_data_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_weather_data_to_bigquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the pipeline, you'll need a Google Cloud account (with billing enabled, or on the free tier) and an OpenWeatherMap API key.&lt;/p&gt;
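
&lt;p&gt;Rather than hard-coding &lt;code&gt;api_key = ""&lt;/code&gt; as in the snippets above, you can read the key from an environment variable. A minimal sketch (the variable name &lt;code&gt;OPENWEATHER_API_KEY&lt;/code&gt; is just a suggestion):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os

# Read the OpenWeatherMap key from the environment instead of the source code
api_key = os.environ.get("OPENWEATHER_API_KEY", "")
if not api_key:
    raise RuntimeError("Set OPENWEATHER_API_KEY before running the pipeline")
&lt;/code&gt;&lt;/pre&gt;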

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create a new Google Cloud Storage bucket to store the output data, along with a new BigQuery dataset and table. Then update the &lt;code&gt;bucket_name&lt;/code&gt; and &lt;code&gt;table_name&lt;/code&gt; variables in the &lt;code&gt;write_weather_data_to_gcs&lt;/code&gt; and &lt;code&gt;write_weather_data_to_bigquery&lt;/code&gt; tasks, respectively, to match the names you chose.&lt;/p&gt;
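
&lt;p&gt;If you prefer to create these resources from Python rather than the console, a sketch like the following works with the same client libraries the pipeline already imports (the project, bucket, and dataset names are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.cloud import storage, bigquery

bucket_name = "weather_app_dez"             # placeholder: your bucket name
dataset_id = "your-project-id.weather_app"  # placeholder: project.dataset

# Create the GCS bucket that will hold the CSV output
storage.Client().create_bucket(bucket_name)

# Create the BigQuery dataset; the load job itself creates the table
bigquery.Client().create_dataset(bigquery.Dataset(dataset_id), exists_ok=True)
&lt;/code&gt;&lt;/pre&gt;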

&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;To run the pipeline, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a terminal window and navigate to the project directory.&lt;/li&gt;
&lt;li&gt;Build the Docker image: &lt;code&gt;docker build -t weather-data-pipeline .&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run the Docker container: &lt;code&gt;docker run --rm -it -v $(pwd):/app -e GOOGLE_APPLICATION_CREDENTIALS=/app/your-credentials.json weather-data-pipeline&lt;/code&gt; (replace &lt;code&gt;your-credentials.json&lt;/code&gt; with the name of your Google Cloud Platform service account key file).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pipeline will run, and the output data will be written to the Google Cloud Storage bucket and BigQuery table you specified in the configuration step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;If you encounter any issues while running the pipeline, check the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure that the Google Cloud Platform credentials you specified are valid and have the appropriate permissions to access GCS and BigQuery.&lt;/li&gt;
&lt;li&gt;Ensure that the bucket and table names you specified in the configuration step are correct.&lt;/li&gt;
&lt;li&gt;Check the logs for any error messages that might indicate the cause of the issue.&lt;/li&gt;
&lt;/ul&gt;
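
&lt;p&gt;A quick way to verify the first two points is a short connectivity check with the same clients (the names below are the ones used in this post; adjust them to your setup):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.cloud import storage, bigquery

# Both calls fail fast with a clear error if credentials or permissions are wrong
print(storage.Client().bucket("weather_app_dez").exists())
print(bigquery.Client().get_table("dez-dtc-23-384116.weather_app.weather_data").table_id)
&lt;/code&gt;&lt;/pre&gt;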

&lt;p&gt;In conclusion, building a data pipeline that fetches and processes weather data using PySpark, Prefect, and Google Cloud is an exciting project that showcases the power of these technologies. With this pipeline, you can easily collect and analyze weather data for multiple cities, and use it for various applications such as predictive modeling and weather forecasting.&lt;/p&gt;

</description>
      <category>datapipeline</category>
      <category>pyspark</category>
      <category>gcp</category>
      <category>prefect</category>
    </item>
    <item>
      <title>prefect vs apache airflow</title>
      <dc:creator>James</dc:creator>
      <pubDate>Fri, 17 Feb 2023 17:13:58 +0000</pubDate>
      <link>https://forem.com/24mwangi/prefect-vs-apache-airflow-kh7</link>
      <guid>https://forem.com/24mwangi/prefect-vs-apache-airflow-kh7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Workflow orchestration&lt;/strong&gt; is the process of designing, executing, and monitoring complex workflows that involve multiple tasks or processes. This can involve coordinating a wide range of activities, from data ingestion, transformation, and analysis, to model training and deployment, to report generation and delivery.&lt;/p&gt;

&lt;p&gt;Workflow orchestration typically involves defining a set of tasks or activities that need to be executed, specifying their dependencies, and determining the order in which they should be executed. This can be done using a workflow management system, which provides a graphical interface for designing workflows and a runtime environment for executing them.&lt;/p&gt;

&lt;p&gt;Workflow orchestration has become increasingly important in the context of big data and machine learning, where the volume and complexity of data processing tasks require a coordinated approach. It can help to ensure that data is processed correctly and efficiently, and that the results are delivered in a timely and reliable manner.&lt;/p&gt;

&lt;p&gt;Some popular workflow management systems include &lt;em&gt;&lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;, &lt;a href="https://www.prefect.io/" rel="noopener noreferrer"&gt;Prefect&lt;/a&gt;, &lt;a href="https://nifi.apache.org/" rel="noopener noreferrer"&gt;Apache NiFi&lt;/a&gt;, &lt;a href="https://luigi.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Luigi&lt;/a&gt;, &lt;a href="https://azkaban.github.io/" rel="noopener noreferrer"&gt;Azkaban&lt;/a&gt;, and &lt;a href="https://oozie.apache.org/" rel="noopener noreferrer"&gt;Oozie&lt;/a&gt;&lt;/em&gt;. These systems provide a range of features for workflow orchestration, including task scheduling, dependency management, job monitoring, and error handling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwihf613crskbza74t958.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwihf613crskbza74t958.png" alt="prefect" width="410" height="123"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Prefect&lt;/strong&gt; is an open-source workflow management system that helps you define, schedule, and orchestrate complex data pipelines. It is written in Python and is designed to be flexible, extensible, and easy to use.&lt;/p&gt;

&lt;p&gt;One of the key features of Prefect is its focus on building workflows as Python functions. This allows you to define your workflows using familiar programming constructs, and to reuse existing Python code in your workflows. Prefect also provides a range of tools and utilities for working with data, such as data loaders, transformation functions, and caching.&lt;/p&gt;
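
&lt;p&gt;As a minimal illustration of this function-based style, here is a tiny flow sketch using the Prefect 1.x API (the task names are purely illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def transform(numbers):
    return [n * 2 for n in numbers]

@task
def load(numbers):
    print(f"Loaded {numbers}")

# Calling tasks inside the Flow context builds the dependency graph
with Flow("etl-demo") as flow:
    load(transform(extract()))

flow.run()
&lt;/code&gt;&lt;/pre&gt;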

&lt;p&gt;Prefect provides a number of key benefits, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalability:&lt;br&gt;
Prefect is designed to scale horizontally, so you can easily run your workflows across a cluster of machines or in a cloud environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring and debugging:&lt;br&gt;
Prefect provides real-time monitoring of your workflows, with detailed logging and error reporting. This makes it easy to identify and diagnose problems in your workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility:&lt;br&gt;
Prefect is designed to be highly configurable, so you can customize it to your specific needs. You can easily integrate Prefect with other systems and tools, such as databases, message queues, and cloud services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of use:&lt;br&gt;
Prefect provides a user-friendly interface for building and managing workflows, with a range of pre-built components and templates to help you get started.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19o9heinzlgd1mpitz0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19o9heinzlgd1mpitz0a.png" alt="airflow" width="345" height="146"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Apache Airflow&lt;/em&gt;&lt;/strong&gt; is an open-source workflow management system that is widely used for managing complex data processing workflows. It is written in Python and uses directed acyclic graphs (DAGs) to define and execute workflows.&lt;/p&gt;

&lt;p&gt;One of the key features of Airflow is its emphasis on modularity and extensibility. It provides a range of built-in operators for performing tasks, and you can also create your own custom operators to perform tasks that are specific to your workflow. Airflow also provides a web-based user interface for managing workflows, which makes it easy to schedule and monitor jobs.&lt;/p&gt;
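&lt;p&gt;For comparison, here is a minimal sketch of an Airflow DAG (assuming Airflow 2.x), where tasks are operator instances and dependencies are declared explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# a minimal Airflow DAG sketch, assuming Airflow 2.x
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_airflow",          # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract &amp;gt;&amp;gt; load  # load runs only after extract succeeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;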

&lt;p&gt;Some of the key benefits of Airflow include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalability:&lt;br&gt;
Airflow is designed to scale horizontally, which means you can easily run your workflows across a cluster of machines or in a cloud environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility:&lt;br&gt;
Airflow provides a rich set of features and can be customized to work with a wide range of systems and tools, including databases, message queues, and cloud services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring and debugging:&lt;br&gt;
Airflow provides real-time monitoring of your workflows, with detailed logging and error reporting. This makes it easy to identify and diagnose problems in your workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Community support:&lt;br&gt;
Airflow has a large and active community of users and contributors, which means you can benefit from a wide range of plugins, integrations, and other resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Prefect vs Airflow&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Prefect and Apache Airflow are both open-source workflow management systems that help you define, schedule, and orchestrate complex data pipelines. Here are some key differences between the two:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Workflow Definition: &lt;br&gt;
Both are written in Python, but they differ in how workflows are expressed: Airflow defines workflows explicitly as DAGs (Directed Acyclic Graphs) with declared task dependencies, while Prefect builds the workflow graph from ordinary Python function calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Architecture: &lt;br&gt;
Prefect is a modular system designed to be flexible and extensible, while Airflow has a more fixed, monolithic architecture built around its scheduler, webserver, metadata database, and workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of Use: &lt;br&gt;
Prefect is generally considered to be more user-friendly than Airflow, with a simpler API and a more intuitive user interface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: &lt;br&gt;
Both systems are designed to scale horizontally, but Prefect is known to be particularly good at handling large, complex workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ultimately, choosing between Prefect and Airflow depends on your specific use case and needs. If ease of use and flexibility are your priorities, Prefect may be a better choice. If you want an established system with a large community and many plugins available, Airflow may be a better choice. &lt;/p&gt;

</description>
      <category>vibecoding</category>
    </item>
    <item>
      <title>BigQuery: Creating a pipeline between MySQL and BigQuery using Airflow</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sun, 11 Dec 2022 13:00:22 +0000</pubDate>
      <link>https://forem.com/24mwangi/bigquery-creating-a-pipeline-between-mysql-and-bigquery-using-airflow-5dc7</link>
      <guid>https://forem.com/24mwangi/bigquery-creating-a-pipeline-between-mysql-and-bigquery-using-airflow-5dc7</guid>
      <description>&lt;h2&gt;
  
  
  What is BigQuery?
&lt;/h2&gt;

&lt;p&gt;BigQuery is a fully managed enterprise data warehouse that provides built-in features such as machine learning, geospatial analytics, and business intelligence to help you manage and analyze your data. Federated queries let it read data from external sources, while streaming allows continuous data updates. The data can be analyzed and understood using powerful tools such as BigQuery ML and BI Engine. BigQuery stores data in a columnar format optimized for analytical queries; it presents data as tables with rows and columns and supports full database transactional semantics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the environment
&lt;/h2&gt;

&lt;p&gt;Sign up for Google Cloud Platform and create a project, then select the project you want to use within your Google Cloud Console (typically from the drop-down menu in the console navigation). Enable the BigQuery API for the selected project. Create a Service Account and an IAM policy that allows access to BigQuery within your project, then generate the keys, preferably in JSON format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing BigQuery from Python
&lt;/h2&gt;

&lt;p&gt;To access BigQuery, use the &lt;code&gt;google-cloud-bigquery&lt;/code&gt; Python library. Create a client, use the given credentials, and connect to BigQuery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BigQuery dataset&lt;/strong&gt;&lt;br&gt;
Before loading data into BigQuery, a &lt;code&gt;create_dataset&lt;/code&gt; method is used to create a dataset in BigQuery where the tables/data will be stored.&lt;/p&gt;
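&lt;p&gt;A short sketch of both steps (assuming the same service-account key path and dataset name used later in the DAG):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: connect to BigQuery and create a dataset if it does not exist
import os
from google.cloud import bigquery

# point the client at the service-account key
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "data/introduction-to-gcp.json"

client = bigquery.Client()
dataset = bigquery.Dataset(f"{client.project}.dataset")
dataset.location = "europe-west2"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;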
&lt;h2&gt;
  
  
  Extracting data from the local db (MySQL)
&lt;/h2&gt;

&lt;p&gt;MySQL contains the classicmodels db, a sample database for a car retail company. From the database we can extract the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customers who made the most orders&lt;/li&gt;
&lt;li&gt;products that have the highest number of purchases&lt;/li&gt;
&lt;li&gt;customers who have spent the most&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data is extracted and transformed using pandas. &lt;code&gt;from airflow.hooks.mysql_hook import MySqlHook&lt;/code&gt; is used to connect to the local db (MySQL) from Airflow. Setting the host to &lt;code&gt;host.docker.internal&lt;/code&gt; enables access to the local db from inside the containers. The connection string/properties should be set in the Airflow web UI under the Admin/Connections menu.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# connecting to local db to query classicmodels db
    mysql_hook = MySqlHook(mysql_conn_id='mysql_default', schema='classicmodels')
    connection = mysql_hook.get_conn()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvi1t49twkh2i6k9rip7.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvi1t49twkh2i6k9rip7.PNG" alt="MySql connection properties" width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Inserting into BigQuery
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;client.load_table_from_dataframe(df, 'table_name')&lt;/code&gt; is a method used to insert data into BigQuery tables, given a DataFrame created from a query and the name of the target table in BigQuery.&lt;/p&gt;
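&lt;p&gt;For example (a sketch; the DataFrame contents are illustrative), the returned load job can be waited on to confirm the insert completed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: load a pandas DataFrame into a BigQuery table and wait for the job
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"productName": ["1968 Ford Mustang"], "quantity_ordered": [10]})
job = client.load_table_from_dataframe(df, "dataset.product_dem")
job.result()  # blocks until the load job finishes and raises on error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;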
&lt;h2&gt;
  
  
  Automating with Airflow
&lt;/h2&gt;

&lt;p&gt;The job runs every 20 minutes. The ETL is separated into 3 tasks, &lt;code&gt;creating_dataset &amp;gt;&amp;gt; truncating_tables &amp;gt;&amp;gt; inserting_data&lt;/code&gt;, which are executed using the &lt;code&gt;PythonVirtualenvOperator&lt;/code&gt; in Airflow.&lt;/p&gt;

&lt;p&gt;Airflow is run in Docker. A volume &lt;code&gt;./data:/opt/airflow/data&lt;/code&gt; is added to the docker-compose file to store the &lt;code&gt;classicmodels&lt;/code&gt; db and the JSON file that contains the Google application credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entire code for the DAG&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt; &lt;span class="c1"&gt;# for executing python functions
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonVirtualenvOperator&lt;/span&gt; &lt;span class="c1"&gt;# for working with venvs airflow
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.mysql_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MySqlHook&lt;/span&gt; &lt;span class="c1"&gt;# for connecting to local db
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="c1"&gt;# function to create a dataset in bigquery to store data
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotFound&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

    &lt;span class="c1"&gt;# setting application credentials to access biqguery
&lt;/span&gt;    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/introduction-to-gcp.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Create a dataset in Google BigQuery if it does not exist.
    :param dataset_id: Name of dataset
    :param region_name: Region name for data center, i.e. europe-west2 for London
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;europe-west2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;reference&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;

        &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# function to truncate tables before inserting data
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;truncate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

    &lt;span class="c1"&gt;# setting application credentials to access biqguery
&lt;/span&gt;    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/introduction-to-gcp.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# tables to truncate in biquery (*this service task is billed)
&lt;/span&gt;    &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.product_dem&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;table2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.toporders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;table3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.customer_spe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Truncate a Google BigQuery table
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;query1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; WHERE 1=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; WHERE 1=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;table3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; WHERE 1=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QueryJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_job1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_job2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_job3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# function to insert data from a pandas df to bigquery
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.mysql_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MySqlHook&lt;/span&gt; &lt;span class="c1"&gt;# for connecting to local db
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

    &lt;span class="c1"&gt;#setting application credentials to access biqguery
&lt;/span&gt;    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/introduction-to-gcp.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# connecting to local db to query classicmodels db
&lt;/span&gt;    &lt;span class="n"&gt;mysql_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MySqlHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mysql_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mysql_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;classicmodels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_conn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# querying the source db
&lt;/span&gt;    &lt;span class="c1"&gt;# products with highest number purchase
&lt;/span&gt;    &lt;span class="n"&gt;query1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; 
    SELECT productName , SUM(quantityOrdered) AS quantity_ordered&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        FROM  products, orderdetails&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        WHERE products.productCode = orderdetails.productCode&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        GROUP BY productName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        ORDER BY quantity_ordered DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        LIMIT 20;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# customers that have made the most orders
&lt;/span&gt;    &lt;span class="n"&gt;query2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT contactFirstName, contactLastName , COUNT(*) AS number_of_orders&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        FROM  customers, orders&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        WHERE customers.customerNumber = orders.customerNumber&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        GROUP BY customerName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        ORDER BY number_of_orders DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        LIMIT 20;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# customers that have spent more
&lt;/span&gt;    &lt;span class="n"&gt;query3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; 
    SELECT contactFirstName , contactLastName, SUM(quantityOrdered*priceEach) AS total_amount_spent&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        FROM  customers, orders, orderdetails&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        WHERE customers.customerNumber = orders.customerNumber AND orderdetails.orderNumber= orders.orderNumber&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        GROUP BY customerName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        ORDER BY total_amount_spent DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;        LIMIT 10;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;sql_query1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sql_query2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sql_query3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_query1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_query2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_query3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# load the data to bigquery tables
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.product_dem&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.toporders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.customer_spe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data successfully loaded into gc-bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;mysql_to_gcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mysql-to-gcp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#name of the dag
&lt;/span&gt;    &lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;creating_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonVirtualenvOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;creating-a-dataset-in-bigquery&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery-storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;system_site_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_to_gcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;truncating_tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonVirtualenvOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;truncating-tables-in-bigquery&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery-storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_to_gcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inserting_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonVirtualenvOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inserting-data-into-bigquery&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-bigquery-storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_to_gcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;message_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task-complete-message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_to_gcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;creating_dataset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;truncating_tables&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;inserting_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;message_out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Airflow-BigQuery integration
&lt;/h2&gt;

&lt;p&gt;Airflow provides a variety of operators, such as the Airflow BigQuery Operators, to assist you in managing your data. The BigQuery operators are widely used because they let a DAG manage warehouse resources and run analytical jobs directly. You can use Airflow BigQuery Operators to do the following (a short sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Control Datasets&lt;/li&gt;
&lt;li&gt;Table Management&lt;/li&gt;
&lt;li&gt;Run BigQuery Jobs&lt;/li&gt;
&lt;li&gt;Validate the Data&lt;/li&gt;
&lt;/ol&gt;
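&lt;p&gt;A sketch of dataset control and job execution (assuming the &lt;code&gt;apache-airflow-providers-google&lt;/code&gt; package is installed; the dataset name and query are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: managing BigQuery from an Airflow DAG via the Google provider package
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryInsertJobOperator,
)
from airflow.utils.dates import days_ago

with DAG("bigquery-operators-demo", start_date=days_ago(1), schedule_interval=None) as dag:
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create-dataset",
        dataset_id="dataset",
    )
    run_query = BigQueryInsertJobOperator(
        task_id="run-query",
        configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},
    )
    create_dataset &amp;gt;&amp;gt; run_query  # run the query only after the dataset exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;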

</description>
      <category>webdev</category>
    </item>
    <item>
      <title>Airflow: create your first Airflow DAG using tweepy</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sun, 23 Oct 2022 08:56:45 +0000</pubDate>
      <link>https://forem.com/24mwangi/airflow-create-your-first-airflow-dag-using-tweepy-1d2b</link>
      <guid>https://forem.com/24mwangi/airflow-create-your-first-airflow-dag-using-tweepy-1d2b</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Apache Airflow?&lt;/strong&gt;&lt;br&gt;
A data pipeline is a series of data processing tasks that must run between the source system and the target system to automate data movement and transformation.&lt;/p&gt;

&lt;p&gt;Apache Airflow is a batch-oriented tool for building data pipelines. It is used to programmatically create, schedule, and monitor data pipelines, a practice commonly known as workflow orchestration. Airflow is an open-source platform used to manage the various tasks related to data processing in a data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Apache Airflow work?&lt;/strong&gt;&lt;br&gt;
A data pipeline in Airflow is written as a Directed Acyclic Graph (DAG) in the Python programming language. By describing data pipelines as graphs, Airflow explicitly defines the dependencies between tasks. In DAGs, tasks are displayed as nodes, whereas dependencies between tasks are illustrated using directed edges between different task nodes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49b8o4u5axxtejve5l94.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49b8o4u5axxtejve5l94.PNG" alt="dag illustration" width="591" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The direction of the edges depicts the direction of the dependencies: an edge pointing from one task to another indicates which task must be completed before moving on to the next.&lt;br&gt;
A DAG is defined in Apache Airflow using Python code. The Python file describes the corresponding DAG's structure. As a result, each DAG file typically describes the various types of tasks for a given DAG, as well as the dependencies between them. Apache Airflow parses these files to create the DAG structure. Furthermore, Airflow DAG files include additional metadata that instructs Airflow when and how to execute them.&lt;/p&gt;

&lt;p&gt;The benefit of defining Airflow DAGs with Python code is that the programmatic approach gives users a lot of flexibility when building pipelines. Users, for example, can use Python code to generate a dynamic pipeline based on certain conditions. The adaptability allows for great workflow customization, allowing users to tailor Airflow to their specific requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduling and executing data pipelines in Apache Airflow&lt;/strong&gt;&lt;br&gt;
After defining the structure of a data pipeline as a DAG, Apache Airflow allows the user to specify a schedule interval for each DAG. The schedule dictates when Airflow runs a pipeline. As a result, users can instruct Airflow to run every week, day, or hour. Alternatively, you can define even more complex schedule intervals to deliver the desired workflow output.&lt;/p&gt;
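&lt;p&gt;For instance, a schedule interval can be a named preset, a cron expression, or a frequency (a short sketch of the three common forms):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: three common ways to express a DAG's schedule_interval
from datetime import timedelta

daily = "@daily"                          # preset: once a day at midnight
every_third_hour = "0 */3 * * *"          # cron expression
every_20_minutes = timedelta(minutes=20)  # frequency-based interval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;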

&lt;p&gt;To better understand how Airflow runs DAGs, we must first examine the overall process of developing and running DAGs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Components of Apache Airflow&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Airflow Scheduler is in charge of parsing DAGs, checking their schedule, monitoring their intervals, and scheduling DAG tasks for Airflow Workers to process if the schedule has passed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Airflow Workers are in charge of picking up and carrying out tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Airflow Webserver is used to visualize pipelines that are being run by the parsed DAGs. The web server also serves as the primary Airflow UI (User Interface), allowing users to track the progress of their DAGs and results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Airflow Operators&lt;/em&gt; are the foundation of Airflow DAGs. They contain the logic for how data in a pipeline is processed. A DAG task is defined by instantiating an operator.&lt;br&gt;
In Airflow, there are many different types of operators. Some, such as the PythonOperator, run general code supplied by the user, whereas others perform very specific actions, such as transferring data from one system to another.&lt;/p&gt;

&lt;p&gt;Some of the most commonly used Airflow operators are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PythonOperator: This class runs a Python function.&lt;/li&gt;
&lt;li&gt;BashOperator: This class runs a bash script.&lt;/li&gt;
&lt;li&gt;PythonVirtualenvOperator: executes Python callables inside a new Python virtual environment. The virtualenv package needs to be installed in the environment that runs Airflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;XComs&lt;/em&gt; (short for "cross-communications") are a mechanism that allows tasks to communicate with one another, as tasks are normally isolated and may run on different machines.&lt;br&gt;
An XCom is identified by a key (basically its name), as well as the task id and dag id it came from. XComs can have any (serializable) value, but they are only intended for small amounts of data; do not use them to pass large values around, such as dataframes.&lt;br&gt;
The &lt;code&gt;xcom_push&lt;/code&gt; and &lt;code&gt;xcom_pull&lt;/code&gt; methods on Task Instances are used to explicitly "push" and "pull" XComs to and from storage. If the &lt;code&gt;do_xcom_push&lt;/code&gt; argument is set to True, many operators will auto-push their results into an XCom key called &lt;code&gt;return_value&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Example DAG demonstrating the usage of XComs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example_xcom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@once&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;value_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;value_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pushes an XCom without a specific target&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value from pusher 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;value_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;push_by_returning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pushes an XCom without a specific target, just by returning it&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value_2&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;puller&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pull all previously pushed XComs and check if the pushed values match the pulled values.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# get value_1
&lt;/span&gt;    &lt;span class="n"&gt;pulled_value_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pulled_value_1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;value_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The two values differ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pulled_value_1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value_1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# get value_2
&lt;/span&gt;    &lt;span class="n"&gt;pulled_value_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push_by_returning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pulled_value_2&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;value_2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The two values differ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pulled_value_2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value_2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Your first Airflow DAG using tweepy&lt;/strong&gt;&lt;br&gt;
Using Airflow and tweepy to show trending hashtags every 20 minutes. First, create a new folder called tweepy_airflow. &lt;br&gt;
Run this command: &lt;code&gt;curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.3/docker-compose.yaml'&lt;/code&gt; to download the docker-compose file. The file contains several service definitions, including airflow-scheduler, airflow-webserver, airflow-worker, airflow-init, postgres, and redis.&lt;/p&gt;

&lt;p&gt;Add the data volume to the docker-compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # volume to store data
    - ./data:/opt/airflow/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize Airflow and run the DAGs&lt;br&gt;
Ensure Docker is running, then run &lt;code&gt;docker-compose up airflow-init&lt;/code&gt;. DAGs are placed in the dags folder. After Airflow is initialized, start the services by running &lt;code&gt;docker-compose up&lt;/code&gt;. Use &lt;code&gt;docker ps&lt;/code&gt; to confirm that the services are running, then navigate to localhost:8080.&lt;/p&gt;

&lt;p&gt;tweepy_dag.py (trending_hashtags DAG)&lt;br&gt;
The DAG contains the functions used to connect to Twitter. It uses &lt;code&gt;PythonVirtualenvOperator&lt;/code&gt; to create and activate the venv that provides the tweepy module. All the operations are wrapped in one callable function to avoid XCom issues.&lt;/p&gt;

&lt;p&gt;Entire code for the DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#importing libraries
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; 

&lt;span class="c1"&gt;#from app import tweepy
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt; &lt;span class="c1"&gt;# for executing python functions
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonVirtualenvOperator&lt;/span&gt; &lt;span class="c1"&gt;# for working with venvs airflow
&lt;/span&gt;
&lt;span class="c1"&gt;# function to get twitter API, get trending topics for Nairobi and log them
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_api&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# importing a package in the function ensures that it is accessible when the venv is created
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;
    &lt;span class="c1"&gt;# api credentials - input yours
&lt;/span&gt;    &lt;span class="n"&gt;consumer_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;consumer_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;access_token_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# authentication of consumer key and secret
&lt;/span&gt;    &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OAuthHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consumer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer_secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# authentication of access token and secret
&lt;/span&gt;    &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_access_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_token_secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweepy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;API&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_on_rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successfuly activated virtual env and connected to twitter API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# coordinates for Nairobi city
&lt;/span&gt;    &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.2921&lt;/span&gt;
    &lt;span class="nb"&gt;long&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;36.8219&lt;/span&gt;
    &lt;span class="c1"&gt;# methods to get trends- tweepy==4.6.0
&lt;/span&gt;    &lt;span class="n"&gt;closest_loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;closest_trends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;long&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;trends&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_place_trends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closest_loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;woeid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;trends_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trends&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trends&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;trend&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trend&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trends_&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trend&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="c1"&gt;# print hashtags
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;trending_hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trending_hashtags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#name of the dag
&lt;/span&gt;    &lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;get_api_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonVirtualenvOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;connecting_to_twitter_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tweepy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;system_site_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trending_hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;message_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_complete_message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trending_hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;get_api_2&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;message_out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Airflow web UI&lt;/strong&gt;&lt;br&gt;
In the web UI, locate the &lt;code&gt;trending_hashtags&lt;/code&gt; DAG, unpause it, and trigger a run. Use any of the view options to monitor the progress of the DAG, and check the task logs for the printed hashtags and possible errors. In Airflow 2.x you can do the same from the command line with &lt;code&gt;airflow dags unpause trending_hashtags&lt;/code&gt; and &lt;code&gt;airflow dags trigger trending_hashtags&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllx18mwgr6x24ii8doqf.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllx18mwgr6x24ii8doqf.PNG" alt="graph view of the dag" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5skppjpcquh14d7n49zv.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5skppjpcquh14d7n49zv.PNG" alt="tree view of the dag" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kl93c3f5e9n89p7ug4p.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kl93c3f5e9n89p7ug4p.PNG" alt="output of the DAG" width="800" height="236"&gt;&lt;/a&gt;&lt;br&gt;
When you are done, run &lt;code&gt;docker-compose down&lt;/code&gt; to stop the services.&lt;/p&gt;

&lt;p&gt;Airflow is becoming increasingly popular among software companies, financial institutions, game developers, and others. Libraries like DAGFactory, which generates DAGs from YAML configuration, make Airflow accessible to professionals across the enterprise, not just data engineers. Airflow DAGs are composed of tasks that execute operations via Operators, leveraging the power of Python, Bash, HTTP, and database functionality.&lt;/p&gt;

&lt;p&gt;You can take your data pipeline strategy to the next level with Airflow, orchestrating it across even the most complex environments.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>python</category>
      <category>tweepy</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Python ETL: Creating and automating a pipeline from Mysql to postgresql</title>
      <dc:creator>James</dc:creator>
      <pubDate>Tue, 11 Oct 2022 12:56:53 +0000</pubDate>
      <link>https://forem.com/24mwangi/python-etl-creating-and-automating-a-pipeline-from-mysql-to-postgresql-4b0g</link>
      <guid>https://forem.com/24mwangi/python-etl-creating-and-automating-a-pipeline-from-mysql-to-postgresql-4b0g</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is ETL?&lt;/strong&gt;&lt;br&gt;
In data processing, extract, transform, load (ETL) is a three-phase process in which data is extracted from a source, transformed (cleaned, validated, reshaped), and loaded into an output data container. Data can be collected from one or more sources and output to one or more destinations. ETL software typically automates the entire process and can run manually or on a recurring schedule, either as individual jobs or aggregated into batches.&lt;/p&gt;
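
&lt;p&gt;The three phases map naturally onto three small functions. Here is a minimal, hypothetical sketch (the file names, column names, and cleaning rules are placeholders for illustration, not part of any specific project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

# extract: read rows from a source (a CSV file in this sketch)
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# transform: clean and reshape the extracted rows
def transform(rows):
    return [
        {"name": row["name"].strip().title(), "amount": float(row["amount"])}
        for row in rows
        if row.get("amount")  # drop rows with a missing amount
    ]

# load: write the transformed rows to a destination
def load(rows, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "amount"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("source.csv")), "target.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;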

&lt;p&gt;ETL tools break down data silos and make it easier for data scientists to access, analyze, and transform data into business intelligence. In short, an ETL tool is a critical first step in the data warehousing process that ultimately enables you to make more informed decisions in less time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python for ETL&lt;/strong&gt;&lt;br&gt;
Python is a relatively easy programming language to learn and use, with an extremely active open-source community on GitHub that regularly releases new libraries and improvements. In recent years, Python has become a popular choice for data processing, data analysis, and data science (especially with the powerful Pandas library).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Python libraries useful for ETL&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; uses the dataframe as its in-memory data structure (similar to how data is handled in the R programming language). Besides the usual ETL features, Pandas supports many analytical and data visualization features; see the short sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=2ahUKEwjh7-K8kNj6AhUK6aQKHeNmAugQFnoECAoQAQ&amp;amp;url=https%3A%2F%2Fairflow.apache.org%2F&amp;amp;usg=AOvVaw388kHRAT6nV9AH5zbkuDlv" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;&lt;/strong&gt; is an open source workflow management tool. It can be used to create data ETL pipelines. Strictly speaking, it is not an ETL tool, but a orchestration tool that can be used to create, schedule, and track workflows. This means you can use Airflow to create a pipeline by merging different modules written independently from your ETL process.
The airflow workflow follows the concept of DAG (Directed Acyclic Graph). Airflow, also features a browser-based dashboard to visualize workflows and track execution of multiple workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PySpark&lt;/strong&gt; performs fast, distributed processing of huge amounts of data. If you are looking to build an ETL pipeline for big data or for processing data streams, PySpark is worth a close look.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=2ahUKEwiRmoeRkNj6AhUODewKHWBIBrcQFnoECDMQAQ&amp;amp;url=https%3A%2F%2Fgithub.com%2Fspotify%2Fluigi&amp;amp;usg=AOvVaw2k7zshlPcS21E1s_WOeS6O" rel="noopener noreferrer"&gt;Luigi&lt;/a&gt;&lt;/strong&gt; is a Python-based ETL engine created by Spotify but now available as an open source tool. It is a more complex tool and has powerful features for creating complex ETL pipelines. It handles dependency resolution, workflow management, visualization, error handling, command line integration, and more.It also comes with a web dashboard to track all ETL jobs&lt;/li&gt;
&lt;/ul&gt;
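
&lt;p&gt;To make the Pandas bullet concrete, here is a tiny self-contained transformation on made-up order data (the values are placeholders, not from any real dataset): derive a total per row, then aggregate spending per customer.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# toy order data held in a dataframe (illustrative values only)
orders = pd.DataFrame({
    "customer": ["Ann", "Ben", "Ann", "Cy"],
    "quantity": [2, 5, 1, 3],
    "price_each": [10.0, 4.0, 10.0, 7.5],
})

# transform: derive a total column, then aggregate spending per customer
orders["total"] = orders["quantity"] * orders["price_each"]
spending = (
    orders.groupby("customer", as_index=False)["total"]
          .sum()
          .sort_values("total", ascending=False)
)
print(spending)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;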

&lt;p&gt;&lt;strong&gt;Example of a pipeline from MySQL to PostgreSQL using Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pipeline extracts data from a MySQL database, transforms it with the pandas library, and loads it into an analytical database in PostgreSQL. The job is scheduled with the Windows Task Scheduler and updates the target database every 5 minutes.&lt;/p&gt;

&lt;p&gt;Install mysql - &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=2ahUKEwjPzr7Hkdj6AhUuhf0HHTNtCg4QFnoECAsQAQ&amp;amp;url=https%3A%2F%2Fdev.mysql.com%2Fdoc%2Fmysql%2Fen%2Fwindows-installation.html&amp;amp;usg=AOvVaw0ZzPzGm97q1MtwAfSh1xBJ" rel="noopener noreferrer"&gt;Installing MySQL on Microsoft Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install postgresql - &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=2ahUKEwj5s5Hkktj6AhVriP0HHZjRA2kQFnoECCQQAQ&amp;amp;url=https%3A%2F%2Fwww.postgresqltutorial.com%2Fpostgresql-getting-started%2Finstall-postgresql%2F&amp;amp;usg=AOvVaw2UMBJo8wd_4SGFbrZsiqmG" rel="noopener noreferrer"&gt;Install PostgreSQL on Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The source &lt;a href="https://github.com/James-Wachuka/python_etl/blob/master/data/mysqlsampledatabase.sql" rel="noopener noreferrer"&gt;database&lt;/a&gt; contains &lt;code&gt;classicmodels&lt;/code&gt;, a sample database for a car retail company.&lt;br&gt;
Queries to run against the source database: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customers who made the most orders&lt;/li&gt;
&lt;li&gt;products that have the highest number of purchases&lt;/li&gt;
&lt;li&gt;customers who have spent the most&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create the analytical database to which the data will be loaded: &lt;code&gt;CREATE DATABASE classicmodels_analysis&lt;/code&gt; (this is the name the connection code below expects).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating connections to the source and target databases&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mysql.connector&lt;/span&gt;

&lt;span class="c1"&gt;# source database
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_conn_mysql&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3306&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classicmodels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# start a connection
&lt;/span&gt;    &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;
&lt;span class="c1"&gt;# target database
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_conn_postgresql&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classicmodels_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# start a connection
&lt;/span&gt;    &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating and updating tables for the target database&lt;/strong&gt;&lt;br&gt;
The connection helpers above live in a module named &lt;code&gt;db_connect&lt;/code&gt;, which the following script imports.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;db_connect&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2.extras&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;extras&lt;/span&gt;

&lt;span class="c1"&gt;# ignore user warnings
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt;
&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;simplefilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;UserWarning&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cur1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_connect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_conn_mysql&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cur2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_connect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_conn_postgresql&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# creating and updating tables for the target database
&lt;/span&gt;&lt;span class="n"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE TABLE IF NOT EXISTS toporders(
            customername VARCHAR(255),
            number_of_orders INTEGER
        )
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; CREATE TABLE IF NOT EXISTS product_demand(
                productName VARCHAR(255),
                quantity_ordered INTEGER
                )
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE TABLE IF NOT EXISTS customer_spending(
            customername VARCHAR(255),
            total_amount_spent float8
        )
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        TRUNCATE TABLE toporders, product_demand, customer_spending;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# executing the queries against the target database
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cur2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--------- tables updated ----------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# commit schema changes
&lt;/span&gt;&lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
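
&lt;p&gt;Note the design: &lt;code&gt;CREATE TABLE IF NOT EXISTS&lt;/code&gt; makes the schema step safe to re-run, and the final &lt;code&gt;TRUNCATE&lt;/code&gt; empties the three tables so each scheduled run loads a fresh snapshot instead of appending duplicate rows.&lt;/p&gt;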



&lt;p&gt;&lt;strong&gt;Extracting data from the source database and performing simple transformations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# extracting data from the source database-views
#product demand-products with highest purchases
&lt;/span&gt;&lt;span class="n"&gt;query1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT productName , SUM(quantityOrdered) AS quantity_ordered&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       FROM  products, orderdetails&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       WHERE products.productCode = orderdetails.productCode&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       GROUP BY productName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       ORDER BY quantity_ordered DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       LIMIT 20;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# toporders- customers that have the most orders
&lt;/span&gt;&lt;span class="n"&gt;query2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT contactFirstName, contactLastName , COUNT(*) AS number_of_orders&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       FROM  customers, orders&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       WHERE customers.customerNumber = orders.customerNumber&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       GROUP BY customerName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       ORDER BY number_of_orders DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       LIMIT 20;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# customer spending- ustomers that have spent more
&lt;/span&gt;&lt;span class="n"&gt;query3&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT contactFirstName , contactLastName, SUM(quantityOrdered*priceEach) AS total_amount_spent&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       FROM  customers, orders, orderdetails&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       WHERE customers.customerNumber = orders.customerNumber AND orderdetails.orderNumber= orders.orderNumber&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       GROUP BY customerName&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       ORDER BY total_amount_spent DESC&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;       LIMIT 10;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# creating dataframes from the queries
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# performing some transformations on the dataframes- joining columns for first name and last name
&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customername&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactFirstName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactLastName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactFirstName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactLastName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customername&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactFirstName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactLastName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactFirstName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contactLastName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# converting the datatype of quantity ordered to integer for df-product demand
&lt;/span&gt;&lt;span class="n"&gt;data_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity_ordered&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Loading data into the target database&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# loading the df into the target database tables
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

    &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# SQL query to execute
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO %s(%s) VALUES %%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;extras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tuples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DatabaseError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------data updated/inserted into table----&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;execute_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_demand&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;execute_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;toporders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;execute_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_spending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# close connections for mysql and postgresql
&lt;/span&gt;&lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
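
&lt;p&gt;&lt;code&gt;psycopg2.extras.execute_values&lt;/code&gt; folds the row tuples into multi-row &lt;code&gt;INSERT&lt;/code&gt; statements, which is considerably faster than executing one &lt;code&gt;INSERT&lt;/code&gt; per row.&lt;/p&gt;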



&lt;p&gt;Below is the output from one of the tables in the target database:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlmpg6k7sh3v8g34hrgb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlmpg6k7sh3v8g34hrgb.PNG" alt="customer spending" width="470" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automating the job with Windows Task Scheduler&lt;/strong&gt;&lt;br&gt;
This ETL job is scheduled to run every 5 minutes for one day using the Windows Task Scheduler. &lt;code&gt;schedule_python_etl.bat&lt;/code&gt; activates the environment and runs the Python script.&lt;br&gt;
To create the task: &lt;code&gt;start-&amp;gt;task scheduler-&amp;gt;create a folder (mytask)-&amp;gt;create task (python_etl)-&amp;gt;trigger(repeat after 5 mins)-&amp;gt;action(start program-schedule_python_etl.bat)&lt;/code&gt;&lt;/p&gt;
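
&lt;p&gt;The batch file itself can be very small. A minimal sketch, assuming the repo lives at &lt;code&gt;C:\projects\python_etl&lt;/code&gt;, a virtual environment in &lt;code&gt;venv\&lt;/code&gt;, and a load script named &lt;code&gt;etl.py&lt;/code&gt; (all three are placeholder names; check the repo for the actual ones):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REM schedule_python_etl.bat - placeholder paths, adjust to your machine
cd /d C:\projects\python_etl
REM activate the virtual environment
call venv\Scripts\activate.bat
REM run the ETL script
python etl.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;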

&lt;p&gt;This is a simple example of a pipeline in Python; Apache Airflow could manage this job more efficiently. The entire code is available here: &lt;a href="https://github.com/James-Wachuka/python_etl" rel="noopener noreferrer"&gt;https://github.com/James-Wachuka/python_etl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Coding ETL processes in Python can take many forms, depending on technical requirements, business goals, which libraries are available, which tools they must integrate with, and how much developers want to build from scratch. Python is flexible enough that users can code almost any ETL process with native data structures.&lt;/p&gt;

</description>
      <category>python</category>
      <category>mysql</category>
      <category>taskscheduling</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
