<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mage AI</title>
    <description>The latest articles on Forem by Mage AI (@mage_ai).</description>
    <link>https://forem.com/mage_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F691496%2Fdffa11ef-06bc-41d8-bf5b-e719d4cbcb12.png</url>
      <title>Forem: Mage AI</title>
      <link>https://forem.com/mage_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mage_ai"/>
    <language>en</language>
    <item>
      <title>Streamline data transfer: The ultimate data integration guide SFTP to BigQuery</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 07 Mar 2025 21:15:16 +0000</pubDate>
      <link>https://forem.com/mage_ai/streamline-data-transfer-the-ultimate-data-integration-guide-sftp-to-bigquery-501g</link>
      <guid>https://forem.com/mage_ai/streamline-data-transfer-the-ultimate-data-integration-guide-sftp-to-bigquery-501g</guid>
      <description>&lt;p&gt;&lt;em&gt;Link to original article written by Mage DevRel, Cole Freeman: &lt;a href="https://www.mage.ai/blog/streamline-data-transfer-the-ultimate-data-integration-guide-sftp-to-bigquery" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/streamline-data-transfer-the-ultimate-data-integration-guide-sftp-to-bigquery&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Securely move data from SFTP to Google BigQuery using Mage Pro for automated analytics. This guide walks you through setting up a pipeline, connecting to SFTP, choosing data streams, configuring replication, transforming data, connecting to BigQuery, triggering syncs, and verifying data. Automate your data flow, eliminate manual errors, and get valuable business insights!&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Business use case&lt;/li&gt;
&lt;li&gt;Step by step implementation guide&lt;/li&gt;
&lt;li&gt;Step 1: Creating a new pipeline&lt;/li&gt;
&lt;li&gt;Step 2: Configuring the SFTP connection&lt;/li&gt;
&lt;li&gt;Step 3: Selecting data streams&lt;/li&gt;
&lt;li&gt;Step 4: Configuring data replication methods&lt;/li&gt;
&lt;li&gt;Step 5: Transforming data&lt;/li&gt;
&lt;li&gt;Step 6: Setting up Google BigQuery as destination&lt;/li&gt;
&lt;li&gt;Step 7: Triggering the data sync&lt;/li&gt;
&lt;li&gt;Step 8: Verifying data synced to Google BigQuery&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Are you struggling with secure data transfers from SFTP to your data warehouse for analytics? This guide demonstrates how to build automated, scalable data integration pipelines using Mage to transform complex data workflows into straightforward processes that power your analytics infrastructure. In the following sections, we'll walk through setting up Mage Pro, creating your first pipeline, and establishing reliable data synchronization to Google BigQuery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business use case
&lt;/h2&gt;

&lt;p&gt;Many companies face a common challenge when managing transaction data from multiple systems, and often multiple vendors. These organizations regularly receive sensitive financial information via SFTP as CSV files, while their analytics teams need that data in a centralized warehouse like Google BigQuery to analyze core business operations. Without automation, transferring data from SFTP servers to analytics platforms creates multiple pain points. Manual transfers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consume valuable time;&lt;/li&gt;
&lt;li&gt;Introduce security vulnerabilities when handling sensitive information;&lt;/li&gt;
&lt;li&gt;Lead to data inconsistencies that impact analysis quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementing an automated SFTP to BigQuery pipeline addresses these challenges by streamlining the entire data flow. The business benefits are substantial. An automated pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces processing time;&lt;/li&gt;
&lt;li&gt;Eliminates human error in transfers;&lt;/li&gt;
&lt;li&gt;Applies data transformations consistently;&lt;/li&gt;
&lt;li&gt;Improves monitoring capabilities;&lt;/li&gt;
&lt;li&gt;Makes data more readily available to analytics teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This integration is particularly valuable for any organization that regularly receives data via SFTP but needs that information in a modern data warehouse to power analytics, reporting, and business intelligence. By creating a reliable pipeline between these systems, companies can focus less on data logistics and more on extracting valuable insights that drive business decisions. Let’s get started building an integration pipeline in Mage Pro.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step implementation guide
&lt;/h2&gt;

&lt;p&gt;To begin, log into your Mage Pro account. From the homepage, open the navigation menu and you’ll see several options. Click the button labeled "Pipelines" to be directed to the pipelines page, where all your data integration workflows are managed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftugq4pl2udwnaukoqm4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftugq4pl2udwnaukoqm4a.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Creating a new pipeline
&lt;/h2&gt;

&lt;p&gt;Once you are on the pipelines page, it’s time to create a new pipeline. Click on the green "New pipeline" button. You will be prompted to choose how you want to start your pipeline configuration. For this tutorial, select the option to "Start from Scratch."&lt;/p&gt;

&lt;p&gt;Next, choose "Data Integration" as the type of pipeline you want to create. Name your pipeline something descriptive to easily identify it later. After naming it, hit the "Create new pipeline" button to proceed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92zhv7ogbh5qpo0fm1m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92zhv7ogbh5qpo0fm1m9.png" alt="Image description" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configuring the SFTP connection
&lt;/h2&gt;

&lt;p&gt;After creating your pipeline, you will enter the pipeline editor. Here, you’ll need to set up the SFTP connection. Click the "Select source" dropdown, scroll down the list until you see SFTP, and click on it.&lt;/p&gt;

&lt;p&gt;This action will populate a YAML configuration file in the editor. You’ll need to fill in the necessary details for the SFTP connection, including the host, port, username, and password. Additionally, specify the table name to pull data from, and set the folder prefix to "export," including any subfolder path if necessary.&lt;/p&gt;

&lt;p&gt;Once you've entered the connection details, it's crucial to test the SFTP connection to ensure everything is set up correctly. Look for the "Test Connection" button and click it. If everything is configured properly, you should receive a positive confirmation of the connection.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4li3boyv7vnk8vo25kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4li3boyv7vnk8vo25kv.png" alt="Image description" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Selecting data streams
&lt;/h2&gt;

&lt;p&gt;With a successful connection established, it's time to select the data streams you want to synchronize. Mage will display available streams from your SFTP source. Look for the stream corresponding to "fetch golf player data" and check the box next to it to confirm your selection.&lt;/p&gt;

&lt;p&gt;After confirming the stream, allow a moment for Mage to load all column data associated with the selected stream. This data will include metadata and actual CSV content, which is essential for setting up your data integration.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc03wmmnqjd7hvq1rgfj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc03wmmnqjd7hvq1rgfj1.png" alt="Image description" width="800" height="1499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Configuring data replication methods
&lt;/h2&gt;

&lt;p&gt;Now that your data stream is selected, you can configure how you want to replicate this data into Google BigQuery. You have two primary options: loading a full table or incrementally loading new data. For this example, we will choose the "Full Table" option, as we are working with a single file.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsvzbc1rpvdsv85elmok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsvzbc1rpvdsv85elmok.png" alt="Image description" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you can set unique conflict methods that dictate how Mage should handle records with duplicate values. You may choose to update existing records or ignore them. It’s essential to determine this based on your data requirements.&lt;/p&gt;

&lt;p&gt;Moreover, you can provide bookmarks to track sync processing, which is particularly useful for incremental loading. Finally, you may define a key property to create a primary key for your destination table if required.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mb8bh2vmlslww62v2lb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mb8bh2vmlslww62v2lb.png" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Transforming data
&lt;/h2&gt;

&lt;p&gt;Transforming data may be necessary to meet specific analytics requirements. In Mage, you can access the transformation settings through the "Transformer" button. This feature allows you to apply various transformations to your data before it’s sent to Google BigQuery. Define any transformations needed, such as data calculations, aggregations, or filtering.&lt;/p&gt;
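
&lt;p&gt;As a rough sketch of what such a transformation can look like (adapted from the template Mage generates for Python transformer blocks; the column names below are placeholders, and the exact scaffolding in your pipeline may differ), here is a block that filters rows and normalizes a column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative transformer block: filter rows and normalize a column before export.
# Column names ('ranking', 'player_name') are placeholders for your own schema.
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df: DataFrame, *args, **kwargs) -&amp;gt; DataFrame:
    df = df[df['ranking'] &amp;gt; 0].copy()  # drop rows without a valid ranking
    df['player_name'] = df['player_name'].str.strip().str.title()
    return df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;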

&lt;h2&gt;
  
  
  Step 6: Setting up Google BigQuery as destination
&lt;/h2&gt;

&lt;p&gt;Next, we need to configure Google BigQuery as the destination for your data. In the pipeline editor, select BigQuery from the destination options. This action will generate another YAML configuration file.&lt;/p&gt;

&lt;p&gt;Within this configuration, you will need to provide the path to your credentials file. This file is crucial as it authenticates your access to Google BigQuery. You can download the credentials in JSON format from your Google Cloud Platform account.&lt;/p&gt;

&lt;p&gt;Additionally, specify your project ID and dataset name. For example, if your dataset is called "golf rankings," enter that in the configuration. Ensure that your location settings are correct, typically set to "US" for the United States.&lt;/p&gt;

&lt;p&gt;With your destination configured, it’s time to test the data sync connection. Click the "Test Connection" button. A successful test indicates that your pipeline can communicate with Google BigQuery effectively.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzapz65al9uq4z0ks6xbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzapz65al9uq4z0ks6xbo.png" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Triggering the data sync
&lt;/h2&gt;

&lt;p&gt;Once your data integration pipeline is fully configured and tested, it’s time to trigger the data sync. Navigate to the "Triggers" section in the left-hand menu. Here, you can create a new trigger or run an existing one.&lt;/p&gt;

&lt;p&gt;For this demonstration, simply select the "Run One" trigger. This will initiate the data transfer process from your SFTP server to Google BigQuery. During this process, you can monitor the run's progress, and Mage provides detailed logging to help diagnose any potential issues.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftahlaqnchnb348yjclb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftahlaqnchnb348yjclb1.png" alt="Image description" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Verifying data synced to Google BigQuery
&lt;/h2&gt;

&lt;p&gt;After the sync process completes, it’s important to verify that the data has been correctly loaded into Google BigQuery. Refresh your BigQuery console and navigate to the designated dataset, such as "golf rankings." You should see a new table in this dataset where you can run a query. Simply run a COUNT(*) query on the new table to confirm that the expected data synced to BigQuery. This query returns the number of records imported, confirming that your data transfer was successful.&lt;/p&gt;
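
&lt;p&gt;If you prefer to verify from code rather than the console, the same check takes a few lines with the BigQuery Python client. This is an illustrative sketch; the project, dataset, and table names are placeholders for whatever your sync created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative verification of the synced table; project, dataset, and table names
# are placeholders. Assumes application-default credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()

query = "SELECT COUNT(*) AS row_count FROM `your_project.golf_rankings.golf_player_data`"
row = list(client.query(query).result())[0]
print(f"Rows synced: {row.row_count}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;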

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this guide, we have covered the entire process of automating data integration from an SFTP server to Google BigQuery using Mage. From setting up your SFTP connection to verifying the data in BigQuery, each step is crucial for creating a reliable data pipeline.&lt;/p&gt;

&lt;p&gt;By following these steps, you've created a scalable and efficient workflow that can also be configured to process only new or updated records, saving you both time and resources. With your data now integrated, you can focus on analyzing it to gain insights that drive your business forward.&lt;/p&gt;

&lt;p&gt;Looking to try out Mage Pro? Book your &lt;a href="https://www.mage.ai/getdemo" rel="noopener noreferrer"&gt;demo&lt;/a&gt; today!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataintegration</category>
      <category>bigquery</category>
      <category>mageai</category>
    </item>
    <item>
      <title>We're back and better than ever!</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 07 Mar 2025 19:29:01 +0000</pubDate>
      <link>https://forem.com/mage_ai/were-back-and-better-than-ever-32j0</link>
      <guid>https://forem.com/mage_ai/were-back-and-better-than-ever-32j0</guid>
      <description>&lt;p&gt;We've been quiet, but we've been busy. We've recently launched Mage Pro, our new managed service and data platform for data engineers!&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in Mage Pro?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Build, deploy, and run data pipelines in minutes&lt;/li&gt;
&lt;li&gt;Scale instantly to handle any data volume&lt;/li&gt;
&lt;li&gt;New deployment regions: US East and Australia&lt;/li&gt;
&lt;li&gt;AI assistant for quick answers and troubleshooting&lt;/li&gt;
&lt;li&gt;Improved version control and file comparison&lt;/li&gt;
&lt;li&gt;VS Code integration for seamless development&lt;/li&gt;
&lt;li&gt;Enhanced search and filtering for pipeline runs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Our open source project is still active. We've added:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Custom Python source for streaming pipelines&lt;/li&gt;
&lt;li&gt;OracleDB exporter for batch pipelines&lt;/li&gt;
&lt;li&gt;Customizable server logging templates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why choose Mage Pro?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create complex workflows easily&lt;/li&gt;
&lt;li&gt;Combine multiple data sources&lt;/li&gt;
&lt;li&gt;Transform data using Python, R, or SQL&lt;/li&gt;
&lt;li&gt;Deliver results when and where needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mage Pro offers over 100 API endpoints and granular security settings. You can tailor it to your needs.&lt;/p&gt;

&lt;p&gt;Ready to upgrade your data engineering? Try Mage Pro today: &lt;a href="https://www.mage.ai/pro" rel="noopener noreferrer"&gt;https://www.mage.ai/pro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recent updates: &lt;a href="https://www.mage.ai/updates" rel="noopener noreferrer"&gt;https://www.mage.ai/updates&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo: &lt;a href="https://www.mage.ai/getdemo" rel="noopener noreferrer"&gt;https://www.mage.ai/getdemo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What will you build with Mage Pro?&lt;/p&gt;

</description>
      <category>mageai</category>
      <category>ai</category>
      <category>datapipelines</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Understanding DBT (Data Build Tool): An Introduction</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Tue, 29 Aug 2023 05:39:29 +0000</pubDate>
      <link>https://forem.com/mage_ai/understanding-dbt-data-build-tool-an-introduction-1e43</link>
      <guid>https://forem.com/mage_ai/understanding-dbt-data-build-tool-an-introduction-1e43</guid>
      <description>&lt;p&gt;&lt;em&gt;Guest blog by Shashank Mishra, Data Engineer @ Expedia&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;DBT (Data Build Tool) is an open-source software tool that enables data analysts and engineers to transform and model data in the data warehouse. It simplifies the ETL process by focusing on the ‘T’ — transformation — and integrates seamlessly with modern cloud-based data platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Overview of DBT&lt;/li&gt;
&lt;li&gt;Core principles of DBT&lt;/li&gt;
&lt;li&gt;DBT architecture&lt;/li&gt;
&lt;li&gt;Challenges with DBT&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview of DBT
&lt;/h2&gt;

&lt;p&gt;DBT (Data Build Tool) is an open-source tool that has revolutionized the way data analysts and engineers view and handle data transformation and modeling in the modern data stack. Here’s an overview of DBT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Philosophy:&lt;/strong&gt; Focuses on the ELT (Extract, Load, Transform) approach, leveraging modern cloud data warehouses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Components:&lt;/strong&gt; Models (SQL queries that define data transformations), tests (ensure data quality by validating models), snapshots (track historical changes in data), and documentation (auto-generated for clarity on data processes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development Workflow:&lt;/strong&gt; Developer-centric with version control (typically Git), branching, and pull requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; Compiles models into SQL and runs them directly on data warehouses like Snowflake, BigQuery, and Redshift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapters:&lt;/strong&gt; Make DBT versatile by connecting to various databases and data platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdrr06f0olpbthyds0lk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdrr06f0olpbthyds0lk.gif" alt="Image description" width="480" height="360"&gt;&lt;/a&gt; (Source: &lt;a href="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExbjhodWpxMXYxOWIzNGN6cnFvdWZkanFrZG1jMXhud3VzNmF1NnZmYyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/nxC1YCvGlsLDVGZ2Vt/giphy.gif" rel="noopener noreferrer"&gt;Giphy&lt;/a&gt;)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Core principles of DBT
&lt;/h2&gt;

&lt;p&gt;DBT (Data Build Tool) operates on a set of core principles that guide its philosophy and approach to data transformation and modeling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse-Centric:&lt;/strong&gt; Raw data is ingested into the data warehouse, using its computational capabilities for in-database transformations. This principle capitalizes on modern warehouses like Snowflake, BigQuery, or Redshift for heavy computations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT Workflow:&lt;/strong&gt; Instead of pre-transforming data (ETL), DBT supports ELT, where raw data is loaded into the data warehouse (Extract, Load) and then transformed using SQL-based models (Transform).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL as the DSL:&lt;/strong&gt; DBT uses SQL as its domain-specific language. This eliminates the need for proprietary transformation languages or GUI-based ETL tools, providing direct and transparent transformation logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git-based Version Control:&lt;/strong&gt; DBT projects are typically version-controlled using Git, allowing for branch-based development, commit histories, and collaboration through pull requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Dependencies:&lt;/strong&gt; Models, written in SQL, can reference other models (the ref() function). This creates a DAG (Directed Acyclic Graph) of dependencies, which DBT uses to run models in the correct order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Testing:&lt;/strong&gt; DBT’s schema tests (e.g., unique, not_null, accepted_values) validate the integrity of the transformed data. Custom data tests can also be written in SQL to enforce specific business rules or constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jinja Templating:&lt;/strong&gt; DBT uses the Jinja2 templating engine. This allows for dynamic SQL code generation, loops, conditional logic, and macro creation for reusable SQL snippets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI and API Integration:&lt;/strong&gt; DBT’s command-line interface (CLI) supports operations like run, test, and docs generate. It can also be integrated with CI/CD tools and other platforms through APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurations &amp;amp; Hooks:&lt;/strong&gt; Technical configurations can be set at the project, model, or global level (dbt_project.yml). Pre- and post-hooks allow for operations (like data quality checks or audit trails) to be executed before or after a model runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility with Adapters:&lt;/strong&gt; DBT’s architecture allows for custom adapters. While it comes with adapters for popular data platforms, the community or organizations can develop adapters for other platforms, ensuring wide compatibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By emphasizing these technical principles and functionalities, DBT provides a powerful and flexible framework for data engineers and analysts to manage data transformations with precision and efficiency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x0dnlz0mfj345afxvgc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x0dnlz0mfj345afxvgc.gif" alt="Image description" width="480" height="480"&gt;&lt;/a&gt; (Source: &lt;a href="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExazc3bW0zb3lkamNtNDQ5ZngyM2poaXYweThncDIzbHB6Z3p1Z3B6YSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/5z0cCCGooBQUtejM4v/giphy.gif" rel="noopener noreferrer"&gt;Giphy&lt;/a&gt;)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DBT architecture
&lt;/h2&gt;

&lt;p&gt;DBT (Data Build Tool) employs a unique architecture that sets it apart from traditional ETL tools and frameworks. At its core, DBT is a command-line tool that uses SQL and Jinja2 templating to transform and model data. Let’s break down its architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Command-Line Interface (CLI):&lt;/strong&gt; DBT is primarily operated through its command-line interface, allowing users to run commands for transformations (dbt run), testing (dbt test), and documentation generation (dbt docs generate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL + Jinja2 Templating:&lt;/strong&gt; By combining SQL with the Jinja2 templating engine, DBT allows for dynamic SQL code generation. This lets users incorporate loops, conditional logic, and macros into their transformation logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects and Configuration:&lt;/strong&gt; The DBT project is the foundational unit in DBT; it contains models, tests, snapshots, macros, and the essential dbt_project.yml configuration file. Configuration files (dbt_project.yml, profiles.yml, etc.) define project details, model configurations, and database connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models &amp;amp; Directed Acyclic Graph (DAG):&lt;/strong&gt; Models are SQL files that represent the transformation logic. DBT builds a DAG of model dependencies using the ref() function in models; the DAG determines the execution order when running transformations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapters:&lt;/strong&gt; DBT uses adapters to connect and interface with different data platforms, like Snowflake, BigQuery, and Redshift. Adapters translate DBT’s generic SQL into database-specific SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Framework:&lt;/strong&gt; DBT supports both built-in tests (like unique or not_null) and custom tests defined in SQL, ensuring data quality and conformity to business rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control Integration:&lt;/strong&gt; DBT projects are typically stored in Git repositories, enabling collaboration, versioning, and branching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; DBT automatically generates a web-based documentation portal that visualizes model metadata, lineage, and descriptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugins and Extensibility:&lt;/strong&gt; DBT’s architecture allows for extensions, and the community has contributed various plugins, adding functionality and compatibility with other tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Environment:&lt;/strong&gt; Unlike ETL tools that may have their own computation engines, DBT compiles and runs SQL directly in the target data warehouse, leveraging its computational power for transformations.&lt;/li&gt;
&lt;/ul&gt;
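
&lt;p&gt;As a rough illustration of the CLI-centric workflow (assuming dbt-core 1.5 or later, which exposes a programmatic entry point, and a script run from inside a DBT project with a valid profiles.yml), the same run and test commands can be invoked from Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch of dbt's programmatic entry point (dbt-core 1.5 and later),
# equivalent to running `dbt run` and `dbt test` from the command line.
from dbt.cli.main import dbtRunner

dbt = dbtRunner()

run_result = dbt.invoke(["run"])    # compile models and execute them in DAG order
print("run succeeded:", run_result.success)

test_result = dbt.invoke(["test"])  # schema and data tests against the built models
print("tests succeeded:", test_result.success)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each invocation compiles the models into SQL and executes them in the target warehouse, exactly as the equivalent CLI command would.&lt;/p&gt;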

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijdw07n83z0rxp2moy3h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijdw07n83z0rxp2moy3h.gif" alt="Image description" width="480" height="274"&gt;&lt;/a&gt; (Source: &lt;a href="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExMGRyMzVmdDkzNm5hcm05Z2p1am53cmwxOWZ3ZzczaDZ3cXEzODFxNiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/em4i0bDs9Hm2Q/giphy.gif" rel="noopener noreferrer"&gt;Giphy&lt;/a&gt;)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with DBT
&lt;/h2&gt;

&lt;p&gt;While DBT (Data Build Tool) has gained substantial popularity due to its approach to data transformation, it is not without its technical challenges, especially when viewed in the context of the broader data pipeline design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Data Ingestion:&lt;/strong&gt; DBT focuses mainly on the transformation (T) part of the ELT process. The extraction (E) and load (L) phases are out of its scope, requiring other tools or manual setups to ingest data into the data warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Dependency Management:&lt;/strong&gt; As DBT projects grow, managing model dependencies (the DAG) can become complex. Ensuring models run in the right order without causing circular dependencies is crucial and can be challenging in large projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Considerations:&lt;/strong&gt; Relying on the computational power of the data warehouse for transformations can lead to increased costs, especially if not optimized. Some transformations might also be less efficient in SQL compared to other data processing languages or tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency and Parallelism:&lt;/strong&gt; Handling concurrent DBT runs or ensuring that parallel transformations don’t interfere with each other can be challenging. There’s a need to fine-tune data warehouse configurations and manage resource contention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Processing:&lt;/strong&gt; While DBT supports incremental models, designing them effectively requires careful consideration to ensure data integrity and avoid data duplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Data Processing:&lt;/strong&gt; DBT is batch-oriented by design. Real-time or near-real-time data processing pipelines might need additional tools or configurations outside of DBT’s standard capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with External Tools:&lt;/strong&gt; DBT’s ecosystem is primarily SQL-focused. Integrating with non-SQL tools or platforms might require additional effort or custom plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Monitoring and Alerting:&lt;/strong&gt; Out of the box, DBT does not provide comprehensive monitoring or alerting mechanisms for transformations. Integration with monitoring tools or building custom alert systems might be necessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling:&lt;/strong&gt; Granular error handling, especially for non-fatal issues, can be complex. DBT will fail a run if a model encounters an error, requiring manual intervention or a robust orchestration tool to manage failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and Compliance:&lt;/strong&gt; Ensuring that DBT processes adhere to data governance, security, and compliance requirements might necessitate additional configurations, especially when working with sensitive data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; As data volume grows, some DBT models might need refactoring or optimization to maintain performance. This requires ongoing maintenance and tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm3yzlagmwvi3u0ghgq9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm3yzlagmwvi3u0ghgq9.gif" alt="Image description" width="400" height="400"&gt;&lt;/a&gt; (Source: &lt;a href="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExcmp3NW9iZzh1OHBqbHBqeTQxcHVycXpuZ2p4NHlkcDM2M2ptdTEzcyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/pqcWLv9btMBin5T92o/giphy.gif" rel="noopener noreferrer"&gt;Giphy&lt;/a&gt;)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the ever-evolving landscape of data processing and analytics, DBT emerges as a powerful tool that merges software engineering best practices with data operations. Its ELT-centric approach, modular design, and emphasis on code and collaboration make it an attractive solution for modern data teams.&lt;/p&gt;

&lt;p&gt;Yet, like any tool, it is not without its challenges. Factors like dependency management, real-time processing, and scalability require thoughtful consideration in the broader context of data pipeline design.&lt;/p&gt;

&lt;p&gt;With proper planning and awareness of its intricacies, DBT can be a pivotal element in a data team’s toolkit, driving efficiency, transparency, and reliability in data transformations. As with all tools, a balance of its strengths against its challenges is essential in leveraging its full potential effectively.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/understanding-dbt-data-build-tool-an-introduction" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/understanding-dbt-data-build-tool-an-introduction&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Integration: Google BigQuery with Mage</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 21 Jul 2023 19:01:33 +0000</pubDate>
      <link>https://forem.com/mage_ai/data-integration-google-bigquery-with-mage-461p</link>
      <guid>https://forem.com/mage_ai/data-integration-google-bigquery-with-mage-461p</guid>
      <description>&lt;p&gt;Guest blog by &lt;a href="https://www.linkedin.com/in/shashank219/" rel="noopener noreferrer"&gt;Shashank Mishra&lt;/a&gt;, Data Engineer @ &lt;em&gt;Expedia&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;This article outlines the integration between &lt;a href="https://www.mage.ai" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt;, a serverless data warehousing service. We’ll discuss the integration process, its benefits, and how it aids businesses in making data-driven decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Mage&lt;/li&gt;
&lt;li&gt;Overview of Google BigQuery&lt;/li&gt;
&lt;li&gt;Step by step process to integrate Google BigQuery with Mage&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Mage
&lt;/h2&gt;

&lt;p&gt;In an age where data is the new oil, efficient and reliable data management tools are essential. &lt;a href="https://www.mage.ai/github" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; is a platform committed to simplifying data integration and analytics. Designed for seamless data transformation and loading, Mage is transforming how businesses approach data management. Here are its key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Automated Data Pipeline:&lt;/em&gt; Mage automates data extraction, transformation, and loading (ETL) processes. It can extract data from multiple sources, transform it to a desirable format, and load it into a data warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Connectors:&lt;/em&gt; Mage offers various data connectors to widely-used data sources like Shopify, Facebook Ads, Google Ads, Google Analytics, etc. This makes it easier to import data from these platforms.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Easy Integration:&lt;/em&gt; Mage provides easy integration with popular data warehouses including Google BigQuery, Amazon Redshift, and Snowflake.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Pre-built SQL Models:&lt;/em&gt; Mage comes with pre-built SQL models for popular e-commerce platforms like Shopify and WooCommerce. These models simplify the process of data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Incremental Loading:&lt;/em&gt; Mage supports incremental loading, which means only new or updated data is loaded into the data warehouse. This saves storage space and improves efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Transformations:&lt;/em&gt; Mage performs automatic data transformations, converting raw data into a more usable format. This process makes the data ready for analysis and reporting.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scheduled Refresh:&lt;/em&gt; Data refreshes can be scheduled in Mage, ensuring that the data in the warehouse is always up-to-date.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Security:&lt;/em&gt; Mage places a high emphasis on data security, ensuring data privacy and compliance with GDPR and other data protection regulations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2zcjvhdhi10cvjtp9gl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2zcjvhdhi10cvjtp9gl.gif" alt="Image description" width="480" height="263"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Google BigQuery
&lt;/h2&gt;

&lt;p&gt;Google BigQuery is a highly scalable, serverless data warehouse offered by Google as part of its Google Cloud Platform (GCP). It is designed to streamline and simplify the processing of big data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Serverless Architecture:&lt;/em&gt; BigQuery operates on a serverless model, which means users don’t need to manage any servers or infrastructure. This means you can focus more on analysis and less on maintenance. It allows you to query massive datasets in seconds and get insights in real time, without needing to worry about resource provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Real-Time Analytics:&lt;/em&gt; BigQuery is engineered for real-time analytics. It allows users to analyze real-time data streams instantly. With its ability to run SQL queries on petabytes of data, it delivers speedy results on real-time data analytics, enabling businesses to make timely decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google BigQuery, with its serverless architecture and real-time analytics, serves as a robust platform to handle, analyze, and draw insights from massive datasets with ease.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptiv6r03koy4n1uu1nvu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptiv6r03koy4n1uu1nvu.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step process to integrate Google BigQuery with Mage
&lt;/h2&gt;

&lt;p&gt;Before we begin, we’ll need to create a service account key. Please read &lt;a href="https://cloud.google.com/iam/docs/keys-create-delete" rel="noopener noreferrer"&gt;Google Cloud’s documentation&lt;/a&gt; on how to create that.&lt;/p&gt;

&lt;p&gt;Once we are finished, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new pipeline or open an existing pipeline.&lt;/li&gt;
&lt;li&gt;Expand the left side of the screen to view the file browser.&lt;/li&gt;
&lt;li&gt;Scroll down and click on a file named &lt;strong&gt;io_config.yaml&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enter the following keys and values under the key named default (we can have multiple profiles; add them under whichever profile is relevant for us).&lt;/li&gt;
&lt;li&gt;Note: we only need to add the keys under &lt;strong&gt;GOOGLE_SERVICE_ACC_KEY&lt;/strong&gt; or the value for the key &lt;strong&gt;GOOGLE_SERVICE_ACC_KEY_FILEPATH&lt;/strong&gt; (both are not required simultaneously).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 0.1.1
default:
  GOOGLE_SERVICE_ACC_KEY:
    type: service_account
    project_id: project-id
    private_key_id: key-id
    private_key:
      "-----BEGIN PRIVATE KEY-----\nyour_private_key\n-----END_PRIVATE_KEY"
    client_email: your_service_account_email
    auth_uri: "https://accounts.google.com/o/oauth2/auth"
    token_uri: "https://accounts.google.com/o/oauth2/token"
    auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
    client_x509_cert_url: "https://www.googleapis.com/robot/v1/metadata/x509/your_service_account_email"
  GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/path/to/your/service/account/key.json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using SQL block&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new pipeline or open an existing pipeline.&lt;/li&gt;
&lt;li&gt;Add a data loader, transformer, or data exporter block.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;SQL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Under the &lt;strong&gt;Data provider&lt;/strong&gt; dropdown, select &lt;strong&gt;BigQuery&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Under the &lt;strong&gt;Profile&lt;/strong&gt; dropdown, select &lt;strong&gt;default&lt;/strong&gt; (or the profile we added credentials underneath).&lt;/li&gt;
&lt;li&gt;Next to the &lt;strong&gt;Database&lt;/strong&gt; label, enter the database name we want this block to save data to.&lt;/li&gt;
&lt;li&gt;Next to the &lt;strong&gt;Save to schema&lt;/strong&gt; label, enter the schema name we want this block to save data to.&lt;/li&gt;
&lt;li&gt;Under the &lt;strong&gt;Write policy&lt;/strong&gt; dropdown, select &lt;strong&gt;Replace&lt;/strong&gt; or &lt;strong&gt;Append&lt;/strong&gt; (please see &lt;a href="https://docs.mage.ai/guides/sql-blocks" rel="noopener noreferrer"&gt;SQL blocks guide&lt;/a&gt; for more information on write policies).&lt;/li&gt;
&lt;li&gt;Enter in this test query: &lt;strong&gt;SELECT 1&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run the block.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Using Python block&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new pipeline or open an existing pipeline.&lt;/li&gt;
&lt;li&gt;Add a data loader, transformer, or data exporter block (the code snippet below is for a data loader).&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Generic&lt;/strong&gt; (no template).&lt;/li&gt;
&lt;li&gt;Enter this code snippet (note: change the &lt;strong&gt;config_profile&lt;/strong&gt; from default if we have a different profile):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from os import path
from pandas import DataFrame
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@data_loader
def load_data_from_big_query(**kwargs) -&amp;gt; DataFrame:
    query = 'SELECT 1'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    return BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).load(query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Run the block.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ijcmqoiwtlxyyntduw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ijcmqoiwtlxyyntduw.gif" alt="Image description" width="480" height="480"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Integrating Mage with Google BigQuery provides your team with a potent combination of automated data pipeline management and robust data warehousing. This partnership not only simplifies data extraction, transformation, and loading but also provides a seamless pathway for data analysis and insight generation. As we’ve demonstrated in this step-by-step guide, the integration process is straightforward, making it an accessible option for businesses of all sizes. By leveraging this integration, you can unlock the full potential of your data, streamline operations, and drive data-informed decisions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/data-integration-google-bigquery-with-mage" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/data-integration-google-bigquery-with-mage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Google BigQuery: Serverless data warehousing made simple</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 21 Jul 2023 18:44:08 +0000</pubDate>
      <link>https://forem.com/mage_ai/google-bigquery-serverless-data-warehousing-made-simple-3em9</link>
      <guid>https://forem.com/mage_ai/google-bigquery-serverless-data-warehousing-made-simple-3em9</guid>
      <description>&lt;p&gt;&lt;em&gt;Guest blog by &lt;a href="https://www.linkedin.com/in/shashank219/" rel="noopener noreferrer"&gt;Shashank Mishra&lt;/a&gt;, Data Engineer @ Expedia&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; is a serverless, scalable data warehouse on Google Cloud. It supports real-time analytics, machine learning, and GIS capabilities. With its unique architecture separating storage and computing, it offers automatic scalability and strong security, ideal for data-driven businesses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Google BigQuery&lt;/li&gt;
&lt;li&gt;Key features of Google BigQuery&lt;/li&gt;
&lt;li&gt;BigQuery’s Unique Architecture&lt;/li&gt;
&lt;li&gt;Benefits of Using BigQuery&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Google BigQuery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; is a highly scalable, serverless data warehouse offered by Google as part of its Google Cloud Platform (GCP). It is designed to streamline and simplify the processing of big data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Serverless Architecture:&lt;/em&gt; BigQuery operates on a serverless model, which means users don’t need to manage any servers or infrastructure. This lets teams focus on data analysis rather than on capacity planning or server maintenance. It allows you to query massive datasets in seconds and get insights in real time, without needing to worry about resource provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Real-Time Analytics:&lt;/em&gt; BigQuery is engineered for real-time analytics. It allows users to analyze real-time data streams instantly. With its ability to run SQL queries on gigabytes to petabytes of data, it delivers speedy results on real-time data analytics, enabling businesses to make timely decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, Google BigQuery, with its serverless architecture and real-time analytics, serves as a robust platform to handle, analyze, and draw insights from massive datasets with ease.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqng98sdss78ja2pkwf5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqng98sdss78ja2pkwf5.gif" alt="Image description" width="480" height="358"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of Google BigQuery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; offers a robust set of features that make it an ideal choice for businesses looking to leverage data for actionable insights. These features extend from machine learning capabilities and geospatial analytics to multi-cloud data analysis and automated data transfer services. These cutting-edge functionalities position BigQuery as a powerful tool in the data analytics landscape. Let’s dive into some of these key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine Learning Integration: Google BigQuery provides built-in machine learning capabilities, enabling data scientists to create and execute machine learning models on structured and semi-structured data directly inside BigQuery using SQL. This ML integration allows users to build models with the ease of SQL commands, eliminating the need to move data across different environments or learn a new language.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GIS Capabilities: BigQuery GIS, or Geo Viz, allows analysts to manage and analyze geospatial data in BigQuery by providing SQL geographic functions. These functions make it easier to understand spatial relationships and provide insights about geographic-based data that are critical for businesses, like determining delivery routes, analyzing service coverage areas, and much more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BI Engine: BigQuery BI Engine is a fast, in-memory analysis service that allows users to analyze data stored in BigQuery with sub-second query response time and high concurrency. Integrated with popular tools like Google Data Studio, it enables analysts and data scientists to create interactive dashboards and reports without any performance latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BigQuery Omni: BigQuery Omni is a multi-cloud data analytics solution that allows users to execute BigQuery’s powerful analytics capabilities on data stored not just in Google Cloud, but also AWS and Azure. This means you can break down data silos and gain insights across different cloud platforms without having to move or copy data, enabling a truly multi-cloud data analytics approach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BigQuery Data Transfer Service: The BigQuery Data Transfer Service automates data movement from SaaS applications to Google BigQuery on a scheduled, managed basis. This allows businesses to maintain an updated data warehouse without the hassle of writing custom scripts or manually importing data, simplifying data ingestion and ensuring that data is readily available for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In essence, Google BigQuery provides a comprehensive suite of tools and capabilities that not only simplify data warehousing tasks but also empower businesses to draw actionable insights from their data.&lt;/p&gt;
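
&lt;p&gt;To make the machine learning integration concrete, here is a small, illustrative sketch: BigQuery ML models are created and evaluated with plain SQL, which can be submitted through the official Python client. The project, dataset, table, and column names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative BigQuery ML usage via the Python client; all names are placeholders.
# Assumes application-default credentials and an existing source table.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE MODEL `your_project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_months, monthly_spend
    FROM `your_project.analytics.customers`
""").result()  # wait for training to finish

for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `your_project.analytics.churn_model`)"
).result():
    print(dict(row))  # precision, recall, accuracy, and other metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;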

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsit28pw8umvb5m9r3ev.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsit28pw8umvb5m9r3ev.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Google BigQuery’s unique architecture
&lt;/h2&gt;

&lt;p&gt;At its core, Google BigQuery’s architecture is a manifestation of Google’s Dremel technology. Dremel is a highly scalable, interactive ad-hoc query system for the analysis of read-only nested data, and BigQuery utilizes this technology to execute SQL-like queries over multi-terabyte datasets in seconds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Dremel-Inspired Architecture:&lt;/em&gt; BigQuery’s Dremel-inspired architecture allows it to deliver incredibly fast analytics on a petabyte scale. By creating a tree architecture for dispatching queries and aggregating results, Dremel enables BigQuery to scan trillions of rows in seconds and return results in a blink. This architecture uses a combination of columnar storage for data organization and tree architecture for query execution, allowing BigQuery to run SQL queries on large datasets swiftly.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Separation of Compute and Storage:&lt;/em&gt; A fundamental design principle of BigQuery is the decoupling of compute and storage. The data you store in BigQuery is kept in a multi-tenant distributed architecture, separated from the computational resources. This separation allows for nearly infinite scalability: as your data grows, BigQuery scales to meet your storage needs without any intervention, and you can ramp up query computing power as needed without being limited by your data size.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Compute Resources:&lt;/em&gt; When you run a query, BigQuery dynamically allocates computing resources as needed. This serverless model means that you don’t have to worry about pre-provisioning compute capacity, and you only pay for the queries you run.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Storage Layer:&lt;/em&gt; On the storage side, BigQuery automatically replicates data for durability and high availability. It also handles all ongoing maintenance, including patches and upgrades. Data in BigQuery is stored in Capacitor, Google’s next-generation columnar storage format, which is highly compressed and optimized for reading large amounts of structured data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google BigQuery’s unique architecture, inspired by Dremel, and its separation of compute and storage lead to high-speed query performance, automatic scalability, and strong data security, thereby making it an efficient data warehouse solution for businesses of all sizes.&lt;/p&gt;
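
&lt;p&gt;A short example makes the serverless, pay-per-query model concrete: with the official Python client there is no cluster to size or provision; you simply submit SQL and pay for the query itself. The snippet below is illustrative, assumes application-default credentials, and queries one of Google’s public datasets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative serverless query against a Google public dataset.
# Compute is allocated per query and billed per query; nothing to provision.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():  # blocks until the job completes
    print(row.name, row.total)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;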

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67p8cudccdx018m5tpkr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67p8cudccdx018m5tpkr.gif" alt="Image description" width="400" height="400"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of using Google BigQuery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; provides a number of benefits that make it a compelling choice for businesses of all sizes, from startups to large enterprises, who are looking to derive insights from their data. These benefits stem from BigQuery’s serverless architecture, automatic scalability, strong security features, and other business benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Serverless Data Warehousing:&lt;/em&gt; As a serverless solution, BigQuery eliminates the need for businesses to manage, administer, or tune any infrastructure, saving them time and resources. This allows businesses to focus on what truly matters — deriving insights from their data and using them to make informed business decisions.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Automatic Scalability:&lt;/em&gt; BigQuery scales automatically to accommodate your data and workloads. Its architecture separates storage and computation, enabling each to scale independently. This ensures that the system can handle any volume of data and any number of queries while maintaining high performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Strong Security Features:&lt;/em&gt; BigQuery is designed with a robust security model that integrates with other Google Cloud security tools. It offers data encryption at rest and in transit, identity and access management, and a host of other security features that help businesses protect their sensitive data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Business Benefits:&lt;/em&gt; Beyond the technical features, BigQuery offers tangible business benefits. It provides real-time insights that enable businesses to make timely decisions, improving operational efficiency and enabling new opportunities. It also reduces costs, as businesses only pay for the storage they use and the queries they run, making BigQuery a cost-effective solution for data warehousing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7nbflwzqoaa30ghcji.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7nbflwzqoaa30ghcji.gif" alt="Image description" width="500" height="281"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, Google BigQuery stands out as a robust, serverless data warehouse in the Google Cloud Platform. Its unique Dremel-inspired architecture supports immense scalability and swift, real-time analytics. With features like machine learning integration, GIS capabilities, and multi-cloud data analytics, it equips businesses to derive critical insights from massive datasets efficiently and securely. BigQuery simplifies data management, providing a potent solution for data-driven decision-making in the ever-evolving digital landscape.&lt;/p&gt;

&lt;p&gt;In episode 4 of the data warehouse series, we’ll explore how to integrate data warehousing services like Snowflake, Redshift, and Google BigQuery with Mage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/google-bigquery-serverless-data-warehousing-made-simple" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/google-bigquery-serverless-data-warehousing-made-simple&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataanalytics</category>
      <category>googlebigquery</category>
      <category>datascience</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Snowflake: Revolutionizing data warehousing</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 21 Jul 2023 18:34:25 +0000</pubDate>
      <link>https://forem.com/mage_ai/snowflake-revolutionizing-data-warehousing-1dkm</link>
      <guid>https://forem.com/mage_ai/snowflake-revolutionizing-data-warehousing-1dkm</guid>
      <description>&lt;p&gt;Guest blog by &lt;a href="https://www.linkedin.com/in/shashank219/" rel="noopener noreferrer"&gt;Shashank Mishra&lt;/a&gt;, Data Engineer @ &lt;em&gt;Expedia&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;  is a cloud-based data warehousing platform that brings a new level of performance, simplicity, and affordability to businesses that require big data processing and analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to snowflake&lt;/li&gt;
&lt;li&gt;Key features of snowflake&lt;/li&gt;
&lt;li&gt;Snowflake’s unique architecture&lt;/li&gt;
&lt;li&gt;Benefits of using snowflake&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to snowflake
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; is a powerful, cloud-based data warehousing platform known for its unique, flexible architecture. By separating compute and storage resources, it offers scalable, efficient, and cost-effective data management. Snowflake eliminates the complexity of traditional data warehouses, offering a user-friendly, fully-managed solution. It supports various data formats and integrates well with diverse data processing tools and BI software. With robust security measures including encryption and role-based access control, Snowflake ensures data safety. Essentially, it empowers organizations to be data-driven, delivering a powerful and simple-to-use data warehouse solution in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz377sl4a58dczuzhc9wr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz377sl4a58dczuzhc9wr.gif" alt="Image description" width="168" height="168"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of snowflake
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; is a powerful data warehousing platform that incorporates a broad set of capabilities designed to make data storage, retrieval, and analysis more efficient, flexible, and scalable. Let’s dive into some of the prime features that make Snowflake a standout choice in the realm of cloud data platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Elastic Scalability:&lt;/em&gt; Snowflake enables you to scale up or down instantaneously. It can handle any volume of data, the number of users, or the complexity of queries without compromising performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Zero Management:&lt;/em&gt; Snowflake is a fully-managed service that requires no management from your end, such as indexing or tuning, and it handles all infrastructure, optimization, availability, data protection, and more.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Multi-Cloud Platform:&lt;/em&gt; Snowflake can run on multiple clouds, including AWS, Google Cloud, and Azure. This cross-cloud capability allows businesses to leverage the advantages of different cloud providers.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Sharing:&lt;/em&gt; Snowflake allows you to share live, ready-to-query data across your organization, with partners, or even with your customers, securely and in real time.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Performance and Speed:&lt;/em&gt; Snowflake’s unique architecture offers excellent query performance and allows for quick data retrieval, empowering businesses with real-time insights.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Security:&lt;/em&gt; Snowflake offers robust security features, including automatic data encryption, network policies, and role-based access control to protect your data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Support for Structured and Semi-Structured Data:&lt;/em&gt; Snowflake natively supports JSON, Avro, XML, ORC, and Parquet, allowing you to work with various data formats in a flexible and straightforward manner.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Time Travel:&lt;/em&gt; Snowflake’s Time Travel feature enables access to historical data at any point in the past, providing easy data recovery and audit capabilities (illustrated in the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Automatic Concurrency Scaling:&lt;/em&gt; During high demand, Snowflake automatically spins up additional computing resources to ensure consistent, high-speed performance for all users and queries.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;In-Database Machine Learning:&lt;/em&gt; Snowflake supports in-database machine learning, allowing you to train models directly where your data resides, reducing data movement and improving security and efficiency.&lt;/li&gt;
&lt;/ul&gt;
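
&lt;p&gt;As a quick illustration of a couple of the features above (standard SQL access and Time Travel), here is a minimal sketch using the &lt;code&gt;snowflake-connector-python&lt;/code&gt; library; the account, credentials, warehouse, and the &lt;code&gt;orders&lt;/code&gt; table are hypothetical placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import snowflake.connector

# Hypothetical connection parameters -- replace with your own account details.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)
cur = conn.cursor()

# Ordinary query against a (hypothetical) orders table.
cur.execute("SELECT COUNT(*) FROM orders")
print("Rows now:", cur.fetchone()[0])

# Time Travel: read the same table as it looked one hour ago.
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET =&gt; -3600)")
print("Rows one hour ago:", cur.fetchone()[0])

cur.close()
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;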

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuoom162mqp0z2dl08xfm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuoom162mqp0z2dl08xfm.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake’s unique architecture
&lt;/h2&gt;

&lt;p&gt;Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing architectures with an additional layer of cloud services. This three-tier architecture consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage Layer:&lt;/strong&gt; The base layer of Snowflake’s architecture is the database storage layer. It manages all aspects of data storage in Snowflake.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Cloud Agnostic Storage:&lt;/em&gt; Snowflake can store an unlimited amount of structured and semi-structured data across multiple cloud platforms. It can run on AWS, Google Cloud, or Azure.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Automatic Organization:&lt;/em&gt; Data is automatically divided into micro-partitions when loaded into Snowflake. These micro-partitions are columnar and compressed for optimal storage and query performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Immutable Data:&lt;/em&gt; Once written, data in Snowflake is immutable, which provides the ability to access data at any point in time, a feature known as ‘Time Travel’.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Layer:&lt;/strong&gt; The second layer is the compute layer, known as virtual warehouses. This layer is responsible for executing queries on the data.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Elasticity and Separation of Compute and Storage:&lt;/em&gt; Virtual warehouses are independent compute resources that do not share CPU, memory, or storage, enabling them to scale up or down instantaneously based on workload, ensuring optimal performance without any contention.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Multi-cluster Warehouses:&lt;/em&gt; For large concurrent workloads, Snowflake can automatically scale out queries across multiple compute clusters to maintain performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Services Layer:&lt;/strong&gt; The top layer is the cloud services layer. It coordinates and manages all aspects of Snowflake’s functionality.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Security and Access Control:&lt;/em&gt; This layer handles tasks such as user authentication, session management, access control, and encryption.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Metadata Management:&lt;/em&gt; Snowflake automatically maintains detailed metadata about all objects in the system, including data files, table structures, and data statistics.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Query Optimization and Execution:&lt;/em&gt; The cloud services layer optimizes and executes all SQL queries. It compiles SQL statements into low-level code that’s executed on virtual warehouses.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Transactions and ACID Compliance:&lt;/em&gt; Snowflake supports fully ACID-compliant transactions, ensuring data consistency and reliability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In essence, Snowflake’s unique architecture enables a highly efficient, flexible, and scalable data processing environment, making it a powerful choice for organizations seeking to leverage data for business insights.&lt;/p&gt;
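
&lt;p&gt;To see what the compute layer’s elasticity looks like in practice, here is a hedged sketch of the kind of warehouse DDL involved, issued through the same Python connector as the earlier sketch; the warehouse name and sizes are illustrative, the connection details are placeholders, and a role with warehouse-management privileges is assumed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import snowflake.connector

# Hypothetical connection parameters -- see the earlier sketch.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1", user="admin", password="********"
)
cur = conn.cursor()

# Create an extra-small virtual warehouse that suspends itself when idle,
# so compute is only billed while queries are actually running.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
""")

# Scale the same warehouse up for a heavy workload -- storage is untouched,
# only the size of the compute cluster changes.
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Let it scale out across multiple clusters for high concurrency
# (multi-cluster warehouses assume an edition that supports them).
cur.execute("ALTER WAREHOUSE reporting_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3")

cur.close()
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;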

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj19s0k9uefuqg2zboyw9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj19s0k9uefuqg2zboyw9.gif" alt="Image description" width="480" height="262"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of using snowflake
&lt;/h2&gt;

&lt;p&gt;Snowflake offers numerous advantages that make it a highly effective solution for data warehousing. These benefits, spanning from operational efficiency to strategic decision-making, are designed to cater to both technical needs and business objectives, providing an edge in today’s data-driven landscape.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Seamless Data Integration:&lt;/em&gt; Snowflake integrates effortlessly with existing data management tools, ETL/ELT solutions, and business intelligence platforms. This allows organizations to continue using their preferred tools while leveraging Snowflake’s powerful data warehousing capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Multi-Cloud and Cross-Cloud Capabilities:&lt;/em&gt; Snowflake isn’t tied to a single cloud provider. You can use it on AWS, Google Cloud, or Azure, giving you the flexibility to choose your preferred cloud vendor, leverage multi-cloud strategies, or even migrate between them.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Disaster Recovery:&lt;/em&gt; The platform’s ability to replicate data across cloud regions helps in achieving a robust disaster recovery strategy, mitigating the risk of data loss and ensuring business continuity.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Democratizing Data:&lt;/em&gt; Snowflake empowers organizations to democratize their data by making it accessible for stakeholders across the organization. The increased availability of data for business users can drive data-driven decisions at all levels of the organization.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Collaboration and Data Exchanges:&lt;/em&gt; Snowflake data exchange allows organizations to share live data with their business partners, creating collaborative opportunities and enabling more informed decision-making across the business ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Reduced Total Cost of Ownership (TCO):&lt;/em&gt; With its fully managed services, Snowflake reduces the need for extensive in-house data management and infrastructure, bringing down the total cost of ownership. The resources saved can be utilized for business-critical operations and innovation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Resource Optimization:&lt;/em&gt; Snowflake’s separate compute and storage resources allow organizations to optimize resource usage based on their specific needs. This not only enhances performance but also results in cost savings by ensuring resources are not wasted.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Business Agility:&lt;/em&gt; With its robust features, scalability, and ease of use, Snowflake empowers businesses to be more agile. Organizations can rapidly adapt to changes, whether it’s increased demand, new data sources, or evolving business needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7c2lae0kjx7murg0dkb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7c2lae0kjx7murg0dkb.gif" alt="Image description" width="480" height="265"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Snowflake is a comprehensive data warehousing solution designed for the cloud era. Its unique architecture and suite of features empower organizations to handle vast amounts of data with ease, speed, and flexibility. From effortless integration with existing tools to unparalleled scalability, and from secure real-time data sharing to cost-effective operations, Snowflake offers a transformative approach to data management. It ensures that businesses of all sizes and industries can leverage data effectively to derive valuable insights, make informed decisions, and ultimately, drive growth and innovation in an increasingly data-centric world. Whether you’re a small business looking to harness the power of data or a large enterprise aiming to optimize your data operations, Snowflake stands as a compelling choice in the realm of cloud data warehousing.&lt;/p&gt;

&lt;p&gt;In episode 3 of the data warehouse series, we’ll explore Google BigQuery.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/snowflake-revolutionizing-data-warehousing" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/snowflake-revolutionizing-data-warehousing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>snowflake</category>
      <category>cloudcomputing</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>AWS Redshift: Robust and Scalable Data Warehousing</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Wed, 14 Jun 2023 21:58:05 +0000</pubDate>
      <link>https://forem.com/mage_ai/aws-redshift-robust-and-scalable-data-warehousing-26ep</link>
      <guid>https://forem.com/mage_ai/aws-redshift-robust-and-scalable-data-warehousing-26ep</guid>
      <description>&lt;p&gt;&lt;em&gt;Guest blog by Shashank Mishra, Data Engineer @ Expedia&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; is a powerful, scalable data warehousing service within the AWS ecosystem. It excels in handling large datasets with its columnar storage, parallel query execution, and features like Redshift Spectrum and RA3 instances. Redshift’s clustered architecture, robust security, and integration with AWS services make it a go-to choice for businesses needing efficient and secure data management solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to AWS Redshift&lt;/li&gt;
&lt;li&gt;Key Features of AWS Redshift&lt;/li&gt;
&lt;li&gt;Redshift Architecture&lt;/li&gt;
&lt;li&gt;Benefits and Use Cases&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to AWS Redshift
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; is a fully managed, petabyte-scale data warehousing service in the cloud, part of the expansive Amazon Web Services (AWS) ecosystem. As organizations today deal with astronomical amounts of data, they require efficient tools to store, retrieve, and analyze this data. Redshift is AWS’s answer to this growing need.&lt;/p&gt;

&lt;p&gt;Designed for high-performance analysis of large datasets, Redshift allows businesses to run complex, data-heavy queries against big data sets, with results returned in seconds. It leverages columnar storage technology and parallel queries to quickly process data across multiple nodes.&lt;/p&gt;

&lt;p&gt;The service is integrated with other AWS services, making it a natural choice for organizations already invested in the AWS infrastructure. With its scalability, speed, and integration capabilities, AWS Redshift opens the door to cost-effective big data analytics, helping businesses leverage their data for actionable insights.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknos08asjygjldq1nd8a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknos08asjygjldq1nd8a.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of AWS Redshift
&lt;/h2&gt;

&lt;p&gt;Amazon Redshift packs a number of unique features designed to provide reliable, scalable, and fast data warehousing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Redshift Spectrum:&lt;/em&gt; This feature allows users to run queries directly against vast amounts of data stored in Amazon S3. You don’t need to import or load the data, and you can use the same SQL-based interface you use for your regular Redshift queries.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Lake Integration:&lt;/em&gt; AWS Redshift can directly query and analyze data across your operational databases, data warehouse, and data lake. This gives you the ability to understand the complete picture using all your data without moving it around.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Concurrency Scaling:&lt;/em&gt; This feature enhances performance by adding more query processing power when you need it. As demand for data processing increases, Redshift automatically adds additional capacity to handle that demand, allowing multiple queries to run concurrently without any decrease in performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;RA3 instances:&lt;/em&gt; RA3 instances let you size your cluster based primarily on your compute needs. They feature managed storage, meaning Redshift will automatically manage your data from high-performance SSDs to S3 as per the workload demand.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Advanced Data Compression:&lt;/em&gt; AWS Redshift employs columnar storage technology, which minimizes the amount of data read from the disk, and advanced compression techniques that require less space compared to traditional relational databases.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Encryption:&lt;/em&gt; Redshift provides robust security through automatic encryption for data at rest and in transit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By offering these key features, AWS Redshift delivers a flexible, powerful, and efficient solution for data warehousing and analytics.&lt;/p&gt;
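
&lt;p&gt;As a rough sketch of how these capabilities are exercised from code, the snippet below uses boto3’s Redshift Data API to submit SQL without managing connections; the cluster identifier, database, user, and &lt;code&gt;clickstream&lt;/code&gt; table are hypothetical, and suitable IAM permissions are assumed. A Spectrum external table, once its external schema is defined, would be queried the same way.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import boto3

# Hypothetical cluster, database, and user -- replace with your own.
client = boto3.client("redshift-data", region_name="us-west-2")

resp = client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT event_type, COUNT(*) FROM clickstream GROUP BY event_type ORDER BY 2 DESC LIMIT 10",
)
statement_id = resp["Id"]

# The Data API is asynchronous, so poll until the statement finishes.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement_id)
    for record in result["Records"]:
        print(record)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;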

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop987p4ssls3lxpk5gly.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop987p4ssls3lxpk5gly.gif" alt="Image description" width="480" height="260"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Redshift Architecture
&lt;/h2&gt;

&lt;p&gt;Amazon Redshift’s architecture is the cornerstone of its efficiency and high-speed performance when dealing with vast data volumes. Redshift utilizes a Massively Parallel Processing (MPP) data warehouse architecture, which partitions data across multiple nodes and executes queries in parallel, dramatically enhancing query performance. Here’s a deeper look at its design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Cluster:&lt;/em&gt; The fundamental building block of Amazon Redshift data warehouse is a cluster. A cluster is a set of nodes, which consists of a leader node and one or more compute nodes. The number of compute nodes can be scaled up or down depending upon the processing power needed, and each node has its own CPU, storage, and RAM.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Leader Node:&lt;/em&gt; The leader node is the orchestrator of the Redshift environment. It manages communication between client applications and the compute nodes. Client applications send SQL requests to the leader node, which parses them and creates optimized query execution plans. The leader node then coordinates query execution with the compute nodes and compiles the final results to send back to the client applications. This node is also responsible for managing the distribution of data to the compute nodes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Compute Nodes:&lt;/em&gt; Compute nodes are responsible for executing the query plans received from the leader node. Each compute node scans its local data blocks and performs the operations needed by the query. Intermediate results are then sent back to the leader node for aggregation before the results are returned to the client. The compute nodes underpin the MPP (Massively Parallel Processing) architecture of Amazon Redshift.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Node Slices:&lt;/em&gt; Each compute node is divided into slices. The number of slices per node depends on the node size of the cluster. Each slice is allocated a portion of the node’s memory and disk space, and it operates independently of other slices. When a query is run, each slice can work on its portion of the data concurrently, which contributes to Redshift’s high query performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Columnar Storage:&lt;/em&gt; Redshift uses columnar storage, which means data is stored by column rather than by row. This can dramatically improve query speed, as it means that only the columns needed for a query are read from the disk, reducing the amount of I/O and boosting query performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Distribution:&lt;/em&gt; Redshift distributes the rows of a table to the compute nodes according to a key chosen when the table is created. Proper choice of this key can significantly speed up query performance by minimizing the amount of data that needs to be transferred between nodes during query execution.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Compression:&lt;/em&gt; Redshift uses various encoding techniques to compress columns of data, which can result in less disk I/O and faster query performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This robust and thoughtfully designed architecture allows Amazon Redshift to efficiently manage and process huge volumes of data, making it a go-to solution for organizations dealing with big data analytics.&lt;/p&gt;
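
&lt;p&gt;To connect the distribution-key and sort-key ideas above to something concrete, here is a hedged sketch of Redshift table DDL issued over a standard PostgreSQL-protocol connection with &lt;code&gt;psycopg2&lt;/code&gt;; the endpoint, credentials, and the &lt;code&gt;sales&lt;/code&gt; table are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psycopg2

# Hypothetical Redshift endpoint and credentials.
conn = psycopg2.connect(
    host="analytics-cluster.abc123xyz.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)
cur = conn.cursor()

# DISTKEY co-locates rows with the same customer_id on the same slice, which
# minimizes data movement for joins on that column; SORTKEY keeps blocks
# ordered by sale_date so range filters can skip most of the table; ENCODE
# picks a column compression scheme that reduces disk I/O.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12, 2) ENCODE az64
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date)
""")
conn.commit()

cur.close()
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;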

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7x3s7293m241fqm3dca.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7x3s7293m241fqm3dca.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt; (Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits and Use Cases of AWS Redshift
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Redshift provides several benefits that make it a potent choice for businesses looking to leverage their data effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;AWS Integration:&lt;/em&gt; As part of the AWS ecosystem, Redshift integrates seamlessly with other AWS services such as S3, Kinesis, and DynamoDB, which facilitates diverse data workflows.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Robust Security:&lt;/em&gt; Redshift provides robust security features like automatic encryption, network isolation using Amazon VPC, and robust access control policies, ensuring your sensitive data is protected.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cost-Effectiveness:&lt;/em&gt; With Redshift’s ability to automatically scale resources, businesses only pay for what they need, making it a cost-effective solution. Also, Redshift’s columnar storage and data compression reduce the amount of storage needed, leading to additional cost savings.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Performance:&lt;/em&gt; Redshift’s columnar storage, parallel query execution, and data compression lead to high-performance data processing, allowing businesses to gain insights from their data quickly.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scalability:&lt;/em&gt; Redshift allows you to start with a few hundred gigabytes of data and scale up to a petabyte or more, making it an excellent choice for businesses of all sizes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redshift is ideal for various scenarios, but it truly shines in the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Business Intelligence (BI) Tools:&lt;/em&gt; Redshift integrates well with various BI tools like Tableau, Looker, and QuickSight, enabling organizations to create visualizations and perform detailed data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Lake Analytics:&lt;/em&gt; With Redshift Spectrum, users can directly query data in an Amazon S3 data lake without having to move or transform it.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Log Analysis:&lt;/em&gt; Businesses can use Redshift to analyze log data and understand website user behavior, application performance, and security patterns.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Real-Time Analytics:&lt;/em&gt; Combined with other AWS services like Kinesis, Redshift can power real-time analytics applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gl992r7bc0i8t42r3ky.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gl992r7bc0i8t42r3ky.gif" alt="Image description" width="500" height="281"&gt;&lt;/a&gt; S(Source: Giphy)&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, AWS Redshift offers a powerful, scalable, and secure data warehousing solution. Its robust features and benefits, combined with seamless integration within the AWS ecosystem, make it a formidable tool for businesses looking to glean valuable insights from their data. Whether it’s powering real-time analytics, driving business intelligence tools, or analyzing vast data lakes, Redshift’s potential to unlock the power of big data is immense.&lt;/p&gt;

&lt;p&gt;In episode 2 of the data warehouse series, we’ll explore &lt;a href="https://www.snowflake.com/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/aws-redshift-robust-and-scalable-data-warehousing" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/aws-redshift-robust-and-scalable-data-warehousing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>redshift</category>
      <category>datawarehousing</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Stream data processing with Mage</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Tue, 13 Jun 2023 01:55:37 +0000</pubDate>
      <link>https://forem.com/mage_ai/stream-data-processing-with-mage-28e9</link>
      <guid>https://forem.com/mage_ai/stream-data-processing-with-mage-28e9</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Dive into the implementation of stream data processing with Mage, using Kafka as the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Mage&lt;/li&gt;
&lt;li&gt;Why is kafka a popular component of streaming applications?&lt;/li&gt;
&lt;li&gt;Step by step guide to create streaming pipeline on Mage&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Mage
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; is a powerful data processing tool allowing integration and synchronization of data from third-party sources. It supports building real-time and batch pipelines using Python, SQL, and R, making data transformation simple and efficient. Moreover, it enables running, monitoring, and orchestrating thousands of pipelines, ensuring a smooth data operation without the risk of data loss or interruption.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppcogtqusq4sp0bsw4z.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppcogtqusq4sp0bsw4z.gif" alt="Image description" width="480" height="400"&gt;&lt;/a&gt; &lt;br&gt;
Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is kafka a popular component of streaming applications?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; is an open-source stream-processing software platform developed by LinkedIn and later donated to the Apache Software Foundation. It's built on the publish-subscribe messaging system and designed to handle real-time data feeds. Kafka is essentially a distributed event log service that is fault-tolerant, highly scalable, and provides high throughput for publishing and subscribing records.&lt;/p&gt;

&lt;p&gt;Given its robust features, Kafka is a popular component of streaming applications due to the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Performance and Scalability:&lt;/em&gt; Kafka can handle real-time data feeds on a large scale, processing millions of messages per second. Its distributed architecture allows for effortless scalability.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Durability and Reliability:&lt;/em&gt; Kafka's distributed commit log ensures robust data persistence, safeguarding against data loss. If a node fails, the data can still be retrieved from other nodes, hence ensuring reliability.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Fault Tolerance:&lt;/em&gt; Kafka can handle system failures without impacting the availability of data streams, which is crucial for applications that require constant, uninterrupted access to data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Real-time Processing:&lt;/em&gt; Kafka supports both batch and real-time use cases, providing developers with flexibility when creating various applications.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Integration Capabilities:&lt;/em&gt; Kafka can integrate with a wide range of programming languages and data systems, making it versatile for differing application needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka's popularity stems from its high performance, reliability, fault tolerance, real-time processing, and comprehensive integration capabilities.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3auv3z1ysydih59gps.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3auv3z1ysydih59gps.gif" alt="Image description" width="431" height="255"&gt;&lt;/a&gt; &lt;br&gt;
Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations for choosing the right tool for specific use cases
&lt;/h2&gt;

&lt;p&gt;Dive into a comparison of Flink and Spark based on their performance benchmarks and scalability. Discover how they handle processing speed, in-memory computing, resource management, and more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Processing Speed:&lt;/em&gt; Flink excels in &lt;strong&gt;low-latency&lt;/strong&gt;, high-throughput stream processing, while Spark is known for its fast batch processing capabilities. Both frameworks can process large volumes of data quickly, with Flink focusing on real-time analytics and Spark catering to &lt;strong&gt;batch&lt;/strong&gt; data processing tasks.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;In-Memory Computing:&lt;/em&gt; Both Flink and Spark leverage in-memory computing, which allows them to cache intermediate results during data processing tasks. This approach significantly reduces the time spent on &lt;strong&gt;disk I/O&lt;/strong&gt; operations and improves overall performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Resource Management:&lt;/em&gt; Flink and Spark can efficiently manage resources by dynamically allocating and deallocating them according to workload requirements. This enables both frameworks to scale horizontally, handling large-scale data processing tasks across multiple nodes in a distributed environment.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Adaptive Query Execution:&lt;/em&gt; Spark's Adaptive Query Execution (&lt;strong&gt;AQE&lt;/strong&gt;) feature optimizes query execution plans at runtime, allowing it to adapt to changing data and workload characteristics. This results in improved performance and resource utilization. Flink, on the other hand, does not currently have an equivalent feature.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Backpressure Handling:&lt;/em&gt; Flink is designed to handle backpressure, ensuring that the system remains stable even under high loads. This is achieved through its built-in flow control mechanisms, which prevent data processing bottlenecks. Spark Streaming, in contrast, may &lt;strong&gt;struggle&lt;/strong&gt; to handle &lt;strong&gt;backpressure&lt;/strong&gt;, leading to potential performance degradation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Partitioning:&lt;/em&gt; Both Flink and Spark utilize data partitioning techniques to improve parallelism and optimize resource utilization during data processing tasks. While Spark employs RDDs and data partitioning strategies like Hash and Range partitioning, Flink uses &lt;strong&gt;operator chaining&lt;/strong&gt; and pipelined execution to optimize data processing performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w6mh08wkawzs3h91ohm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w6mh08wkawzs3h91ohm.gif" alt="Image description" width="237" height="185"&gt;&lt;/a&gt; &lt;br&gt;
Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step guide to create streaming pipeline on Mage
&lt;/h2&gt;

&lt;p&gt;Set up Kafka&lt;/p&gt;

&lt;p&gt;Here is a quick guide on how to run and use Kafka locally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone repository: &lt;code&gt;git clone https://github.com/wurstmeister/kafka-docker.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change directory into that repository: &lt;code&gt;cd kafka-docker&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Edit the &lt;code&gt;docker-compose.yml&lt;/code&gt; file to match this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "2"
services:
  zookeeper:
    image: wurstmeister/zookeeper:3.4.6
    ports:
      - "2181:2181"
  kafka:
    build: .
    container_name: docker_kafka
    ports:
      - "9092:9092"
    expose:
      - "9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Start Docker: &lt;code&gt;docker-compose up&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start a terminal session in the running container:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -i -t -u root $(docker ps | grep docker_kafka | cut -d' ' -f1) /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a topic:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$KAFKA_HOME/bin/kafka-topics.sh --create --partitions 4 --bootstrap-server kafka:9092 -
topic test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;List all available topics in Kafka instance:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server kafka:9092 --list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Start a producer on topic named &lt;code&gt;test&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic=test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Send messages to the topic named &lt;code&gt;test&lt;/code&gt; by typing the following in the terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;hello
&amp;gt;this is a test
&amp;gt;test 1
&amp;gt;test 2
&amp;gt;test 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Open another terminal and start a consumer on the topic named &lt;code&gt;test&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$KAFKA_HOME/bin/kafka-console-consumer.sh --from-beginning --bootstrap-server kafka:9092
--topic=test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The output should look something like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hello
test 1
test 3
this is a test
test 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setup stream data ingestion in Mage&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the following command to run Docker in network mode:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it -p 6789:6789 -v $(pwd):/home/src \
  --env AWS_ACCESS_KEY_ID=your_access_key_id \
  --env AWS_SECRET_ACCESS_KEY=your_secret_access_key \
  --env AWS_REGION=your_region \
  --network kafka-docker_default \
  mageai/mageai /app/run_app.sh  mage start default_repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the network named &lt;code&gt;kafka-docker_default&lt;/code&gt; doesn’t exist, create a new network:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker network create -d bridge kafka-docker_default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Check that it exists:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker network ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Mage, running in its own Docker container, is not able to connect to the Kafka instance running locally in another Docker container, follow these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone Mage: &lt;code&gt;git clone https://github.com/mage-ai/mage-ai.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change directory into Mage: &lt;code&gt;cd mage-ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Edit the &lt;code&gt;docker-compose.yml&lt;/code&gt; file to match this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3'
services:
  server:
    ... (original config)
    networks:
      - kafka
  app:
    ... (original config)
networks:
  kafka:
    name: kafka-docker_default
    external: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Run the following script in terminal: &lt;code&gt;./scripts/dev.sh&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will run Mage in development mode, which runs it in a Docker container using docker compose instead of docker run.&lt;/p&gt;

&lt;p&gt;Create streaming data pipeline&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Mage in your browser.&lt;/li&gt;
&lt;li&gt;Click + &lt;code&gt;New pipeline&lt;/code&gt;, then select &lt;code&gt;Streaming&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add a data loader block, select &lt;code&gt;Kafka&lt;/code&gt;, and paste the following:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connector_type: kafka
bootstrap_server: "localhost:9092"
topic: test
consumer_group: unique_consumer_group
batch_size: 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;By default, the &lt;code&gt;bootstrap_server&lt;/code&gt; is set to &lt;code&gt;localhost:9092&lt;/code&gt;. If you’re running Mage in a container, the &lt;code&gt;bootstrap_server&lt;/code&gt; should be &lt;code&gt;kafka:9093&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Messages are consumed from the source in micro-batch mode for better efficiency. The default batch size is 100; you can adjust it in the source config.&lt;/li&gt;
&lt;li&gt;Add a transformer block and paste the following:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Dict, List

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

@transformer
def transform(messages: List[Dict], *args, **kwargs):
    for msg in messages:
        print(msg)

    return messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add a data exporter block, select OpenSearch and paste the following:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connector_type: opensearch
host: https://search-something-abcdefg123456.us-west-1.es.amazonaws.com/
index_name: python-test-index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the &lt;code&gt;host&lt;/code&gt; to match your OpenSearch domain’s endpoint.&lt;/li&gt;
&lt;li&gt;Change the &lt;code&gt;index_name&lt;/code&gt; to match the index you want to export data into.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test pipeline&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the streaming pipeline you just created, and in the right side panel near the bottom, click the button Execute pipeline to test the pipeline.&lt;/li&gt;
&lt;li&gt;You should see an output like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[streaming_pipeline_test] Start initializing kafka consumer.
[streaming_pipeline_test] Finish initializing kafka consumer.
[streaming_pipeline_test] Start consuming messages from kafka.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Publish messages using Python&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open a terminal on your local workstation.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;kafka-python&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kafka-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Open a Python shell and write the following code to publish messages:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaProducer
from random import random
import json

topic = 'test'
producer = KafkaProducer(
    bootstrap_servers='kafka:9093',
)

def publish_messages(limit):
    for i in range(limit):
        data = {
            'title': 'test_title',
            'director': 'Bennett Miller',
            'year': '2011',
            'rating': random(),
        }
        producer.send(topic, json.dumps(data).encode('utf-8'))

publish_messages(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Once you run the code snippet above, go back to your streaming pipeline in Mage and the output should look like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[streaming_pipeline_test] Start initializing kafka consumer.
[streaming_pipeline_test] Finish initializing kafka consumer.
[streaming_pipeline_test] Start consuming messages from kafka.
[streaming_pipeline_test] [Kafka] Receive message 2:16: v=b'{"title": "test_title",
"director": "Bennett Miller", "year": "2011", "rating": 0.7010424523477785}',
time=1665618592.226788
[streaming_pipeline_test] [Kafka] Receive message 0:16: v=b'{"title": "test_title",
"director": "Bennett Miller", "year": "2011", "rating": 0.7886308380991354}',
time=1665618592.2268753
[streaming_pipeline_test] [Kafka] Receive message 0:17: v=b'{"title": "test_title",
"director": "Bennett Miller", "year": "2011", "rating": 0.0673276352704153}',
time=1665618592.2268832
[streaming_pipeline_test] [Kafka] Receive message 3:10: v=b'{"title": "test_title",
"director": "Bennett Miller", "year": "2011", "rating": 0.37935417366095525}',
time=1665618592.2268872
[streaming_pipeline_test] [Kafka] Receive message 3:11: v=b'{"title": "test_title",
"director": "Bennett Miller", "year": "2011", "rating": 0.21110511524126563}',
time=1665618592.2268918
[streaming_pipeline_test] {'title': 'test_title', 'director': 'Bennett Miller', 'year':
'2011', 'rating': 0.7010424523477785}
[streaming_pipeline_test] {'title': 'test_title', 'director': 'Bennett Miller', 'year':
'2011', 'rating': 0.7886308380991354}
[streaming_pipeline_test] {'title': 'test_title', 'director': 'Bennett Miller', 'year':
'2011', 'rating': 0.0673276352704153}
[streaming_pipeline_test] {'title': 'test_title', 'director': 'Bennett Miller', 'year':
'2011', 'rating': 0.37935417366095525}
[streaming_pipeline_test] {'title': 'test_title', 'director': 'Bennett Miller', 'year':
'2011', 'rating': 0.21110511524126563}
[streaming_pipeline_test] [Opensearch] Batch ingest data [{'title': 'test_title',
'director': 'Bennett Miller', 'year': '2011', 'rating': 0.7010424523477785}, {'title':
'test_title', 'director': 'Bennett Miller', 'year': '2011', 'rating': 0.7886308380991354},
{'title': 'test_title', 'director': 'Bennett Miller', 'year': '2011', 'rating':
0.0673276352704153}, {'title': 'test_title', 'director': 'Bennett Miller', 'year': '2011',
'rating': 0.37935417366095525}, {'title': 'test_title', 'director': 'Bennett Miller',
'year': '2011', 'rating': 0.21110511524126563}], time=1665618592.2294626
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consume messages using Python&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to programmatically consume messages from a Kafka topic, here is a code snippet:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaConsumer
import time

topic = 'test'
consumer = KafkaConsumer(
    topic,
    group_id='test',
    bootstrap_servers='kafka:9093',
)

for message in consumer:
    print(f"{message.partition}:{message.offset}: v={message.value}, time={time.time()}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run in production&lt;/p&gt;

&lt;p&gt;To keep the streaming pipeline running continuously in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.mage.ai/design/data-pipeline-management#create-trigger" rel="noopener noreferrer"&gt;Create a trigger&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Once the trigger is created, click the Start trigger button at the top of the page to make the streaming pipeline active.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjoe3lr6k83wjaazsupy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjoe3lr6k83wjaazsupy.gif" alt="Image description" width="300" height="170"&gt;&lt;/a&gt; &lt;br&gt;
Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, &lt;a href="https://www.mage.ai" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; is an exceptional tool for stream data processing, adept at managing data from various sources and transforming it through real-time and batch pipelines using Python, SQL, and R. It stands out in its capacity to efficiently handle thousands of pipelines simultaneously, ensuring smooth operations and data integrity. Given the increasing need for real-time data processing in today's data-driven world, Mage is positioned as a vital tool in the arsenal of data professionals. Its versatility and robust capabilities make it a reliable choice for handling complex and voluminous streaming data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog: &lt;a href="https://www.mage.ai/blog/stream-data-processing-with-Mage" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/stream-data-processing-with-Mage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>streaming</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>kafka</category>
    </item>
    <item>
      <title>A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Mon, 12 Jun 2023 23:24:05 +0000</pubDate>
      <link>https://forem.com/mage_ai/a-mage-on-the-heros-journey-a-fantasy-epic-on-how-a-startup-rose-from-the-ashes-3od6</link>
      <guid>https://forem.com/mage_ai/a-mage-on-the-heros-journey-a-fantasy-epic-on-how-a-startup-rose-from-the-ashes-3od6</guid>
      <description>&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TLDR&lt;/li&gt;
&lt;li&gt;Prologue&lt;/li&gt;
&lt;li&gt;Adventure, trials, and tribulations&lt;/li&gt;
&lt;li&gt;Death, rebirth, and transformation&lt;/li&gt;
&lt;li&gt;Battle, freedom, and victory&lt;/li&gt;
&lt;li&gt;Epilogue&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; pivoted from an AI platform to an &lt;a href="https://www.mage.ai/github" rel="noopener noreferrer"&gt;open-source data pipeline tool&lt;/a&gt; and is making a huge impact on the lives of data engineers around the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prologue
&lt;/h2&gt;

&lt;p&gt;Once upon a time, there lived a promising young mage who left the magic academy early, journeying off on her own to make a difference in the world.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FGHgquQpTjirDVn03GItQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FGHgquQpTjirDVn03GItQ" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The young mage journeying off on her own to make a difference in the world.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;She believed that helping townsfolk harness the power of AI would create an economic boom across villages around the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adventure, trials, and tribulations
&lt;/h2&gt;

&lt;p&gt;The young mage created an AI tool to help developers at small companies build, train, and deploy AI models. Initially, the villagers were excited and had lots of interest in using the tool. Many of them lined up for days outside the village just to schedule demos and paid trials.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1189%2Cheight%3A1600%2FFGkVhbBFTPW3QG9x7J8y" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1189%2Cheight%3A1600%2FFGkVhbBFTPW3QG9x7J8y" alt="Image description" width="800" height="1076"&gt;&lt;/a&gt;The young mage showing off the magical AI tool to the villagers.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, when the day approached to implement the tool, villagers kept giving the young mage reasons for why they weren’t ready. Some of these reasons included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“We need a data warehouse first.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“We need to make our first data hire.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“We need a data pipeline management tool first.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“We need to organize and clean our data first.”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After many sleepless nights, the young mage sensed that an evil dark presence had secretly infiltrated the village. This dark force came to be known as the Harbinger of Unnecessary Tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2F2pUhfwPT56EOqL4yn9IA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2F2pUhfwPT56EOqL4yn9IA" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The Harbinger of Unnecessary Tools secretly infiltrates villages and companies.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In that moment, the young mage realized that the AI tool was not a necessity; the villagers had more urgent problems that needed a remedy immediately. This realization was devastating because the AI tool had been worked on for over a year.&lt;/p&gt;

&lt;p&gt;The young mage was defeated by the Harbinger of Unnecessary Tools. In pain and despair, she was driven out of the town and went into hiding; uncertain of her future.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1495%2Cheight%3A1080%2F4DTxliJ7TBuKOnhjHDBx" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1495%2Cheight%3A1080%2F4DTxliJ7TBuKOnhjHDBx" alt="Image description" width="800" height="577"&gt;&lt;/a&gt;The young mage defeated… but not giving up yet!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here are a few lessons inscribed in her tome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Pay attention to why users don’t give a resounding “yes”. Avoid “maybe” like the black plague. A “maybe” can be a secret poison because it gives hope that it’ll eventually become a “yes” when in fact it’s a “no”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fail fast. Find ways to prove the product wrong as fast as possible. The quicker it fails, the more chances there are to try something different and succeed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Death, rebirth, and transformation
&lt;/h2&gt;

&lt;p&gt;The young mage began doubting herself, questioning whether she left the magic academy too early. Injured and depleted of mana (energy that powers magic), she began wandering aimlessly through the abyss of the multiverse. Along the way, she spoke with nearly a thousand data professionals and asked them this question: what was the most boring part of your work?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FAQ0rdqWRVGYJ6clVxKMQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FAQ0rdqWRVGYJ6clVxKMQ" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The young mage traveling through the multiverse.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enlightened by the responses, the young mage meditated on the reasons why the villagers weren’t ready to implement the AI tool. After meditating and deciphering the arcane knowledge of responses gathered throughout the multiverse, the young mage had a revelation: companies need urgent help moving their data and preparing it for usage.&lt;/p&gt;

&lt;p&gt;The young mage began rebuilding and leveling up her powers. She trained day and night for what seemed like an eternity. The young mage took some of the technology she used in the previous AI tool, infused it with power-ups, and open-sourced it. Legend has it that her reborn powers are known as the Data Pipeline Tool.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FLOsFuH86TMGNpQMcqQQ3" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FLOsFuH86TMGNpQMcqQQ3" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The young mage training, leveling up, and upgrading her magic attributes.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, this was no ordinary tool; it has 3 major differentiators:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It’s designed to have the easiest developer experience by providing a user interface; enabling developers to build data pipelines visually, quickly, and intuitively.&lt;/li&gt;
&lt;li&gt;It combines 3 use cases that have strong synergy and affinity for one another: batch processing pipelines, data integration pipelines, and streaming pipelines.&lt;/li&gt;
&lt;li&gt;Engineering best practices are built-in. The tool enables modular design of data pipelines; making each step in your pipeline easily reusable and simple to test with data validations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At full power, the mage was ready to return and defend the village from the Harbinger of Unnecessary Tools!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1156%2Cheight%3A1600%2FIKTGntrRlGkLEBY64o6A" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1156%2Cheight%3A1600%2FIKTGntrRlGkLEBY64o6A" alt="Image description" width="800" height="1107"&gt;&lt;/a&gt;Mage at full power.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Battle, freedom, and victory
&lt;/h2&gt;

&lt;p&gt;The mage walked across astral planes and arrived at the village that was being oppressed by the Harbinger of Unnecessary Tools. She summoned all her powers and open-sourced the data pipeline tool.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1840%2Cheight%3A2080%2FoTIAvKEpQEW0b3joPp7h" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1840%2Cheight%3A2080%2FoTIAvKEpQEW0b3joPp7h" alt="Image description" width="800" height="904"&gt;&lt;/a&gt;The mage getting ready to battle the Harbinger of Unnecessary Tools.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After releasing the tool, spells of fire, water, wind, and lightning were cast at the dark force. With every passing moment, the open-source tool grew more powerful. As the battle raged on, bugs were eliminated, scalability issues were banished, powerful new features were added, and chromatic color began returning to the village.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FLrONyLOfT7GYkP87EORG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FLrONyLOfT7GYkP87EORG" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;Mage casting spells at the dark force.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After a fortnight of intense dueling, the Harbinger of Unnecessary Tools was finally defeated! The darkness of an unnecessary product, that had previously haunted the people, was lifted once and for all.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1470%2Cheight%3A1080%2FW8m2P8SIQT2Zwm37dZ8C" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1470%2Cheight%3A1080%2FW8m2P8SIQT2Zwm37dZ8C" alt="Image description" width="800" height="587"&gt;&lt;/a&gt;The Harbinger of Unnecessary Tools defeated.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everyone was liberated from the dark force’s grip and joy overflowed in the streets. The entire village praised the mage for saving them from the pain of using data tools with a dreadful developer experience. Countless villagers expressed their gratitude and gave many thanks to the mage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FSsOwmEW0QKGfSv5lEnO8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FSsOwmEW0QKGfSv5lEnO8" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The darkness of dreadful developer experiences have finally lifted.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here are some of the testimonies from developers in that village:&lt;/p&gt;

&lt;p&gt;“I was awestruck when I used Mage for the 1st time. It’s super clean and user friendly.” — &lt;a href="https://www.linkedin.com/in/ajshetty28/" rel="noopener noreferrer"&gt;Ajith Shetty, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Recently tried Mage 🧙 and I must say I’m amazed by its developer centric usability.” — &lt;a href="https://www.linkedin.com/in/salman-ahmed987/" rel="noopener noreferrer"&gt;Salman Ahmed, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Mage is such a refreshing orchestrator compared to Airflow.” — &lt;a href="https://www.linkedin.com/in/anilakulkarni/" rel="noopener noreferrer"&gt;Anil Kulkarni, Senior Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“All throughout this Slack space, you guys are quick, resourceful, and have an open mind. It really separates you from other orchestrators.” — &lt;a href="https://www.linkedin.com/in/gregory-lenane-b31bb2125/" rel="noopener noreferrer"&gt;Greg Lenane, Senior Analytics Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“It took minimal work to start understanding and building pipelines as opposed to Airflow.” — &lt;a href="https://www.linkedin.com/in/pedrodellazzari/" rel="noopener noreferrer"&gt;Pedro Dellazzari, Data Scientist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“I would like to express my love for using Mage. My experience with it has been fantastic so far.” — Fabián Sepúlveda, Data Engineer&lt;/p&gt;

&lt;p&gt;“I truly appreciate being part of this amazing community and am honored to have had the opportunity to contribute to its success.” — &lt;a href="https://www.linkedin.com/in/gharsallaoui-dhia-eddine/" rel="noopener noreferrer"&gt;Dhia Gharsallaoui, Data Architect&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Just wanted to say that I’m incredibly impressed with what Mage is capable of. It’s incredibly powerful and user friendly.” — &lt;a href="https://www.linkedin.com/in/matthewpegler/" rel="noopener noreferrer"&gt;Matt Pegler, SVP of Innovation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Hey! Transferring all our stuff to Mage from Airflow. We have around 80 pipelines running (and will be growing), managed by a team of 4.” — &lt;a href="https://www.linkedin.com/in/nazari-goudin-556a55165/" rel="noopener noreferrer"&gt;Nazari Goudin, Head of Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Pushed existing dbt repo into Mage repo as sub module. Oh Man, Mage is 🔥” — &lt;a href="https://www.linkedin.com/in/vijayasarathym/" rel="noopener noreferrer"&gt;Vijayasarathy Muthu, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“To be honest, I am really loving Mage.” — &lt;a href="https://www.linkedin.com/in/alexanderbolano/" rel="noopener noreferrer"&gt;Alexander Bolaño, Senior Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Congrats on creating one helluva DX. Night and day for all other tools we’ve been testing.” — Tomas Roaldsnes&lt;/p&gt;

&lt;p&gt;“I have never seen such a friendly place to ask questions, I love the openness of it!” — &lt;a href="https://www.linkedin.com/in/davisvance/" rel="noopener noreferrer"&gt;Davis Vance, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“I finished the tutorial and my reaction was… DAMMMMM! It is a really nice platform! OMG” — &lt;a href="https://www.linkedin.com/in/paulo-mota-955218a2/" rel="noopener noreferrer"&gt;Paulo Mota, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“I love the tool and see potential for not only being a standard tool but also a standard user experience. This is an amazing product y’all.” — &lt;a href="https://www.linkedin.com/in/farmanp/" rel="noopener noreferrer"&gt;Farman Pirzada, Senior Software Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Just deleted Cloud Composer yesterday and fully moved to Mage.” — &lt;a href="https://www.linkedin.com/in/leminh-nguyen/" rel="noopener noreferrer"&gt;Le Minh Nguyen, Data Scientist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“I’ve been using Mage for 2 months now and I must say that I’m really impressed with the work that has been done so far.” — &lt;a href="https://www.linkedin.com/in/salomon-dion/" rel="noopener noreferrer"&gt;Dion Salomon, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“It massively lowers the barrier for entry on data engineering, which has had to turn into its own specialized profession (Airflow isn’t exactly fun to debug).” — &lt;a href="https://www.linkedin.com/in/essipova/" rel="noopener noreferrer"&gt;Nicolas Essipova, CTO&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“I think this is one of the most beautiful pieces of software I’ve ever used. There is powerful sorcery at work here.” — &lt;a href="https://www.linkedin.com/in/patcl/" rel="noopener noreferrer"&gt;Patrick Clark, Data Engineer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the release of the &lt;a href="https://www.mage.ai/github" rel="noopener noreferrer"&gt;open-source data pipeline tool&lt;/a&gt; in June 2022, the project has received over 4.6 thousand stars on &lt;a href="https://www.mage.ai/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, over 2.1 thousand community members on &lt;a href="https://www.mage.ai/chat" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, and over a hundred teams using the tool in production.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FcYcyCyTSKWMG5GrZVckx" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1760%2Cheight%3A1080%2FcYcyCyTSKWMG5GrZVckx" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;The tool has traction in the village.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In return for defeating the Harbinger of Unnecessary Tools, the Magic Council of Venture Capital decided to award the mage an additional investment of $5 million. This investment was again led by &lt;a href="https://www.gradient.com/" rel="noopener noreferrer"&gt;Gradient Ventures&lt;/a&gt;, with participation from previous investors (&lt;a href="https://www.essencevc.fund/" rel="noopener noreferrer"&gt;Essence VC&lt;/a&gt;, &lt;a href="https://www.designerfund.com/" rel="noopener noreferrer"&gt;Designer Fund&lt;/a&gt;, &lt;a href="https://www.manaventures.vc/" rel="noopener noreferrer"&gt;Mana Ventures&lt;/a&gt;) and an amazing group of new strategic angel investors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/guillermo-rauch-b834b917b/" rel="noopener noreferrer"&gt;Guillermo Rauch&lt;/a&gt; (CEO @ &lt;a href="https://vercel.com/" rel="noopener noreferrer"&gt;Vercel&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/scottbreitenother/" rel="noopener noreferrer"&gt;Scott Breitenother&lt;/a&gt; (CEO @ &lt;a href="https://brooklyndata.co/" rel="noopener noreferrer"&gt;Brooklyn Data Co&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ananthdurai/" rel="noopener noreferrer"&gt;Ananth Packkildurai&lt;/a&gt; (&lt;a href="https://www.dataengineeringweekly.com/" rel="noopener noreferrer"&gt;Data Engineering Weekly&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ryguyrg/" rel="noopener noreferrer"&gt;Ryan Boyd&lt;/a&gt; (Co-founder @ &lt;a href="https://motherduck.com/" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/jordantigani/" rel="noopener noreferrer"&gt;Jordan Tigani&lt;/a&gt; (CEO @ &lt;a href="https://motherduck.com/" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/benn-stancil/" rel="noopener noreferrer"&gt;Benn Stancil&lt;/a&gt; (CTO @ &lt;a href="https://mode.com/" rel="noopener noreferrer"&gt;Mode&lt;/a&gt;)&lt;br&gt;
and several other data industry thought leaders.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A700%2Cheight%3A911%2FRqREUDhJQS26WmHcYqCa" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A700%2Cheight%3A911%2FRqREUDhJQS26WmHcYqCa" alt="Image description" width="700" height="911"&gt;&lt;/a&gt;Gold coins invested to continue powering up data engineers!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition to acquiring more gold, the mage gathered a powerful and wise group of advisors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/eczachly/" rel="noopener noreferrer"&gt;Zach Wilson&lt;/a&gt; (&lt;a href="https://www.linkedin.com/company/eczachly/" rel="noopener noreferrer"&gt;EcZachly&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/benjaminrogojan/" rel="noopener noreferrer"&gt;Benjamin Rogojan&lt;/a&gt; (&lt;a href="https://www.theseattledataguy.com/" rel="noopener noreferrer"&gt;Seattle Data Guy&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/agrigorev/" rel="noopener noreferrer"&gt;Alexey Grigorev&lt;/a&gt; (&lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;DataTalks.Club&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/josephreis/" rel="noopener noreferrer"&gt;Joe Reis&lt;/a&gt; (&lt;a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/" rel="noopener noreferrer"&gt;Fundamentals of Data Engineering&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1098%2Cheight%3A1600%2FdveZiGAkTquyARyrX7sp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1098%2Cheight%3A1600%2FdveZiGAkTquyARyrX7sp" alt="Image description" width="800" height="1165"&gt;&lt;/a&gt;Powerful and wise Mage advisors.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; is on a mission to make data engineering more accessible so that developers can harness the power of data to create magical experiences.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1283%2Cheight%3A1600%2FEitkAjAiRMKpP6QQJGgu" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1283%2Cheight%3A1600%2FEitkAjAiRMKpP6QQJGgu" alt="Image description" width="800" height="997"&gt;&lt;/a&gt;Mage making advanced technology more accessible to the world.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That’s why &lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; is recruiting other fellow mages who want to go on an adventure and achieve this mission together. Here are the roles in the party that are currently open:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.notion.so/Full-Stack-Engineer-Senior-cb05487255f745aa86fc69e94eb7794a" rel="noopener noreferrer"&gt;Full Stack Engineer (Senior)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.notion.so/Front-end-Engineer-Senior-b0d4b80fe38241408d08a00a9ae1151d" rel="noopener noreferrer"&gt;Front-end Engineer (Senior)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.notion.so/Backend-Engineer-Senior-db792ac3788145c9a2d5d7f5be0e7e93" rel="noopener noreferrer"&gt;Backend Engineer (Senior)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.toDeveloper%20Success%20Engineer"&gt;Developer Success Engineer&lt;/a&gt; (aka Developer Relations)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.notion.so/Head-of-Developer-Success-ad10877fd5c64404a27866455316eb4f" rel="noopener noreferrer"&gt;Head of Developer Success&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the coming years, &lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A720%2Cheight%3A442%2FYa0x1USKYXfveoB5WeAb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A720%2Cheight%3A442%2FYa0x1USKYXfveoB5WeAb" alt="Image description" width="720" height="442"&gt;&lt;/a&gt;Open World mode, unlocked.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thank you so much for your relentless support of the mission and continued belief in the century-long vision!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1121%2Cheight%3A1680%2FAM4ImgHiRWicLDf9aROD" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fresize%3Dwidth%3A1121%2Cheight%3A1680%2FAM4ImgHiRWicLDf9aROD" alt="Image description" width="800" height="1198"&gt;&lt;/a&gt;It’s never the end. Mage will be the 1st millenia long company.&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>dataengineering</category>
      <category>datapipelines</category>
      <category>etl</category>
    </item>
    <item>
      <title>How to build a data pipeline using Delta Lake</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Fri, 19 May 2023 07:26:11 +0000</pubDate>
      <link>https://forem.com/mage_ai/how-to-build-a-data-pipeline-using-delta-lake-l78</link>
      <guid>https://forem.com/mage_ai/how-to-build-a-data-pipeline-using-delta-lake-l78</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Combine powerful database features with the flexibility of an object storage system by using the Delta Lake framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Intro&lt;/li&gt;
&lt;li&gt;Prologue&lt;/li&gt;
&lt;li&gt;Defend the planet&lt;/li&gt;
&lt;li&gt;Final battle&lt;/li&gt;
&lt;li&gt;Epilogue&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fgot82qTFSIgWAlbZfhIX" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fgot82qTFSIgWAlbZfhIX" alt="Image description" width="700" height="201"&gt;&lt;/a&gt;&lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;
&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What’s Delta Lake?
&lt;/h3&gt;

&lt;p&gt;This sounds like a new trending destination to take selfies in front of, but it’s even better than that. &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; is an “open-source storage layer designed to run on top of an existing data lake and improve its reliability, security, and performance.” (&lt;a href="https://www.hpe.com/us/en/what-is/delta-lake.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;). It lets you interact with an object storage system like you would with a database.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why is it useful?
&lt;/h3&gt;

&lt;p&gt;An object storage system (e.g. &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us/products/storage/blobs" rel="noopener noreferrer"&gt;Azure Blob Storage&lt;/a&gt;, &lt;a href="https://cloud.google.com/storage" rel="noopener noreferrer"&gt;Google Cloud Platform Cloud Storage&lt;/a&gt;, etc.) makes it easy and simple to save large amounts of historical data and retrieve it for future use.&lt;/p&gt;

&lt;p&gt;The downside of such systems is that you don’t get the benefits of a traditional database; e.g. ACID transactions, rollbacks, schema management, DML (data manipulation language) operations like merge and update, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; gives you best of both worlds. For more details on the benefits, check out their &lt;a href="https://docs.delta.io/latest/delta-intro.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Install Delta Lake
&lt;/h3&gt;

&lt;p&gt;To install the Delta Lake Python library, run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deltalake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setup Delta Lake storage
&lt;/h3&gt;

&lt;p&gt;Delta Lake currently supports several storage backends:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Blob Storage&lt;/li&gt;
&lt;li&gt;GCP Cloud Storage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before we can use Delta Lake, please set up one of the above storage options. For more information on Delta Lake’s supported storage backends, read their documentation on &lt;a href="https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, &lt;a href="https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants" rel="noopener noreferrer"&gt;Azure Blob Storage&lt;/a&gt;, and &lt;a href="https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants" rel="noopener noreferrer"&gt;GCP Cloud Storage&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prologue
&lt;/h2&gt;

&lt;p&gt;We live in a multi-verse of planets and galaxies. Amongst the multi-verse, there exists a group of invaders determined to conquer all friendly planets. They call themselves the Gold Legion.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FAIP2SlpSI6UfyMNnisyQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FAIP2SlpSI6UfyMNnisyQ" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; The Gold Legion&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many millennia ago, the Gold Legion conquered vast numbers of planets whose names have now been lost to history. Before these planets fell, they spent their final days exporting what they had learned about their invaders into the fabric of space, with the hope that future generations would survive the oncoming calamity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FUVwAdCHyT4S0ahuwkV0Y" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FUVwAdCHyT4S0ahuwkV0Y" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Share our data to save the universe&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Along with the battle data, these civilizations exported their avatar’s magic energy into the cosmos so that others could one day harness it.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to use Delta Lake
&lt;/h3&gt;

&lt;p&gt;The current unnamed planet is falling. We have 1 last chance to export the battle data we learned about the Gold Legion. We’re going to use a reliable technology called Delta Lake for this task.&lt;/p&gt;

&lt;p&gt;First, download a CSV file and create a dataframe object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd


df = pd.read_csv(
    'https://raw.githubusercontent.com/mage-ai/datasets/master/battle_history.csv',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a Delta Table from the dataframe object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake.writer import write_deltalake


write_deltalake(
    # Change this URI to your own unique URI
    's3://mage-demo-public/battle-history/1337',
    data=df,
    mode='overwrite',
    overwrite_schema=True,
    storage_options={
        'AWS_REGION': '...',
        'AWS_ACCESS_KEY_ID': '...',
        'AWS_SECRET_ACCESS_KEY': '...',
        'AWS_S3_ALLOW_UNSAFE_RENAME': 'true',
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to append the data to an existing table, change the &lt;code&gt;mode&lt;/code&gt; argument to &lt;code&gt;append&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t want to change the schema when writing to an existing table, change the &lt;code&gt;overwrite_schema&lt;/code&gt; argument to &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When creating or appending data to a table, you can optionally write that data using partitions. Set the keyword argument &lt;code&gt;partition_by&lt;/code&gt; to a list of 1 or more column names to use as the partition for the table. For example, &lt;code&gt;partition_by=['planet', 'universe']&lt;/code&gt;.&lt;/p&gt;
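
&lt;p&gt;As a quick illustration, here is a sketch that appends more rows to a table and partitions them at the same time. It reuses the &lt;code&gt;df&lt;/code&gt; dataframe from the earlier step; the bucket URI and credential values are placeholders to replace with your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake.writer import write_deltalake


# Append rows from the dataframe, partitioned by the planet column.
# The URI and credentials below are placeholders; use your own values.
write_deltalake(
    's3://your-bucket/battle-history-partitioned/1337',
    data=df,
    mode='append',
    partition_by=['planet'],
    storage_options={
        'AWS_REGION': '...',
        'AWS_ACCESS_KEY_ID': '...',
        'AWS_SECRET_ACCESS_KEY': '...',
        'AWS_S3_ALLOW_UNSAFE_RENAME': 'true',
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;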

&lt;p&gt;&lt;em&gt;For more options to customize your usage of Delta Lake, check out their awesome API &lt;a href="https://delta-io.github.io/delta-rs/python/usage.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re not sure what keys are available to use in the storage options dictionary, refer to these examples depending on the storage backend you’re using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Blob Storage&lt;/li&gt;
&lt;li&gt;GCP Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Defend the planet
&lt;/h2&gt;

&lt;p&gt;Fast forward to the present day and the Gold Legion has found Earth. They are beginning the invasion of our home planet. We must defend it!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FeN2l0WHfTfiFOKaZWVHu" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FeN2l0WHfTfiFOKaZWVHu" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Defend Earth!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Load data from Delta Lake
&lt;/h3&gt;

&lt;p&gt;Let’s use Delta Lake to load battle history data from within the fabric of space.&lt;/p&gt;

&lt;p&gt;If you don’t have AWS credentials, you can use these read-only credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS_ACCESS_KEY_ID=AKIAZ4SRK3YKQJVOXW3Q
AWS_SECRET_ACCESS_KEY=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
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable


dt = DeltaTable(
    # Change this to your unique URI from a previous step
    # if you’re using your own AWS credentials.
    's3://mage-demo-public/battle-history/1337',
    storage_options={
        'AWS_ACCESS_KEY_ID': '...',
        'AWS_SECRET_ACCESS_KEY': '...',
    },
)
dt.to_pandas()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is how the data could look:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrJPkMqoiRieCX6NfYWse" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrJPkMqoiRieCX6NfYWse" alt="Image description" width="700" height="285"&gt;&lt;/a&gt; Sample battle history data&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we’ve acquired battle data from various interstellar planets across the multi-verse spanning many millennia, planet Earth has successfully halted the Gold Legion’s advances into the atmosphere!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F8HEAvBblRoCYiepHOx60" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2F8HEAvBblRoCYiepHOx60" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; We successfully defended the planet!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, the invaders are still in the Milky Way and are plotting their next incursion into our planet. Do you want to repel them once and for all? If so, proceed to the section labeled &lt;strong&gt;“Craft data pipeline (optional)”&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Time travel with versioned data
&lt;/h3&gt;

&lt;p&gt;In the multi-verse, time is a concept that can be controlled. With Delta Lake, you can access data that has been created at different times. For example, let’s take the battle data, create a new table, and append data to that table several times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake.writer import write_deltalake


# ['Aiur', 'Eos', 'Gaia', 'Kamigawa', 'Korhal', 'Ravnica']
planets = list(sorted(set(df['planet'].values)))

# Loop through each planet
for planet in planets:
    # Select a subset of the battle history data for a single planet
    planet_df = df.query(f"`planet` == '{planet}'")

    # Write to Delta Lake for each planet and keep appending the data
    write_deltalake(
        # Change this URI to your own unique URI
        's3://mage-demo-public/battle-history-versioned/1337',
        data=planet_df,
        mode='append',
        storage_options={
            'AWS_REGION': '...',
            'AWS_ACCESS_KEY_ID': '...',
            'AWS_SECRET_ACCESS_KEY': '...',
            'AWS_S3_ALLOW_UNSAFE_RENAME': 'true',
        },
    )
    print(
        f'Created table with {len(planet_df.index)} records for planet {planet}.',
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation will have appended data 6 times. Using Delta Lake, you can travel back in time and retrieve the data using the &lt;code&gt;version&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable




dt = DeltaTable(
    # Change this to your unique URI from a previous step
    # if you’re using your own AWS credentials.
    's3://mage-demo-public/battle-history-versioned/1337',
    storage_options={
        'AWS_ACCESS_KEY_ID': '...',
        'AWS_SECRET_ACCESS_KEY': '...',
    },
    version=0,
)
dt.to_pandas()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table returned will only include data from the planet Aiur because the 1st append operation only had data from that planet. If you change the &lt;code&gt;version&lt;/code&gt; argument value from 0 to 1, the table will include data from Aiur and Eos.&lt;/p&gt;
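
&lt;p&gt;To explore this without editing the code by hand, here is a small sketch that loops over every version of the table and prints how many records each one contains. It assumes the same table URI and read-only credentials shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable


# Reuse the table URI and read-only credentials from the previous step.
table_uri = 's3://mage-demo-public/battle-history-versioned/1337'
storage_options = {
    'AWS_ACCESS_KEY_ID': '...',
    'AWS_SECRET_ACCESS_KEY': '...',
}

# The latest table knows its current version number.
latest = DeltaTable(table_uri, storage_options=storage_options)
print(f'Latest version: {latest.version()}')

# Load each version in turn to see how the table grew with every append.
for version in range(latest.version() + 1):
    dt = DeltaTable(table_uri, storage_options=storage_options, version=version)
    print(f'Version {version} has {len(dt.to_pandas().index)} records.')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;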




&lt;h3&gt;
  
  
  Craft data pipeline (optional)
&lt;/h3&gt;

&lt;p&gt;If you made it this far, then you’re determined to stop the Gold Legion for good. In order to do that, we must build a data pipeline that will continuously gather magic energy in addition to constantly collecting battle data from space.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FwgMrca8TQV6rztquoSbG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FwgMrca8TQV6rztquoSbG" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Load data, transform data, export data&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once this data is loaded, we’ll transform the data by deciphering its arcane knowledge and combining it all into a single concentrated source of magical energy.&lt;/p&gt;

&lt;p&gt;The ancients, who came to our planet thousands of years before us, knew this day would come. They crafted a magical data pipeline tool called &lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; that we’ll use to fight the enemy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FEs1wRBIyTxacZg5KvDfy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FEs1wRBIyTxacZg5KvDfy" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Magical data pipelines&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Go to &lt;a href="https://demo.mage.ai/" rel="noopener noreferrer"&gt;demo.mage.ai&lt;/a&gt;, click the &lt;strong&gt;New&lt;/strong&gt; button in the top left corner, and select the option labeled &lt;strong&gt;Standard (batch)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FK2W0uOlHRM2lZ5DY70qc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FK2W0uOlHRM2lZ5DY70qc" alt="Image description" width="700" height="518"&gt;&lt;/a&gt; Create new data pipeline&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Load magic energy
&lt;/h3&gt;

&lt;p&gt;We’ll load the magic energy from the cosmos by reading a table using Delta Lake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the button + &lt;strong&gt;Data loader&lt;/strong&gt;, select &lt;strong&gt;Python&lt;/strong&gt;, and click the option labeled &lt;strong&gt;Generic (no template)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FtkSe8XtcQsSxnjQkz6Qj" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FtkSe8XtcQsSxnjQkz6Qj" alt="Image description" width="700" height="509"&gt;&lt;/a&gt; Add data loader block&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste the following code into the text area:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable


@data_loader
def load_data(*args, **kwargs):
    dt = DeltaTable(
        's3://mage-demo-public/magic-energy/1337', 
        storage_options={
            'AWS_ACCESS_KEY_ID': '...',
            'AWS_SECRET_ACCESS_KEY': '...',
        },
    )
    return dt.to_pandas()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Use the following read-only AWS credentials to read from S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS_ACCESS_KEY_ID=AKIAZ4SRK3YKQJVOXW3Q
AWS_SECRET_ACCESS_KEY=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
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Click the play icon button in the top right corner of the block to run the code:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fq50VlCytRXulUhzFq9Pb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fq50VlCytRXulUhzFq9Pb" alt="Image description" width="800" height="754"&gt;&lt;/a&gt; Run code and preview results&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Transform data
&lt;/h3&gt;

&lt;p&gt;Now that we’ve retrieved the magic energy from the cosmos, let’s combine it with the battle history data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the button + &lt;strong&gt;Transformer&lt;/strong&gt;, select &lt;strong&gt;Python&lt;/strong&gt;, and click the option labeled &lt;strong&gt;Generic (no template)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fo4MSyjS5S1qGxdOsypyz" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fo4MSyjS5S1qGxdOsypyz" alt="Image description" width="800" height="216"&gt;&lt;/a&gt; Add transformer block&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste the following code into the text area:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable
import pandas as pd


@transformer
def transform(magic_energy, *args, **kwargs):
    dt = DeltaTable(
        # Change this to your unique URI from a previous step
        # if you’re using your own AWS credentials.
        's3://mage-demo-public/battle-history/1337',
        storage_options={
            'AWS_ACCESS_KEY_ID': '...',
            'AWS_SECRET_ACCESS_KEY': '...',
        },
    )
    battle_history = dt.to_pandas()

    return pd.concat([magic_energy, battle_history])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Click the play icon button in the top right corner of the block to run the code:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FSSdMej3QwuATquOObteQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FSSdMej3QwuATquOObteQ" alt="Image description" width="800" height="917"&gt;&lt;/a&gt; Run code and preview results&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Export data
&lt;/h3&gt;

&lt;p&gt;Now that we’ve combined millennia worth of battle data with magical energy from countless planets, we can channel that single source of energy into Earth’s Avatar of the Lake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the button + &lt;strong&gt;Data exporter&lt;/strong&gt;, select &lt;strong&gt;Python&lt;/strong&gt;, and click the option labeled &lt;strong&gt;Generic (no template)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Ff202x18T7CR6Idg46T4w" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Ff202x18T7CR6Idg46T4w" alt="Image description" width="700" height="195"&gt;&lt;/a&gt; Add data exporter block&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste the following code into the text area:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake.writer import write_deltalake


@data_exporter
def export_data(combined_data, *args, **kwargs):
    write_deltalake(
        # Change this URI to your own unique URI
        's3://mage-demo-public/magic-energy-and-battle-history/1337',
        data=combined_data,
        mode='overwrite',
        overwrite_schema=True,
        storage_options={
            'AWS_REGION': '...',
            'AWS_ACCESS_KEY_ID': '...',
            'AWS_SECRET_ACCESS_KEY': '...',
            'AWS_S3_ALLOW_UNSAFE_RENAME': 'true',
        },
        partition_by=['planet'],
    )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Click the play icon button in the top right corner of the block to run the code:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fp45PSxlCSQqaEzb5Wtp0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Fp45PSxlCSQqaEzb5Wtp0" alt="Image description" width="800" height="480"&gt;&lt;/a&gt; Run code&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your final magical data pipeline should look something like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FuDBpBazBRg6CjjFcAN0R" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FuDBpBazBRg6CjjFcAN0R" alt="Image description" width="700" height="743"&gt;&lt;/a&gt; Data pipeline in Mage&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Data partitioning
&lt;/h3&gt;

&lt;p&gt;Partitioning your data can improve read performance when querying records. Delta Lake makes data partitioning very easy. In the last data exporter step, we used a keyword argument named &lt;code&gt;partition_by&lt;/code&gt; with the value &lt;code&gt;['planet']&lt;/code&gt;. This will partition the data by the values in the &lt;code&gt;planet&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;To retrieve the data for a specific partition, use the following API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable


dt = DeltaTable(
    # Change this to your unique URI from a previous step
    # if you’re using your own AWS credentials.
    's3://mage-demo-public/magic-energy-and-battle-history/1337',
    storage_options={
        'AWS_ACCESS_KEY_ID': '...',
        'AWS_SECRET_ACCESS_KEY': '...',
    },
)
dt.to_pandas(partitions=[('planet', '=', 'Gaia')])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will return data only for the planet Gaia.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final battle
&lt;/h2&gt;

&lt;p&gt;The Gold Legion’s armies descend upon Earth to annihilate all that stand in its way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FiJS9GWx8TEGRtorIJXIa" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FiJS9GWx8TEGRtorIJXIa" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Invasion of Earth&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As Earth makes its final stand, mages across the planet channel their energy to summon the Avatar of the Lake from its century-long slumber.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FS1EsBnzoTKCCwuGZRKwx" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FS1EsBnzoTKCCwuGZRKwx" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Summon the Avatar&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Gold Legion’s forces clash with the Avatar. At the start of the battle, the Avatar of the Lake lands several powerful blows against the enemy; destroying many of their forces. However, the Gold Legion combines all of its forces into a single entity and goes on the offensive.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FzKtX64jcQuuT5vz2gTj8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FzKtX64jcQuuT5vz2gTj8" alt="Image description" width="800" height="599"&gt;&lt;/a&gt; Gold Legion’s final boss&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Earth’s Avatar is greatly damaged and weakened after a barrage of attacks from the Gold Legion’s unified entity. When all hope seems lost, the magic energy from the cosmos and the battle data from the fabric of space finally merge together and are exported from Earth into the Avatar; filling it with unprecedented celestial power.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrVX5s5xMT5KeTEa0v6yb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrVX5s5xMT5KeTEa0v6yb" alt="Image description" width="700" height="525"&gt;&lt;/a&gt; Avatar of the Lake at full power&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Avatar of the Lake, filled with magic power, destroys the Gold Legion and crushes their will to fight. The invaders leave the galaxy and never return!&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;Congratulations! You learned how to use Delta Lake to create tables and read tables. Using that knowledge, you successfully saved the multi-verse.&lt;/p&gt;

&lt;p&gt;In addition, you defended Earth by using &lt;a href="https://www.mage.ai/" rel="noopener noreferrer"&gt;Mage&lt;/a&gt; to create a data pipeline to load data from different sources, transform that data, and export the final data product using Delta Lake.&lt;/p&gt;

&lt;p&gt;The multi-verse can rest easy knowing heroes like you exist.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Ffl93zJZTEh3ilrkbkHAv" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2Ffl93zJZTEh3ilrkbkHAv" alt="Image description" width="700" height="333"&gt;&lt;/a&gt; You’re a Hero!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Stay tuned for the next episode in the series where you’ll learn how to build low-code data integration pipelines syncing data between various sources and destinations with Delta Lake.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to original blog: &lt;a href="https://www.mage.ai/blog/how-to-build-a-data-pipeline-using-delta-lake" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/how-to-build-a-data-pipeline-using-delta-lake&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Getting started with Apache Flink: A guide to stream processing</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Mon, 15 May 2023 23:05:15 +0000</pubDate>
      <link>https://forem.com/mage_ai/getting-started-with-apache-flink-a-guide-to-stream-processing-e19</link>
      <guid>https://forem.com/mage_ai/getting-started-with-apache-flink-a-guide-to-stream-processing-e19</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;This guide introduces &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; and stream processing, explaining how to set up a Flink environment and create simple applications. Key Flink concepts are covered along with basic troubleshooting and monitoring techniques. It ends with resources for further learning and community support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Apache Flink and stream processing&lt;/li&gt;
&lt;li&gt;Setting up a Flink development environment&lt;/li&gt;
&lt;li&gt;A simple Flink application walkthrough: data ingestion, processing, and output&lt;/li&gt;
&lt;li&gt;Understanding Flink’s key concepts (DataStream API, windows, transformations, sinks, sources)&lt;/li&gt;
&lt;li&gt;Basic troubleshooting and monitoring for Flink applications&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Apache Flink and Stream Processing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; is an open-source, high-performance framework designed for large-scale data processing, excelling at real-time stream processing. It features low-latency and stateful computations, enabling users to process live data and generate insights on-the-fly. Flink is fault-tolerant, scalable, and provides powerful data processing capabilities that cater to various use cases.&lt;/p&gt;

&lt;p&gt;Stream processing is a computing paradigm that enables &lt;strong&gt;real-time data processing&lt;/strong&gt;, handling data as soon as it arrives or is produced. Unlike traditional batch processing systems that deal with data at rest, stream processing handles data in motion. This paradigm is especially useful in scenarios where insights need to be derived immediately, such as real-time analytics, fraud detection, and event-driven systems. Flink's powerful stream-processing capabilities and its high-throughput, low-latency, and exactly-once processing semantics make it an excellent choice for such applications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FXfzYO3WjQ7Wg5T3fL9XS" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FXfzYO3WjQ7Wg5T3fL9XS" alt="Image description" width="800" height="235"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up a Flink development environment
&lt;/h2&gt;

&lt;p&gt;Setting up a development environment for Apache Flink is a straightforward process. Here's a brief step-by-step guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Install Java&lt;/em&gt;: Flink requires Java 8 or 11, so you need to have one of these versions installed on your machine. You can download Java from the &lt;a href="https://www.oracle.com/in/java/technologies/downloads/" rel="noopener noreferrer"&gt;Oracle&lt;/a&gt; website or use OpenJDK.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Download and Install Apache Flink&lt;/em&gt;: You can download the latest &lt;a href="https://flink.apache.org/downloads/" rel="noopener noreferrer"&gt;binary of Apache Flink&lt;/a&gt; from the official Flink website. Once downloaded, extract the files to a location of your choice.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Start a Local Flink Cluster&lt;/em&gt;: Navigate to the Flink directory in a terminal, then go to the &lt;strong&gt;'bin'&lt;/strong&gt; folder. Start a local Flink cluster using the command &lt;strong&gt;./start-cluster.sh&lt;/strong&gt; (for Unix/Linux/macOS) or &lt;strong&gt;start-cluster.bat&lt;/strong&gt; (for Windows).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Check Flink Dashboard&lt;/em&gt;: Open a web browser and visit &lt;a href="http://localhost:8081/" rel="noopener noreferrer"&gt;http://localhost:8081&lt;/a&gt;; you should see the Flink Dashboard, indicating that your local Flink cluster is running successfully.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Set up an Integrated Development Environment (IDE)&lt;/em&gt;: For writing and testing your Flink programs, you can use an IDE such as &lt;strong&gt;IntelliJ IDEA&lt;/strong&gt; or Eclipse. Make sure to also install the Flink plugin if your IDE has one.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Create a Flink Project&lt;/em&gt;: You can create a new Flink project (see the &lt;a href="https://github.com/apache/flink-playgrounds" rel="noopener noreferrer"&gt;Apache Flink Playground&lt;/a&gt;) using a build tool like Maven or Gradle. Flink provides quickstart Maven archetypes to set up a new project easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you've set up your Flink development environment, you're ready to start developing Flink applications. Remember that while this guide describes a basic local setup, a production Flink setup would involve a distributed cluster and possibly integration with other big data tools.&lt;/p&gt;
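
&lt;p&gt;Before moving on, you may want to verify the setup with a minimal job. The sketch below is an illustrative example (the class name and sample values are assumptions, not part of the original article); it can be run directly from the IDE or packaged and submitted to the local cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloFlink {

    public static void main(String[] args) throws Exception {
        // Local environment when run from the IDE, cluster environment when submitted
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build a tiny bounded stream, transform it, and print the results to stdout
        env.fromElements("flink", "stream", "processing")
           .map(String::toUpperCase)
           .print();

        env.execute("Hello Flink");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;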

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FVrmaDEkrRgiDMgiFV1f0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FVrmaDEkrRgiDMgiFV1f0" alt="Image description" width="480" height="480"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A simple Flink application walkthrough: data ingestion, processing, and output
&lt;/h2&gt;

&lt;p&gt;A simple Apache Flink application can be designed to consume a data stream, process it, and then output the results. Let's walk through a basic example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Data Ingestion (Sources)&lt;/em&gt;: Flink applications begin with one or more data sources. A source could be a file on a filesystem, a Kafka topic, or any other data stream.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Processing (Transformations)&lt;/em&gt;: Once the data is ingested, the next step is to process or transform it. This could involve filtering data, aggregating it, or applying any computation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Output (Sinks)&lt;/em&gt;: The final step in a Flink application is to write the processed data to a destination, known as a sink. This could be a file, a database, or a Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Job Execution&lt;/em&gt;: After defining the sources, transformations, and sinks, the Flink job needs to be executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a complete example that reads data from a Kafka topic, performs some basic word count processing on the stream, and then writes the results into a Cassandra table. This example uses Java and Flink's DataStream API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.common.serialization.SimpleStringSchema;

import java.util.Properties;

public class KafkaToCassandraExample {

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // address of your
Kafka server
        properties.setProperty("group.id", "test"); // specify your Kafka consumer group

        DataStream&amp;lt;String&amp;gt; stream = env.addSource(new FlinkKafkaConsumer&amp;lt;&amp;gt;("topic", new
SimpleStringSchema(), properties));

        DataStream&amp;lt;Tuple2&amp;lt;String, Integer&amp;gt;&amp;gt; processedStream = stream
                .flatMap(new Tokenizer())
                .keyBy(0)
                .sum(1);

        CassandraSink.addSink(processedStream)
                .setQuery("INSERT INTO wordcount.word_count (word, count) values (?, ?);")
                .setHost("127.0.0.1") // address of your Cassandra server
                .build();

        env.execute("Kafka to Cassandra Word Count Example");
    }

    public static final class Tokenizer implements FlatMapFunction&amp;lt;String, Tuple2&amp;lt;String,
Integer&amp;gt;&amp;gt; {
        @Override
        public void flatMap(String value, Collector&amp;lt;Tuple2&amp;lt;String, Integer&amp;gt;&amp;gt; out) {
            // normalize and split the line into words
            String[] words = value.toLowerCase().split("\\W+");

            // emit the words
            for (String word : words) {
                if (word.length() &amp;gt; 0) {
                    out.collect(new Tuple2&amp;lt;&amp;gt;(word, 1));
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FwU9KSF8KS7mRcyasWM9g" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FwU9KSF8KS7mRcyasWM9g" alt="Image description" width="272" height="272"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Flink’s key concepts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/overview/" rel="noopener noreferrer"&gt;DataStream API&lt;/a&gt;: Flink's main tool for creating stream processing applications, providing operations to transform data streams.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/" rel="noopener noreferrer"&gt;Windows&lt;/a&gt;: Defines a finite set of stream events for computations, based on count, time, or sessions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/dataset/transformations/" rel="noopener noreferrer"&gt;Transformations&lt;/a&gt;: Operations applied to data streams to produce new streams, including map, filter, flatMap, keyBy, reduce, aggregate, and window.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/" rel="noopener noreferrer"&gt;Sinks&lt;/a&gt;: The endpoints of Flink applications where processed data ends up, such as a file, database, or message queue.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/" rel="noopener noreferrer"&gt;Sources&lt;/a&gt;: The starting points of Flink applications that ingest data from external systems or generate data internally, such as a file or Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/" rel="noopener noreferrer"&gt;Event Time vs. Processing Time&lt;/a&gt;: Flink supports different notions of time in stream processing. Event time is the time when an event occurred, while processing time is the time when the event is processed by the system. Flink excels at event time processing, which is crucial for correct results in many scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/libs/cep/#:~:text=FlinkCEP%20is%20the%20Complex%20Event,what's%20important%20in%20your%20data." rel="noopener noreferrer"&gt;CEP (Complex Event Processing)&lt;/a&gt;: Flink supports CEP, which is the ability to detect patterns and complex conditions across multiple streams of events.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/overview/" rel="noopener noreferrer"&gt;Table API &amp;amp; SQL&lt;/a&gt;: Flink offers a Table API and SQL interface for batch and stream processing. This allows users to write complex data processing applications using a SQL-like expression language.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-statefun-docs-master/" rel="noopener noreferrer"&gt;Stateful Functions (StateFun)&lt;/a&gt;: StateFun is a framework by Apache Flink designed to build distributed, stateful applications. It provides a way to define, manage, and interact with a dynamically evolving distributed state of functions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/flink-architecture/" rel="noopener noreferrer"&gt;Operator Chain and Task&lt;/a&gt;: Flink operators (transformations) can be chained together into a task for efficient execution. This reduces the overhead of thread-to-thread handover and buffering.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/" rel="noopener noreferrer"&gt;Savepoints&lt;/a&gt;: Savepoints are similar to checkpoints, but they are triggered manually and provide a way to version and manage the state of Flink applications. They are used for planned maintenance and application upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/" rel="noopener noreferrer"&gt;State Management&lt;/a&gt;: Flink provides fault-tolerant state management, meaning it can keep track of the state of an application (e.g., the last processed event) and recover it if a failure occurs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/" rel="noopener noreferrer"&gt;Watermarks&lt;/a&gt;: These are a mechanism to denote progress in event time. Flink uses watermarks to handle late events in stream processing, ensuring the system can handle out-of-order events and provide accurate results.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/" rel="noopener noreferrer"&gt;Checkpoints&lt;/a&gt;: Checkpoints are a snapshot of the state of a Flink application at a particular point in time. They provide fault tolerance by allowing an application to revert to a previous state in case of failures.&lt;/li&gt;
&lt;/ul&gt;
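
&lt;p&gt;To make the windowing, event-time, and watermark concepts above concrete, here is a minimal sketch assuming a toy stream of (key, timestamp) pairs; the class name and sample values are illustrative, not from the original article. It assigns bounded-out-of-orderness watermarks and counts events per key in one-minute tumbling event-time windows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedCountExample {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative input: (key, eventTimeMillis) pairs; in practice this would come from a source such as Kafka
        DataStream&amp;lt;Tuple2&amp;lt;String, Long&amp;gt;&amp;gt; events = env.fromElements(
                Tuple2.of("sensor-1", 1000L),
                Tuple2.of("sensor-1", 2000L),
                Tuple2.of("sensor-2", 3000L));

        events
                // Event time: extract timestamps and tolerate 5 seconds of out-of-order data
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .&amp;lt;Tuple2&amp;lt;String, Long&amp;gt;&amp;gt;forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, previousTimestamp) -&amp;gt; event.f1))
                // Replace the timestamp with a count of 1 so the windowed sum yields per-key counts
                .map(event -&amp;gt; Tuple2.of(event.f0, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // Count events per key in one-minute tumbling event-time windows
                .keyBy(event -&amp;gt; event.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

        env.execute("Event-time window example");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because a bounded source like &lt;strong&gt;fromElements&lt;/strong&gt; emits a final watermark when it finishes, the open windows fire and the per-key counts are printed as the job completes.&lt;/p&gt;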

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FItrfQ7AJSSWiyezLHL89" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FItrfQ7AJSSWiyezLHL89" alt="Image description" width="480" height="480"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic troubleshooting and monitoring in Flink
&lt;/h2&gt;

&lt;p&gt;Troubleshooting and monitoring are essential aspects of running Apache Flink applications. Here are some key concepts and tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.cloudera.com/csa/1.2.0/monitoring/topics/csa-hs-webui.html" rel="noopener noreferrer"&gt;Flink Dashboard&lt;/a&gt;: This web-based user interface provides an overview of your running applications, including statistics on throughput, latency, and CPU/memory usage. It also allows you to drill down into individual tasks to identify bottlenecks or issues.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/advanced/logging/" rel="noopener noreferrer"&gt;Logging&lt;/a&gt;: Flink uses SLF4J for logging. Logs can be crucial for diagnosing problems or understanding the behavior of your applications. Log files can be found in the log directory in your Flink installation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/" rel="noopener noreferrer"&gt;Metrics&lt;/a&gt;: Flink exposes a wide array of system and job-specific metrics, such as the number of elements processed, bytes read/written, task/operator/JobManager/TaskManager statistics, and more. These metrics can be integrated with external systems like Prometheus or Grafana.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://catalog.us-east-1.prod.workshops.aws/workshops/429cec9e-3222-4943-82f7-1f45c45ed99a/en-US/2-flinkdashboard/job-level-insights" rel="noopener noreferrer"&gt;Exceptions&lt;/a&gt;: If your application fails to run, Flink will throw an exception with a stack trace, which can provide valuable information about the cause of the error. Reviewing these exceptions can be a key part of troubleshooting.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://catalog.us-east-1.prod.workshops.aws/workshops/429cec9e-3222-4943-82f7-1f45c45ed99a/en-US/2-flinkdashboard/job-level-insights" rel="noopener noreferrer"&gt;Savepoints/Checkpoints&lt;/a&gt;: These provide a mechanism to recover your application from failures. If your application isn't recovering correctly, it's worth investigating whether savepoints/checkpoints are being made correctly and can be successfully restored.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/ops/monitoring/back_pressure/" rel="noopener noreferrer"&gt;Backpressure&lt;/a&gt;: If a part of your data flow cannot process events as fast as they arrive, it can cause backpressure, which can slow down the entire application. The Flink dashboard provides a way to monitor this.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Network Metrics&lt;/em&gt;: Flink provides metrics on network usage, including buffer usage and backpressure indicators. These can be useful for diagnosing network-related issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember, monitoring and troubleshooting are iterative processes. If you notice performance degradation or failures, use these tools and techniques to investigate, identify the root cause, and apply a fix. Then monitor the system again to ensure that the problem has been resolved.&lt;/p&gt;
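
&lt;p&gt;As a small illustration of the metrics point above, a user-defined counter can be registered inside any rich function; it then shows up in the Flink dashboard and in any configured reporter such as Prometheus. The class and metric names below are assumptions for the sketch, not code from the original article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Pass-through mapper that counts how many records it has seen.
// Attach it like any other operator, e.g. stream.map(new CountingMapper()).
public class CountingMapper extends RichMapFunction&amp;lt;String, String&amp;gt; {

    private transient Counter eventsProcessed;

    @Override
    public void open(Configuration parameters) {
        // Register the counter with Flink's metric system
        eventsProcessed = getRuntimeContext().getMetricGroup().counter("eventsProcessed");
    }

    @Override
    public String map(String value) {
        eventsProcessed.inc();
        return value;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;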

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrvCzYF1QeqhGRcxvfBFn" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.graphassets.com%2FrvCzYF1QeqhGRcxvfBFn" alt="Image description" width="480" height="270"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Apache Flink is a robust and versatile open-source stream processing framework that enables fast, reliable, and sophisticated processing of large-scale data streams. Starting with a simple environment setup, we've walked through creating a basic Flink application that ingests, processes, and outputs data. We've also touched on the foundational concepts of Flink, such as the DataStream API, windows, transformations, sinks, and sources, all of which serve as building blocks for more complex applications.&lt;/p&gt;

&lt;p&gt;In episode 4 of the Apache Flink series, we'll see how to consume data from Kafka in real time and process it with Mage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog&lt;/em&gt;: &lt;a href="https://www.mage.ai/blog/getting-started-with-apache-flink-a-guide-to-stream-processing" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/getting-started-with-apache-flink-a-guide-to-stream-processing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>flink</category>
      <category>streamprocessing</category>
      <category>dataprocessing</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Apache Flink vs Apache Spark: A detailed comparison for data processing</title>
      <dc:creator>Mage AI</dc:creator>
      <pubDate>Mon, 08 May 2023 22:01:10 +0000</pubDate>
      <link>https://forem.com/mage_ai/apache-flink-vs-apache-spark-a-detailed-comparison-for-data-processing-36d3</link>
      <guid>https://forem.com/mage_ai/apache-flink-vs-apache-spark-a-detailed-comparison-for-data-processing-36d3</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Dive into a comprehensive comparison of Apache Flink and Apache Spark, exploring their differences and strengths in data processing, to help you decide which framework best suits your data processing needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Apache Flink and Apache Spark&lt;/li&gt;
&lt;li&gt;Comparison of key features&lt;/li&gt;
&lt;li&gt;Performance benchmarks and scalability&lt;/li&gt;
&lt;li&gt;Recommendations for choosing the right tool for specific use cases&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Apache Flink and Apache Spark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; is an open-source, high-performance framework designed for large-scale data processing, excelling at real-time stream processing. It features low-latency and stateful computations, enabling users to process live data and generate insights on-the-fly. Flink is fault-tolerant, scalable, and provides powerful data processing capabilities that cater to various use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, on the other hand, is a versatile, open-source data processing framework that offers an all-in-one solution for batch processing, machine learning, and graph processing. It is known for its ease of use and comprehensive library of built-in tools and algorithms. Like Flink, Spark is fault-tolerant, scalable, and delivers high-performance data processing. Spark's versatility makes it suitable for a wide range of applications and industries.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1a5xgixwvpxg5shv2fi1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1a5xgixwvpxg5shv2fi1.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison of key features
&lt;/h2&gt;

&lt;p&gt;Apache Flink and Apache Spark differ in numerous ways; let's examine their distinctions by comparing key features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Models:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Apache Flink&lt;/em&gt;: Primarily focused on real-time stream processing, Flink efficiently processes large volumes of data with low-latency. Flink's processing engine is built on top of its own streaming runtime and can also handle batch processing.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Apache Spark&lt;/em&gt;: Originally designed for batch processing, Spark later introduced a micro-batching model for handling streaming data. While it can process streaming data, its latency is generally higher than Flink's.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  APIs and Libraries:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Apache Flink&lt;/em&gt;: Provides a robust set of APIs in Java, Scala, and Python for developing data processing applications. Flink's libraries include FlinkML for machine learning, FlinkCEP for complex event processing, and Gelly for graph processing.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Apache Spark&lt;/em&gt;: Offers APIs in Java, Scala, Python, and R, making it accessible to a wider range of developers. Spark also has comprehensive libraries, such as MLlib for machine learning, GraphX for graph processing, and Spark Streaming for processing real-time data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fault Tolerance:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Apache Flink&lt;/em&gt;: Utilizes a distributed snapshotting mechanism, allowing for quick recovery from failures. The state of the processing pipeline is periodically checkpointed, ensuring data consistency in case of failures.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Apache Spark&lt;/em&gt;: Employs a lineage information-based approach for fault tolerance. Spark keeps track of the data transformation sequence, enabling it to recompute lost data in case of failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Windowing:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Apache Flink&lt;/em&gt;: Offers advanced windowing capabilities, including event-time and processing-time windows, as well as session windows for handling complex event patterns. Flink's windowing features are particularly suitable for real-time stream processing (see the session-window sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Apache Spark&lt;/em&gt;: Provides basic windowing functionality, such as tumbling and sliding windows, which work well for batch and micro-batching scenarios but may not be as suited for real-time stream processing.&lt;/li&gt;
&lt;/ul&gt;
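
&lt;p&gt;To make Flink's side of this comparison concrete, here is a minimal sketch of an event-time session window; the class name and sample data are illustrative assumptions, not taken from either project's documentation. Each user's events are grouped into a session that closes after 15 minutes of inactivity, keeping the most recent event per session.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowExample {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative clickstream: (userId, eventTimeMillis) pairs
        env.fromElements(
                Tuple2.of("user-a", 0L),
                Tuple2.of("user-a", 60000L),
                Tuple2.of("user-b", 120000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .&amp;lt;Tuple2&amp;lt;String, Long&amp;gt;&amp;gt;forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((event, previousTimestamp) -&amp;gt; event.f1))
                .keyBy(event -&amp;gt; event.f0)
                // A session stays open while events keep arriving and closes after
                // 15 minutes of event-time inactivity for that user
                .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
                // Keep the latest event in each session (the last-activity timestamp)
                .reduce((a, b) -&amp;gt; a.f1 &amp;gt;= b.f1 ? a : b)
                .print();

        env.execute("Session window example");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;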

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmplxygwq7cp2c3h7t1z.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmplxygwq7cp2c3h7t1z.gif" alt="Image description" width="480" height="480"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance benchmark &amp;amp; scalability
&lt;/h2&gt;

&lt;p&gt;Dive into a comparison of Flink and Spark based on their performance benchmarks and scalability. Discover how they handle processing speed, in-memory computing, resource management, and more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Processing Speed&lt;/em&gt;: Flink excels in &lt;strong&gt;low-latency&lt;/strong&gt;, high-throughput stream processing, while Spark is known for its fast batch processing capabilities. Both frameworks can process large volumes of data quickly, with Flink focusing on real-time analytics and Spark catering to &lt;strong&gt;batch&lt;/strong&gt; data processing tasks.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;In-Memory Computing&lt;/em&gt;: Both Flink and Spark leverage in-memory computing, which allows them to cache intermediate results during data processing tasks. This approach significantly reduces the time spent on &lt;strong&gt;disk I/O&lt;/strong&gt; operations and improves overall performance.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Resource Management&lt;/em&gt;: Flink and Spark can efficiently manage resources by &lt;strong&gt;dynamically&lt;/strong&gt; allocating and deallocating them according to workload requirements. This enables both frameworks to scale horizontally, handling large-scale data processing tasks across multiple nodes in a distributed environment.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Adaptive Query Execution&lt;/em&gt;: Spark's Adaptive Query Execution (&lt;strong&gt;AQE&lt;/strong&gt;) feature optimizes query execution plans at runtime, allowing it to adapt to changing data and workload characteristics. This results in improved performance and resource utilization. Flink, on the other hand, does not currently have an equivalent feature.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Backpressure Handling&lt;/em&gt;: Flink is designed to handle backpressure, ensuring that the system remains stable even under high loads. This is achieved through its built-in flow control mechanisms, which prevent data processing bottlenecks. Spark Streaming, in contrast, may &lt;strong&gt;struggle&lt;/strong&gt; to handle &lt;strong&gt;backpressure&lt;/strong&gt;, leading to potential performance degradation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Partitioning&lt;/em&gt;: Both Flink and Spark utilize data partitioning techniques to improve parallelism and optimize resource utilization during data processing tasks. While Spark employs RDDs and data partitioning strategies like Hash and Range partitioning, Flink uses &lt;strong&gt;operator chaining&lt;/strong&gt; and pipelined execution to optimize data processing performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yxwez3vctmtr7yy2ue4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yxwez3vctmtr7yy2ue4.gif" alt="Image description" width="237" height="185"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations for choosing the right tool for specific use cases
&lt;/h2&gt;

&lt;p&gt;When selecting the right tool between Flink and Spark for specific use cases, consider the following unique technical aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Real-time processing&lt;/em&gt;: If low-latency, real-time processing is a priority, Flink is the better choice, as it was designed specifically for streaming data and offers near-instantaneous processing capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Batch processing&lt;/em&gt;: Spark excels in batch processing and large-scale data processing tasks, with its powerful in-memory processing capabilities and optimized execution engine. If your primary focus is on batch processing, Spark is the recommended choice.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Machine Learning&lt;/em&gt;: &lt;a href="https://spark.apache.org/mllib/" rel="noopener noreferrer"&gt;Spark's MLlib&lt;/a&gt; library offers a comprehensive suite of machine learning algorithms and utilities. If machine learning is a key aspect of your project, Spark is a more suitable choice.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Graph processing&lt;/em&gt;: If your use case involves graph processing, Spark's &lt;a href="https://spark.apache.org/graphx/" rel="noopener noreferrer"&gt;GraphX library&lt;/a&gt; provides a robust and flexible solution for large-scale graph computations. Flink, on the other hand, has Gelly for graph processing, but it is less mature compared to GraphX.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Stateful processing&lt;/em&gt;: Flink provides better support for stateful processing, making it ideal for use cases that require maintaining and updating state information during stream processing (a keyed-state sketch follows after this list).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;API maturity&lt;/em&gt;: While both Flink and Spark provide APIs for various programming languages, Spark's APIs are more mature and stable, providing a better user experience and a wider range of features.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Community and ecosystem&lt;/em&gt;: Spark boasts a more extensive community and ecosystem, offering more resources, support, and third-party integrations. This can be a decisive factor if community support is important for your project.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Deployment options&lt;/em&gt;: Flink provides more flexibility in deployment, as it can be deployed as a standalone cluster, on YARN, or on Kubernetes. Spark, although it also supports YARN and Kubernetes, might have some limitations in standalone mode.&lt;/li&gt;
&lt;/ul&gt;
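
&lt;p&gt;To give a concrete picture of the stateful-processing point above, the sketch below uses Flink's keyed ValueState to maintain a fault-tolerant running count per key. The class, field, and state names are assumptions for illustration, not code from either project.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits (key, runningCount) every time an event for that key arrives.
// Apply it on a keyed stream: stream.keyBy(event -&amp;gt; event.f0).flatMap(new RunningCounter());
public class RunningCounter
        extends RichFlatMapFunction&amp;lt;Tuple2&amp;lt;String, Long&amp;gt;, Tuple2&amp;lt;String, Long&amp;gt;&amp;gt; {

    private transient ValueState&amp;lt;Long&amp;gt; countState;

    @Override
    public void open(Configuration parameters) {
        // Keyed state is scoped to the current key and checkpointed with the job
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor&amp;lt;&amp;gt;("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2&amp;lt;String, Long&amp;gt; event, Collector&amp;lt;Tuple2&amp;lt;String, Long&amp;gt;&amp;gt; out) throws Exception {
        Long current = countState.value();                  // null on the first event for this key
        long updated = (current == null) ? 1L : current + 1L;
        countState.update(updated);
        out.collect(Tuple2.of(event.f0, updated));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;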

&lt;p&gt;Overall, the choice between Flink and Spark depends on the specific requirements of your use case, such as machine learning capabilities, graph processing, stateful processing, API maturity, community support, and deployment options.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfg1r3r27hri7g6syh70.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfg1r3r27hri7g6syh70.gif" alt="Image description" width="500" height="276"&gt;&lt;/a&gt;Source: Giphy&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Apache Flink and Apache Spark are both powerful data processing frameworks, each with its unique strengths and capabilities. The choice between the two depends on your specific use case and requirements. Flink is particularly well-suited for stateful and real-time stream processing, while Spark excels in machine learning and graph processing. Ultimately, understanding the key differences, performance benchmarks, and scalability aspects of both frameworks will help you make an informed decision for your project. Consider factors such as API maturity, community support, and deployment options, along with the technical requirements of your application, to select the best tool that meets your needs.&lt;/p&gt;

&lt;p&gt;In episode 3 of the Apache Flink series, we'll see how to get started with Apache Flink.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Link to the original blog&lt;/em&gt;: &lt;a href="https://www.mage.ai/blog/apache-flink-vs-apache-spark-detailed-comparison-for-data-processing" rel="noopener noreferrer"&gt;https://www.mage.ai/blog/apache-flink-vs-apache-spark-detailed-comparison-for-data-processing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apacheflink</category>
      <category>apachespark</category>
      <category>dataprocessing</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
