Forem: Kriangsak Sumthong (Puk)

AWS MWAA Serverless: Is It Worth the Switch from Provisioned?

Kriangsak Sumthong (Puk) — Wed, 25 Feb 2026 15:58:29 +0000

In the world of Data Engineering, building data pipelines is unavoidable. But what’s even more critical is the Orchestrator. Why? Because we can't (and shouldn't) manually trigger every step of a pipeline. Automation is a necessity.

Today, we have a massive variety of orchestrators to choose from:

Open Source: Apache Airflow (the most popular), Dagster, and Prefect.
Managed/Closed Source: Cloud-native platforms like Google Cloud (Cloud Composer), Microsoft Azure, and Databricks.

Then there’s Amazon Web Services (AWS) with its managed Airflow offering: Amazon MWAA.

The Arrival of AWS MWAA Serverless

At the end of 2025, AWS introduced MWAA Serverless. The core concept is simple: allowing Data Engineers to focus on workflows without worrying about infrastructure management.

In the Provisioned version (non-serverless), AWS spins up dedicated infrastructure behind the scenes. This means you incur costs as long as the environment is active, even if no tasks are running. MWAA Serverless aims to change that.

Orchestration on AWS: MWAA vs. Step Functions
It’s worth noting that MWAA isn’t the only orchestrator on AWS; AWS Step Functions is also a very popular choice. I’ll do a deep dive comparison in a future article, but for now, let's focus on MWAA.

A Quick Airflow Refresher
Before we dive in, remember that in Airflow, we write scripts called DAGs (Directed Acyclic Graphs). A DAG defines the structure of your workflow, and inside each DAG are Tasks—the smallest unit of work.
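To make the "directed acyclic graph" idea concrete, here is a minimal sketch in plain Python, without Airflow itself. The task names (extract, transform, load) are hypothetical; the point is only that a DAG is a set of tasks plus "must run before" edges, and that a valid run order can be derived from those edges.

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: each key is a task, each value is the
# set of tasks that must finish before that task can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields tasks in an order that respects every dependency.
# It raises graphlib.CycleError if the graph contains a cycle, which is
# why a workflow has to be acyclic to be schedulable.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

An orchestrator does much more than this (scheduling, retries, logging), but the ordering problem at its core looks like the one above.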

Amazon MWAA Provisioned vs. Serverless

1. Amazon MWAA Provisioned
This is essentially Apache Airflow deployed on AWS. AWS handles the compute, storage, and database, ensuring scalability, availability, and security. It feels exactly like the Airflow you’d deploy on your own server.

2. Amazon MWAA Serverless
This is the main highlight. With Serverless, you don’t set up infrastructure at all. You just drop your DAG into an S3 bucket and you're ready to go.

The Catch: MWAA Serverless only reads YAML files. If you have existing Python-based DAGs, you must convert them to YAML first. (Don't worry, AWS provides a library for this!) Alternatively, you can build DAGs using a drag-and-drop GUI within Amazon SageMaker.

How to use Amazon MWAA Serverless

There are two primary ways to create DAGs:

Method 1: Writing YAML DAGs manually
You can write your DAG in VS Code (or any editor) and upload it to S3.

Example YAML DAG:

simples3test:
  dag_id: simples3test
  tasks:
    list_objects:
      operator: airflow.providers.amazon.aws.operators.s3.S3ListOperator
      bucket: your-s3-bucket-name
      prefix: ""
      retries: 0
    create_object_list:
      dependencies:
        - list_objects
      operator: airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator
      data: Hello MWAA serverless
      s3_bucket: mwaa-serverless-test-cdlb
      s3_key: demo_text.txt
      replace: true
  schedule: 0 0 * * *

Once uploaded, you use the AWS CLI to register and execute the workflow:

# Create workflow
aws mwaa-serverless create-workflow \
--name simple_s3_test \
--definition-s3-location '{ "Bucket": "your-s3-bucket-name", "ObjectKey": "path/simple_s3_test.yaml" }' \
--role-arn arn:aws:iam::111122223333:role/mwaa-serverless-access-role \
--region us-east-1

# Execute the workflow
aws mwaa-serverless start-workflow-run \
--workflow-arn arn:aws:airflow-serverless:us-east-1:111122223333:workflow/simple_s3_test-abc1234def \
--region us-east-1

# Update workflow if you have made any changes.
aws mwaa-serverless update-workflow \
--workflow-arn arn:aws:airflow-serverless:us-east-1:111122223333:workflow/simple_s3_test-abc1234def \
--definition-s3-location '{ "Bucket": "your-s3-bucket-name", "ObjectKey": "path/simple_s3_test.yaml" }' \
--role-arn arn:aws:iam::111122223333:role/mwaa-serverless-access-role \
--region us-east-1
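If you script these calls, the JSON passed to --definition-s3-location is the part most likely to break from hand-written quoting. The sketch below assembles the create-workflow command from Python using only the flags shown above. The helper name build_create_workflow_cmd is my own, not part of any AWS tooling, and the sketch only prints the command rather than running it.

```python
import json
import shlex


def build_create_workflow_cmd(name, bucket, object_key, role_arn, region):
    """Return the argument list for `aws mwaa-serverless create-workflow`."""
    # json.dumps builds the JSON document for --definition-s3-location,
    # which avoids quoting mistakes in a hand-written string.
    location = json.dumps({"Bucket": bucket, "ObjectKey": object_key})
    return [
        "aws", "mwaa-serverless", "create-workflow",
        "--name", name,
        "--definition-s3-location", location,
        "--role-arn", role_arn,
        "--region", region,
    ]


cmd = build_create_workflow_cmd(
    name="simple_s3_test",
    bucket="your-s3-bucket-name",
    object_key="path/simple_s3_test.yaml",
    role_arn="arn:aws:iam::111122223333:role/mwaa-serverless-access-role",
    region="us-east-1",
)

# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
# To actually execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) means the JSON never goes through shell quoting when the command is run with subprocess.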

Method 2: AWS SageMaker Workflow (GUI)
You can create workflows visually. In the MWAA Serverless console, click "Create workflow," and it will redirect you to the SageMaker Canvas/Workflow interface. You can drag and drop tasks like "Save file to S3" or "Trigger Glue Job." It’s incredibly convenient.

Extra: Converting Python DAGs to YAML
If you want to move your existing dag.py files to MWAA Serverless, AWS provides a converter library:

# AWS Python to Yaml Dag Converter for MWAA Serverless
pip install python-to-yaml-dag-converter-mwaa-serverless

# To convert .py to .yaml
dag-converter convert <python-dag-file>

Same Name, Different Game?

While "Serverless" sounds like an upgrade, it is fundamentally different from the standard Airflow experience.

Key Differences & Limitations:
Limited Operators: MWAA Serverless focuses almost exclusively on AWS Operators (e.g., S3, Bedrock, Batch, Glue). Classic operators like PythonOperator or BashOperator are not supported. This is because AWS wants the heavy lifting (compute) to happen in other services, not within the serverless orchestrator itself.

Missing Parameters: Many standard Airflow parameters are unavailable. If you rely on things like email_on_failure, catchup, or on_success_callback, you’ll need to rethink your logic using other AWS services (like SNS for notifications).
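Before migrating, it can help to scan a DAG definition for the two kinds of problems listed above. The following is a rough sketch, not an official validator: it assumes the DAG has already been loaded into a Python dict shaped like the YAML example, it treats "AWS operator" as "import path starts with airflow.providers.amazon." (my assumption), and its parameter list contains only the three names mentioned in this article.

```python
# Assumption: "AWS operators" means classes under this import path.
ALLOWED_OPERATOR_PREFIX = "airflow.providers.amazon."
# Only the parameters named in this article; not an exhaustive list.
UNSUPPORTED_PARAMS = {"email_on_failure", "catchup", "on_success_callback"}


def find_issues(dag):
    """Return human-readable problems found in one DAG definition (a dict)."""
    issues = []
    # DAG-level parameters
    for param in sorted(UNSUPPORTED_PARAMS.intersection(dag)):
        issues.append(f"dag: parameter '{param}' is not available")
    # Task-level operator and parameters
    for task_id, task in dag.get("tasks", {}).items():
        operator = task.get("operator", "")
        if not operator.startswith(ALLOWED_OPERATOR_PREFIX):
            issues.append(f"{task_id}: operator '{operator}' is not an AWS operator")
        for param in sorted(UNSUPPORTED_PARAMS.intersection(task)):
            issues.append(f"{task_id}: parameter '{param}' is not available")
    return issues


# Hypothetical DAG mixing a supported task with an unsupported one.
dag = {
    "dag_id": "example",
    "catchup": False,
    "tasks": {
        "list_objects": {
            "operator": "airflow.providers.amazon.aws.operators.s3.S3ListOperator",
            "bucket": "your-s3-bucket-name",
        },
        "run_script": {
            "operator": "airflow.operators.bash.BashOperator",
            "bash_command": "echo hi",
            "email_on_failure": True,
        },
    },
}

for issue in find_issues(dag):
    print(issue)
```

A check like this only tells you what needs rethinking; the actual list of supported operators and parameters should come from the AWS documentation.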

Final Verdict

Go with MWAA Serverless if: You are building a "pure" AWS workflow, you want something easy to set up, and you prefer a GUI or simple YAML definitions.

Stay with MWAA Provisioned if: You are migrating a complex, existing Airflow environment. If your DAGs use dynamic tasks, DAG-trigger-DAG patterns, or custom Python logic within the operators, Serverless will be more of a headache than a help.

MWAA Serverless and Provisioned share a name and a concept, but they serve very different use cases. Choose the one that fits your architecture, not just the "Serverless" buzzword!

REF: https://docs.aws.amazon.com/mwaa/latest/mwaa-serverless-userguide/mwaas-concepts.html
🌐 Note: This article was translated from Thai with the help of AI to share these insights with the global community. You can find the original Thai version on my page, here: Clouddatalabor

Happy Coding!
Follow me for more insights on Cloud, Data, and AI at Clouddatalabor.

View vs. Materialized View | A Beginner’s Guide with AWS Athena & Redshift

Kriangsak Sumthong (Puk) — Mon, 24 Mar 2025 15:43:10 +0000

View vs. Materialized View: what are the differences between the two? Let's find out, then try creating them on Redshift and Athena.

What is a View Table?

Normally, when querying data from a database or warehouse, we retrieve it directly from the table, right? However, if we run the same query repeatedly, or reuse it elsewhere (in a backend, ETL, or ELT process), we have to paste the full SQL every time. Now, imagine that SQL is over 100 lines long. What happens? It's messy, and managing the script becomes difficult.

For example, the SQL in the picture is very long and complex. Copying it to different places, or writing more code on top of it, is difficult and error-prone: the syntax can easily get garbled along the way.

Therefore, something called a View table was created to reduce the complexity of querying. As a result, the SQL we saw above will be reduced to what's shown in the example below.

-- Example use case, which is not really a good use case LOL
SELECT * FROM company_table
WHERE member_id IN (SELECT employee_id FROM view_table);

-- If we are not using a View table
SELECT * FROM company_table
WHERE member_id IN (
    SELECT employee_id
    FROM abc AS a
    LEFT JOIN efg AS e
        ON a.id = e.id
    -- ... (rest of the long query omitted)
    WHERE a.id IS NOT NULL
);

It will be similar to querying a regular table, except that the data returned is the result of the query underlying that View table.

However! Modern data warehouses have both View Tables and Materialized View Tables. What are the differences between these two? The next section will explain each type of View.

View Table

A View table is a virtual representation of a table. In other words, when we create a View table, no physical data is stored in the database. Instead, each time we query it, the database executes the query that defines the View. Therefore, no matter how many View tables we create, the storage used for data does not increase. Let's see how to create a View table.

-- syntax ([ ] marks optional parts)
CREATE [ OR REPLACE ] VIEW name [ ( column_name [, ...] ) ] AS query
[ WITH NO SCHEMA BINDING ]

-- example use
CREATE VIEW vw_myevents
AS
SELECT id FROM mockl;

After creating a View table and refreshing, we will find that the View table is located in the View section of both Redshift and Athena. Both services use similar syntax, which is CREATE VIEW.

In Redshift, we can see the query underlying the View by right-clicking and selecting 'Show view definitions'. We can also edit it directly from there.

In Athena, you can view and edit the underlying query by right-clicking the View and selecting 'Show/edit Query'.
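The "virtual table" behavior is easy to verify locally. Here is a small sketch using SQLite through Python's built-in sqlite3 module as a stand-in for Redshift or Athena (an assumption for illustration; those engines differ in many details, but a plain view behaves the same way in this respect). The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mockl (id INTEGER, first_name TEXT)")
conn.execute("INSERT INTO mockl VALUES (1, 'Ann'), (2, 'Bob')")

# The view stores only the query text, not a copy of the rows.
conn.execute("CREATE VIEW vw_myevents AS SELECT id FROM mockl")
before = conn.execute("SELECT id FROM vw_myevents ORDER BY id").fetchall()

# A row added to the base table is visible through the view right away,
# because the view's query is re-executed on every read.
conn.execute("INSERT INTO mockl VALUES (3, 'Cho')")
after = conn.execute("SELECT id FROM vw_myevents ORDER BY id").fetchall()

# The catalog holds the view definition as SQL text.
definition = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'vw_myevents'"
).fetchone()[0]

print(before)
print(after)
print(definition)
```

The view never needed a refresh: the second SELECT simply ran the stored query again against the current contents of mockl.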

Materialized View

A Materialized View is similar to a regular View table, with one added feature: it takes the data resulting from the query and stores it as a physical table in the database.

Querying data is faster with Materialized View compared to a regular View table. For example, if we have a join between table A and table B, with a regular View table, every time we query the view, the system has to process the join between table A and table B again. However, with Materialized View, the join processing is already done because the joined data is pre-stored.

Let's try creating a Materialized View in Redshift. (Athena cannot create Materialized Views because it is a serverless query engine that does not store any data itself. The workaround is to create a table from the query and store it in S3.🤓)

-- syntax
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
[ AUTO REFRESH { YES | NO } ]
AS query 

-- example use
CREATE MATERIALIZED VIEW "public"."mvw_myevent_demo"
AUTO REFRESH YES
AS
SELECT
    id, first_name, last_name
FROM
    "public"."mockl";

After refreshing the Editor page, you'll find the Materialized View in the View section. As mentioned, a Materialized View stores its data in the database. Therefore, when you query SVV_TABLE_INFO (a system view that shows summary information for user-defined tables, including their size), you'll see the Materialized View listed with its data size, but not the regular View table.
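To see what "stored as a physical table" implies for freshness, here is a local sketch. SQLite has no materialized views, so this emulates one with CREATE TABLE ... AS SELECT, which is also roughly the Athena workaround mentioned earlier. The refresh_materialized helper is hypothetical; in Redshift the equivalent step is REFRESH MATERIALIZED VIEW or the AUTO REFRESH option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mockl (id INTEGER, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO mockl VALUES (1, 'Ann', 'Lee')")


def refresh_materialized(conn):
    """Rebuild the stored snapshot from the source table (emulated refresh)."""
    conn.execute("DROP TABLE IF EXISTS mvw_myevent_demo")
    conn.execute(
        "CREATE TABLE mvw_myevent_demo AS "
        "SELECT id, first_name, last_name FROM mockl"
    )


def snapshot_rows(conn):
    """Count the rows currently held in the snapshot table."""
    return conn.execute("SELECT COUNT(*) FROM mvw_myevent_demo").fetchone()[0]


refresh_materialized(conn)

# The source table changes, but the snapshot still holds the old result.
conn.execute("INSERT INTO mockl VALUES (2, 'Bob', 'Ray')")
stale = snapshot_rows(conn)

# Only after a refresh does the snapshot catch up.
refresh_materialized(conn)
fresh = snapshot_rows(conn)

print(stale, fresh)
```

Reads from the snapshot are cheap because the query result is already stored, and the price is that the data is only as fresh as the last refresh.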

The limitations of Materialized View are quite significant. Here are some examples of what cannot be used to create Materialized Views:

Standard views or system tables and views
Temporary tables
User-defined functions
ORDER BY, LIMIT, or OFFSET clauses

Pros and Cons of View and Materialized View

View Pros:

Simplifies queries, reducing lengthy queries to just one line.
Provides up-to-date data every time it's queried, because the View's query runs against the current contents of the underlying tables (e.g., with no WHERE clause filtering rows out, you always get the latest data).

View Cons:

Modifying a View means redefining its whole query, either with CREATE OR REPLACE VIEW or by deleting and recreating it.
Changes in the source tables of the View's query can break the View.
While it provides fresh data, it cannot be indexed or optimized directly; optimization must be done on the underlying tables.

Materialized View Pros:

Faster query performance because it retrieves data from the pre-stored Materialized View, not the source tables.
Reduces ETL/ELT steps and simplifies modifications compared to creating new tables via ETL/ELT.
Redshift offers auto-refresh, updating data from source tables, eliminating the need for separate data update pipelines.

Materialized View Cons:

Increases database storage consumption, similar to creating a new table.
Not suitable for real-time data with constant updates; better for batch processing (even with manual refresh).
Cannot be created on Athena.

Regarding the selection of which to use, I have these general guidelines:

View:

If you want to avoid increasing storage costs.
If you need to use it with real-time data.
If the query is simple.

Materialized View:

If you want to improve query performance for dashboard displays, especially when pre-processing is required.
To reduce processing overhead from complex queries, as Materialized View pre-processes and stores data as a physical table.

Hope this helps!

Originally published in Thai: https://clouddatalabor.com/2025/03/24/view-vs-materialized-view-a-beginners-guide-with-aws-athena-redshift/