<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Marko</title>
    <description>The latest articles on Forem by Marko (@markolekic).</description>
    <link>https://forem.com/markolekic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3620330%2F1122d202-e3f4-4f9b-9bff-2cf6fcb4603d.jpg</url>
      <title>Forem: Marko</title>
      <link>https://forem.com/markolekic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/markolekic"/>
    <language>en</language>
    <item>
      <title>Stop Re-running Everything: A Local Incremental Pipeline in DuckDB</title>
      <dc:creator>Marko</dc:creator>
      <pubDate>Sat, 10 Jan 2026 14:26:38 +0000</pubDate>
      <link>https://forem.com/markolekic/stop-re-running-everything-a-local-incremental-pipeline-in-duckdb-543p</link>
      <guid>https://forem.com/markolekic/stop-re-running-everything-a-local-incremental-pipeline-in-duckdb-543p</guid>
      <description>&lt;p&gt;I love local-first data work… until I catch myself doing the same thing for the 12th time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I changed one model. Better rerun the whole pipeline.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is a light walkthrough of a tiny project that fixes that habit using &lt;strong&gt;incremental models + cached DAG runs&lt;/strong&gt; — all on your laptop with &lt;strong&gt;DuckDB&lt;/strong&gt;. The example is a simplified, DuckDB-only version of the existing &lt;code&gt;incremental_demo&lt;/code&gt; project. &lt;/p&gt;

&lt;p&gt;We’ll do three runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;seed v1 → initial build&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;run again unchanged → mostly skipped&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;seed v2 (update + new row) → incremental merge/upsert&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it. No cloud. No ceremony. &lt;/p&gt;

&lt;h2&gt;
  
  
  The whole demo in one sentence
&lt;/h2&gt;

&lt;p&gt;We seed a tiny &lt;code&gt;raw.events&lt;/code&gt; table (from CSV), build a staging model, then build incremental facts that only process rows whose &lt;code&gt;updated_at&lt;/code&gt; is newer than what the target already holds, upserting by &lt;code&gt;event_id&lt;/code&gt;.&lt;/p&gt;
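
&lt;p&gt;Before diving in, here is that whole contract as a pure-Python sketch (an illustration of the semantics only, not FFT’s actual merge code):&lt;/p&gt;

```python
# Sketch of the incremental contract: keep only rows newer than the target's
# watermark, then upsert them by key. Timestamps in this fixed
# "YYYY-MM-DD HH:MM:SS" format compare correctly as strings.
def incremental_upsert(target, new_rows):
    watermark = max((r["updated_at"] for r in target), default="1970-01-01 00:00:00")
    delta = [r for r in new_rows if r["updated_at"] > watermark]
    by_key = {r["event_id"]: r for r in target}
    for r in delta:
        by_key[r["event_id"]] = r  # replace on match, insert otherwise
    return sorted(by_key.values(), key=lambda r: r["event_id"])
```

&lt;p&gt;Everything below is this idea, expressed as SQL and Python models.&lt;/p&gt;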

&lt;h2&gt;
  
  
  What’s in the mini project
&lt;/h2&gt;

&lt;p&gt;There are three key pieces:&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Two seed snapshots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;v1&lt;/strong&gt; has 3 rows. &lt;br&gt;
&lt;strong&gt;v2&lt;/strong&gt; changes &lt;code&gt;event_id=2&lt;/code&gt; (newer &lt;code&gt;updated_at&lt;/code&gt;, different &lt;code&gt;value&lt;/code&gt;) and adds &lt;code&gt;event_id=4&lt;/code&gt;. &lt;/p&gt;
&lt;h3&gt;
  
  
  2) A source mapping
&lt;/h3&gt;

&lt;p&gt;The project defines a source &lt;code&gt;raw.events&lt;/code&gt; pointing at a seeded table called &lt;code&gt;seed_events&lt;/code&gt;. &lt;/p&gt;
&lt;h3&gt;
  
  
  3) A few models (SQL + Python)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;events_base&lt;/code&gt; (staging table)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fct_events_sql_inline&lt;/code&gt; (incremental SQL, config inline)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fct_events_sql_yaml&lt;/code&gt; (incremental SQL, config in &lt;code&gt;project.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fct_events_py_incremental&lt;/code&gt; (incremental Python model for DuckDB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these exist in the exported demo. &lt;/p&gt;
&lt;h2&gt;
  
  
  DuckDB-only setup
&lt;/h2&gt;

&lt;p&gt;The demo’s DuckDB profile is simple: it writes to a local DuckDB file. &lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;profiles.yml&lt;/code&gt; (DuckDB profile)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dev_duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;
  &lt;span class="na"&gt;duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;env('FF_DUCKDB_PATH',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'.local/incremental_demo.duckdb')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;.env.dev_duckdb&lt;/code&gt; (optional convenience)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;FF_DUCKDB_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.local/incremental_demo.duckdb
&lt;span class="nv"&gt;FF_DUCKDB_SCHEMA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;inc_demo_schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
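
&lt;p&gt;The &lt;code&gt;env()&lt;/code&gt; call in &lt;code&gt;profiles.yml&lt;/code&gt; behaves like an environment lookup with a default. Roughly (a sketch of the behavior, not FFT’s templating code):&lt;/p&gt;

```python
import os

# What env('FF_DUCKDB_PATH', '.local/incremental_demo.duckdb') resolves to:
# the environment variable if set, otherwise the fallback path.
os.environ.pop("FF_DUCKDB_PATH", None)  # pretend it is unset
path = os.environ.get("FF_DUCKDB_PATH", ".local/incremental_demo.duckdb")
```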

&lt;h2&gt;
  
  
  The models (the fun part)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Staging: &lt;code&gt;events_base&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is intentionally boring: cast timestamps, keep columns tidy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Incremental SQL (inline config): &lt;code&gt;fct_events_sql_inline&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This model declares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;materialized='incremental'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_key='event_id'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;watermark column: &lt;code&gt;updated_at&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And on incremental runs it only selects rows newer than the max &lt;code&gt;updated_at&lt;/code&gt; already in the target.&lt;/p&gt;

&lt;p&gt;This assumes &lt;code&gt;updated_at&lt;/code&gt; increases whenever a row changes (it’s a demo; real pipelines may need late-arrival handling).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'event_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;incremental&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;'updated_at_column'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'updated_at'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events_base.ff'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="s1"&gt;'1970-01-01 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
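
&lt;p&gt;To make the watermark predicate concrete: assuming the target already holds seed v1 (so its &lt;code&gt;max(updated_at)&lt;/code&gt; is 2024-01-03), here is which v2 rows the &lt;code&gt;where&lt;/code&gt; clause lets through:&lt;/p&gt;

```python
# v2 seed rows as (event_id, updated_at, value); timestamps in this
# fixed format compare correctly as strings.
v2 = [
    (1, "2024-01-01 00:00:00", 10),
    (2, "2024-01-05 00:00:00", 999),
    (3, "2024-01-03 00:00:00", 30),
    (4, "2024-01-06 00:00:00", 40),
]
watermark = "2024-01-03 00:00:00"  # max(updated_at) already in the target

# The is_incremental() branch: strictly newer rows only.
delta = [row for row in v2 if row[1] > watermark]
```

&lt;p&gt;Only &lt;code&gt;event_id=2&lt;/code&gt; and &lt;code&gt;event_id=4&lt;/code&gt; survive; rows 1 and 3 are never rescanned.&lt;/p&gt;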



&lt;h3&gt;
  
  
  Incremental SQL (YAML-config style): &lt;code&gt;fct_events_sql_yaml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Same idea, but the SQL file stays “clean”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events_base.ff'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…and the incremental knobs live in &lt;code&gt;project.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;incremental&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;fct_events_sql_yaml.ff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;unique_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id"&lt;/span&gt;
      &lt;span class="na"&gt;incremental&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;updated_at_column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick whichever style suits you better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incremental Python (DuckDB): &lt;code&gt;fct_events_py_incremental&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This one just adds &lt;code&gt;value_x10&lt;/code&gt; in pandas and returns a delta frame. The incremental (merge/upsert) behavior is configured for this model in &lt;code&gt;project.yml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastflowtransform&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;engine_model&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="nd"&gt;@engine_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fct_events_py_incremental&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events_base.ff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events_df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value_x10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value_x10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
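
&lt;p&gt;If you’re curious what “merge the returned delta frame” means in pandas terms, a rough equivalent is concat-then-dedupe on the unique key (an illustration only; the actual merge happens in DuckDB):&lt;/p&gt;

```python
import pandas as pd

# Target after the v1 build, and the delta the model would return for v2.
existing = pd.DataFrame({"event_id": [1, 2, 3], "value": [10, 20, 30],
                         "value_x10": [100, 200, 300]})
delta = pd.DataFrame({"event_id": [2, 4], "value": [999, 40],
                      "value_x10": [9990, 400]})

# Upsert by unique key: on duplicate event_id, the delta row wins.
merged = (
    pd.concat([existing, delta])
    .drop_duplicates(subset="event_id", keep="last")
    .sort_values("event_id")
    .reset_index(drop=True)
)
```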



&lt;h2&gt;
  
  
  The three-run walkthrough
&lt;/h2&gt;

&lt;p&gt;We’ll follow the demo’s exact “story arc”: first build, no-op build, then seed changes triggering incremental updates. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 0: pick a local seeds folder
&lt;/h3&gt;

&lt;p&gt;The Makefile uses a local seeds dir and swaps &lt;code&gt;seed_events.csv&lt;/code&gt; between v1 and v2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; .local/seeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A tiny dataset that still proves “incremental”
&lt;/h3&gt;

&lt;p&gt;This demo uses two versions of the same seed file. &lt;strong&gt;v2 updates one existing row and adds one new row&lt;/strong&gt; — so you can watch an incremental model do both an &lt;strong&gt;upsert&lt;/strong&gt; and an &lt;strong&gt;insert&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;seeds/seed_events_v1.csv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;event_id,updated_at,value
1,2024-01-01 00:00:00,10
2,2024-01-02 00:00:00,20
3,2024-01-03 00:00:00,30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;seeds/seed_events_v2.csv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;event_id,updated_at,value
1,2024-01-01 00:00:00,10
2,2024-01-05 00:00:00,999
3,2024-01-03 00:00:00,30
4,2024-01-06 00:00:00,40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you switch from v1 → v2 and run again, you should end up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;event_id=2&lt;/code&gt; updated (newer &lt;code&gt;updated_at&lt;/code&gt;, &lt;code&gt;value=999&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;event_id=4&lt;/code&gt; inserted (brand new row)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1) First run (seed v1 → initial build)
&lt;/h3&gt;

&lt;p&gt;Copy v1 into place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp &lt;/span&gt;seeds/seed_events_v1.csv .local/seeds/seed_events.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seed + run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;FFT_SEEDS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.local/seeds fft seed &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb &lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you should expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;events_base&lt;/code&gt; becomes a normal table&lt;/li&gt;
&lt;li&gt;incremental models create their target tables (the first run is effectively a full build)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) No-op run (same seed v1; should be mostly skipped)
&lt;/h3&gt;

&lt;p&gt;Run again without changing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb &lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo literally calls this the “no-op run… should be mostly skipped,” which is the best feeling in local data dev.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Change the seed (v2 snapshot) and run incremental
&lt;/h3&gt;

&lt;p&gt;Now swap to v2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp &lt;/span&gt;seeds/seed_events_v2.csv .local/seeds/seed_events.csv
&lt;span class="nv"&gt;FFT_SEEDS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.local/seeds fft seed &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb &lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the punchline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;event_id=2&lt;/code&gt; comes in with a &lt;strong&gt;newer&lt;/strong&gt; &lt;code&gt;updated_at&lt;/code&gt; and &lt;code&gt;value=999&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;event_id=4&lt;/code&gt; shows up for the first time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So your incremental facts should &lt;strong&gt;update&lt;/strong&gt; the row for &lt;code&gt;event_id=2&lt;/code&gt; and &lt;strong&gt;insert&lt;/strong&gt; &lt;code&gt;event_id=4&lt;/code&gt;, based on &lt;code&gt;unique_key=event_id&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sanity check in DuckDB
&lt;/h2&gt;

&lt;p&gt;Query the incremental SQL table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;duckdb .local/incremental_demo.duckdb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"select * from inc_demo_schema.fct_events_sql_inline order by event_id;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After v2, you should see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;event_id=2&lt;/code&gt; with &lt;code&gt;updated_at = 2024-01-05&lt;/code&gt; and &lt;code&gt;value = 999&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a new row for &lt;code&gt;event_id=4&lt;/code&gt; with &lt;code&gt;updated_at = 2024-01-06&lt;/code&gt; and &lt;code&gt;value = 40&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you query the Python table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;duckdb .local/incremental_demo.duckdb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"select * from inc_demo_schema.fct_events_py_incremental order by event_id;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should also see &lt;code&gt;value_x10&lt;/code&gt; (e.g., &lt;code&gt;9990&lt;/code&gt; for the updated row).&lt;/p&gt;

&lt;h2&gt;
  
  
  Make the DAG visible
&lt;/h2&gt;

&lt;p&gt;You can see the DAG in the docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft docs serve &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb &lt;span class="nt"&gt;--open&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhwk0xaty6vau5rh3txx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhwk0xaty6vau5rh3txx.png" alt="DAG from local docs server"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional tiny “quality check”
&lt;/h2&gt;

&lt;p&gt;The demo includes simple not-null tests for the incremental outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb &lt;span class="nt"&gt;--select&lt;/span&gt; tag:incremental
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What you just bought yourself
&lt;/h2&gt;

&lt;p&gt;With this setup, your local dev loop becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run once&lt;/strong&gt; to build everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run again&lt;/strong&gt; and skip most work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change input data&lt;/strong&gt; and update only what’s necessary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update existing rows&lt;/strong&gt; safely (via &lt;code&gt;unique_key&lt;/code&gt;) instead of “append and pray”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And all of it works with a single local DuckDB file, which makes experimenting feel cheap again.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>duckdb</category>
    </item>
    <item>
      <title>Stop Waiting for the Cloud: Building a Hybrid SQL+Python Data Pipeline Locally with DuckDB</title>
      <dc:creator>Marko</dc:creator>
      <pubDate>Fri, 28 Nov 2025 16:59:05 +0000</pubDate>
      <link>https://forem.com/markolekic/stop-waiting-for-the-cloud-building-a-hybrid-sqlpython-data-pipeline-locally-with-duckdb-438j</link>
      <guid>https://forem.com/markolekic/stop-waiting-for-the-cloud-building-a-hybrid-sqlpython-data-pipeline-locally-with-duckdb-438j</guid>
      <description>&lt;p&gt;&lt;strong&gt;Cloud data warehouses are amazing for production. They are terrible for development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re a Data Engineer, you know the pain of the “cloud feedback loop”:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You write a complex SQL query.&lt;/li&gt;
&lt;li&gt;You hit “Run” in your orchestrator.&lt;/li&gt;
&lt;li&gt;You wait 45 seconds for the warehouse to spin up or queue your job.&lt;/li&gt;
&lt;li&gt;It fails because of a syntax error.&lt;/li&gt;
&lt;li&gt;You fix it. You pay for the query slot. You wait again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This latency kills flow.&lt;/p&gt;

&lt;p&gt;In software engineering, we run code locally on our laptops before shipping to production. Why can’t we do the same for data pipelines?&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;FastFlowTransform (FFT)&lt;/strong&gt; to solve this. It’s a framework that lets you build and test your pipeline locally using &lt;strong&gt;DuckDB&lt;/strong&gt; (for speed and free compute), and then deploy the &lt;strong&gt;same project&lt;/strong&gt; to &lt;strong&gt;Snowflake, BigQuery, or Databricks&lt;/strong&gt; for production.&lt;/p&gt;

&lt;p&gt;In this post we’ll build a tiny “Users” pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs locally on DuckDB in &lt;strong&gt;&amp;lt; 1s&lt;/strong&gt;, and&lt;/li&gt;
&lt;li&gt;can be deployed to BigQuery by changing a &lt;strong&gt;single CLI flag (--env)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Initialize a local-first project (no cloud creds)
&lt;/h2&gt;

&lt;p&gt;You don’t need AWS keys or a Snowflake login. Just a laptop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastflowtransform

fft init building_locally_demo &lt;span class="nt"&gt;--engine&lt;/span&gt; duckdb
&lt;span class="nb"&gt;cd &lt;/span&gt;building_locally_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a minimal FFT project with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;models/&lt;/code&gt; – SQL/Python models&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;seeds/&lt;/code&gt; – CSV/Parquet seeds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;profiles.yml&lt;/code&gt; – connection profiles, including a local DuckDB one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use DuckDB as our “dev warehouse”.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Add some data (seeds + sources)
&lt;/h2&gt;

&lt;p&gt;We start with a simple &lt;code&gt;users&lt;/code&gt; CSV.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;seeds/seed_users.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id,email,signup_date
1,alice@example.com,2023-01-01
2,bob@example.com,2023-01-02
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we tell FFT how to reference this seed as a &lt;strong&gt;source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sources.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
        &lt;span class="na"&gt;identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;seed_users&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;{{ source('raw', 'users') }}&lt;/code&gt; will resolve to the table created from &lt;code&gt;seed_users.csv&lt;/code&gt;.&lt;/p&gt;
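
&lt;p&gt;Conceptually, &lt;code&gt;source()&lt;/code&gt; is a lookup from a logical name to a physical identifier. A toy version (illustrative only; FFT’s real resolver also handles per-engine schemas and quoting):&lt;/p&gt;

```python
# Toy resolver for the sources.yml above: logical ('raw', 'users')
# maps to the physical table staging.seed_users.
SOURCES = {"raw": {"schema": "staging", "tables": {"users": "seed_users"}}}

def source(source_name, table_name):
    src = SOURCES[source_name]
    return src["schema"] + "." + src["tables"][table_name]
```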

&lt;p&gt;Load the seed into DuckDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft seed &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyly7wyu71twg50ky9ceq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyly7wyu71twg50ky9ceq.png" alt="Local DuckDB - Seed" width="800" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a DuckDB file at &lt;code&gt;.local/dev.duckdb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a table like &lt;code&gt;staging.seed_users&lt;/code&gt; available in your local engine&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Write a transformation in SQL
&lt;/h2&gt;

&lt;p&gt;FFT uses standard SQL with Jinja templating (similar to dbt). It takes care of engine differences for you.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;models/staging/stg_users.ff.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="c1"&gt;-- Standardizing email casing&lt;/span&gt;
    &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="c1"&gt;-- Casting types explicitly&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just… SQL. No FFT-specific magic beyond &lt;code&gt;config()&lt;/code&gt; and &lt;code&gt;source()&lt;/code&gt;.&lt;/p&gt;
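&lt;p&gt;For DuckDB, the rendered SQL is roughly the sketch below: &lt;code&gt;config()&lt;/code&gt; drops out of the compiled text, and &lt;code&gt;source('raw', 'users')&lt;/code&gt; resolves via &lt;code&gt;sources.yml&lt;/code&gt; to &lt;code&gt;staging.seed_users&lt;/code&gt;. (Illustrative only; FFT’s exact DDL may differ.)&lt;/p&gt;

```sql
-- Approximate compiled form of stg_users on DuckDB (illustrative)
create or replace table staging.stg_users as
select
    id,
    lower(email) as email,
    cast(signup_date as date) as signup_date
from staging.seed_users;
```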




&lt;h2&gt;
  
  
  4. Run the pipeline locally (the “fast loop”)
&lt;/h2&gt;

&lt;p&gt;Now run the DAG on your machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm73hocx1n1wrl2z7aduf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm73hocx1n1wrl2z7aduf.png" alt="Local DuckDB - Run" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On my laptop, I see something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time:&lt;/strong&gt; &lt;code&gt;18 ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; &lt;code&gt;$0.00&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; my CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can iterate on this loop hundreds of times per hour.&lt;/p&gt;

&lt;p&gt;Add data quality checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte4hsz8q9pfed8cmrms1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte4hsz8q9pfed8cmrms1.png" alt="Local DuckDB - Test" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or even model-level unit tests (no real DB needed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft utest &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dyrwosfnmkwgq75zalr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dyrwosfnmkwgq75zalr.png" alt="Local DuckDB - uTest" width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Local-first DX:&lt;/strong&gt; fast feedback, offline-friendly, and you only touch the cloud once your logic is solid.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Point the same project at BigQuery (the “flow loop”)
&lt;/h2&gt;

&lt;p&gt;Once you’re happy with the logic, it’s time to push to production.&lt;/p&gt;

&lt;p&gt;In other frameworks you might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;change connection strings,&lt;/li&gt;
&lt;li&gt;update environment variables manually,&lt;/li&gt;
&lt;li&gt;maybe even rewrite SQL if you used engine-specific functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In FFT, you &lt;strong&gt;just switch the profile&lt;/strong&gt; (just remember to set up your BigQuery credentials first).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;profiles.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# My Local Playground&lt;/span&gt;
&lt;span class="na"&gt;dev_duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;
  &lt;span class="na"&gt;duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.local/dev.duckdb"&lt;/span&gt;

&lt;span class="c1"&gt;# My Local utest Overrides&lt;/span&gt;
&lt;span class="na"&gt;dev_duckdb_utest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;
  &lt;span class="na"&gt;duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:"&lt;/span&gt;

&lt;span class="c1"&gt;# My Production Environment&lt;/span&gt;
&lt;span class="na"&gt;prod_bigquery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bigquery&lt;/span&gt;
  &lt;span class="na"&gt;bigquery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fft-basic-demo"&lt;/span&gt;
    &lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production_marts"&lt;/span&gt;
    &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU"&lt;/span&gt;
    &lt;span class="c1"&gt;# Use the pandas backend here; FFT can also use BigFrames if you set this to true.&lt;/span&gt;
    &lt;span class="na"&gt;use_bigframes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;allow_create_dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# My Production utest Overrides&lt;/span&gt;
&lt;span class="na"&gt;prod_bigquery_utest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bigquery&lt;/span&gt;
  &lt;span class="na"&gt;bigquery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production_marts_utest"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run the &lt;strong&gt;same project&lt;/strong&gt;, different environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# exactly the same command, different --env&lt;/span&gt;
fft seed &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; prod_bigquery
fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; prod_bigquery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskz76sxm5t8nrmkd80zw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskz76sxm5t8nrmkd80zw.png" alt="Remote BigQuery - Seed" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7vmehuf9ijyced2otl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7vmehuf9ijyced2otl.png" alt="Remote BigQuery - Run" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FFT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;builds the same DAG,&lt;/li&gt;
&lt;li&gt;compiles the same SQL models,&lt;/li&gt;
&lt;li&gt;authenticates with Google Cloud using your local creds,&lt;/li&gt;
&lt;li&gt;executes the transformations on BigQuery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We didn’t touch &lt;code&gt;stg_users.ff.sql&lt;/code&gt;. Only &lt;code&gt;--env&lt;/code&gt; changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Why this is a big deal for DX
&lt;/h2&gt;

&lt;p&gt;This isn’t just about saving money (though you do get that for free). It’s about &lt;strong&gt;Developer Experience&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Work offline.&lt;/strong&gt; Build complex DAGs on a train or in airplane mode with DuckDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit test your models.&lt;/strong&gt; Use &lt;code&gt;fft utest&lt;/code&gt; with tiny fixtures to validate logic before hitting any real warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid SQL+Python.&lt;/strong&gt; FFT supports &lt;strong&gt;Python models&lt;/strong&gt; alongside SQL. Use SQL for aggregations and joins, Python for custom logic or ML.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Python model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/marts/mart_latest_signup.ff.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastflowtransform&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;engine_model&lt;/span&gt;


&lt;span class="nd"&gt;@engine_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# Register this model for both DuckDB (local) and BigQuery (pandas backend)
&lt;/span&gt;    &lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mart_latest_signup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stg_users.ff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# SQL model from earlier in the article
&lt;/span&gt;    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope:mart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine:duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine:bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Columns produced by stg_users.ff.sql:
&lt;/span&gt;        &lt;span class="c1"&gt;#   id, email, signup_date
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stg_users.ff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signup_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stg_users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the latest signup per email domain using pandas.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Derive an email_domain column in Python
&lt;/span&gt;    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stg_users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signup_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# keep the newest per domain
&lt;/span&gt;        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signup_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest_user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signup_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest_signup_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this locally (DuckDB + pandas), then later switch to Spark or BigQuery with the same decorator.&lt;/p&gt;
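&lt;p&gt;Because the model body is plain pandas, you can sanity-check the logic with an in-memory frame before any engine is involved. A minimal sketch (the rows are made up; only the three columns from &lt;code&gt;stg_users&lt;/code&gt; are assumed):&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in for the stg_users table (illustrative rows)
stg_users = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["alice@example.com", "bob@example.com", "carol@example.org"],
    "signup_date": pd.to_datetime(["2025-11-02", "2025-11-05", "2025-11-07"]),
})

# Same core steps as the model: derive the domain, keep the newest signup per domain
users = stg_users.copy()
users["email_domain"] = users["email"].str.split("@").str[-1]
latest = (
    users.sort_values("signup_date", ascending=False)
    .drop_duplicates("email_domain")  # keeps the first (newest) row per domain
    .reset_index(drop=True)
)

print(latest[["email_domain", "id"]])
```

&lt;p&gt;Here &lt;code&gt;bob@example.com&lt;/code&gt; wins for &lt;code&gt;example.com&lt;/code&gt; because his signup date is the latest for that domain.&lt;/p&gt;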




&lt;h2&gt;
  
  
  7. Try it yourself
&lt;/h2&gt;

&lt;p&gt;Stop treating your laptop like a thin client. Your machine is powerful enough to build data pipelines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastflowtransform

fft init building_locally_demo &lt;span class="nt"&gt;--engine&lt;/span&gt; duckdb
&lt;span class="nb"&gt;cd &lt;/span&gt;building_locally_demo

&lt;span class="c"&gt;# add the users seed + sources.yml from this article,&lt;/span&gt;
&lt;span class="c"&gt;# then:&lt;/span&gt;
fft seed &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--env&lt;/span&gt; dev_duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/FFTLabs/FastFlowTransform" rel="noopener noreferrer"&gt;https://github.com/FFTLabs/FastFlowTransform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you try this, I’d love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What your workflow looks like&lt;/li&gt;
&lt;li&gt;Which warehouse you’re deploying to (BigQuery/Snowflake/Databricks/etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m thinking about a follow-up post on &lt;strong&gt;incremental models&lt;/strong&gt; or &lt;strong&gt;data-quality tests&lt;/strong&gt; with FFT. If that sounds interesting, tell me in the comments.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>tooling</category>
    </item>
    <item>
      <title>The Offline Data Engineer: Building Resilient API Pipelines that Work on an Airplane</title>
      <dc:creator>Marko</dc:creator>
      <pubDate>Fri, 21 Nov 2025 17:00:20 +0000</pubDate>
      <link>https://forem.com/markolekic/the-offline-data-engineer-building-resilient-api-pipelines-that-work-on-an-airplane-3pgf</link>
      <guid>https://forem.com/markolekic/the-offline-data-engineer-building-resilient-api-pipelines-that-work-on-an-airplane-3pgf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Development loops for API integrations are usually painful.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’ve all been there: You are building a data pipeline to ingest data from a third-party API (Salesforce, Stripe, or an internal microservice). You write your Python script, hit &lt;code&gt;run&lt;/code&gt;, and wait.&lt;/p&gt;

&lt;p&gt;It works. You change a column name in your transformation logic. You hit &lt;code&gt;run&lt;/code&gt; again. You wait again.&lt;/p&gt;

&lt;p&gt;Suddenly, you hit a rate limit. Or the API throws a 503 error. Or, worse, you are on a train or an airplane with spotty WiFi, and you can’t run your code at all because it depends on a live internet connection.&lt;/p&gt;

&lt;p&gt;In the world of SQL, we solved this with local databases (DuckDB) and seeds. But in the world of Python API ingestion, we are often still writing fragile &lt;code&gt;requests.get()&lt;/code&gt; loops that break the moment the internet does.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;FastFlowTransform (FFT)&lt;/strong&gt; to solve this. It’s a hybrid SQL+Python framework that treats HTTP responses like immutable artifacts, allowing you to build, test, and debug API pipelines completely offline.&lt;/p&gt;

&lt;p&gt;Here is how to build a pipeline that is "Airplane Mode" ready, using a real API example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The "Request Loop" Antipattern
&lt;/h2&gt;

&lt;p&gt;Typically, a Python extraction script looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Fragile: Depending on a live connection for every test run
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jsonplaceholder.typicode.com/todos?_page=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;_limit=20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Stop if empty
&lt;/span&gt;        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# ... complex logic to clean and parse JSON ...
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... save to DB ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The issues with this approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Coupled Extraction &amp;amp; Logic:&lt;/strong&gt; If you mess up the parsing logic, you have to re-fetch &lt;em&gt;everything&lt;/em&gt; to test the fix.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;No State:&lt;/strong&gt; If the script crashes on page 99 of 100, you restart from page 1.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Online Only:&lt;/strong&gt; You cannot run this without a live connection.&lt;/li&gt;
&lt;/ol&gt;
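&lt;p&gt;The fix is conceptually simple: persist every raw response to disk keyed by the request, and let the transformation read from that cache. The next section shows how FFT packages this; the snippet below is just the bare principle in plain Python (all names are made up for illustration, and this is not FFT’s implementation):&lt;/p&gt;

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Throwaway cache dir for this demo; a real tool would use a stable project path
CACHE_DIR = Path(tempfile.mkdtemp()) / "http"

def cached_fetch(url: str, params: dict, fetcher) -> list:
    """Return the response for (url, params), hitting the network only on a cache miss."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(json.dumps([url, params], sort_keys=True).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():  # offline-friendly: replay the stored artifact
        return json.loads(path.read_text())
    data = fetcher(url, params)  # the only line that touches the network
    path.write_text(json.dumps(data))
    return data

# A fake fetcher stands in for requests.get(...).json() so this runs offline
calls = []
def fake_fetcher(url, params):
    calls.append(params)
    return [{"id": params["_page"], "title": "todo"}]

first = cached_fetch("https://example.com/todos", {"_page": 1}, fake_fetcher)
again = cached_fetch("https://example.com/todos", {"_page": 1}, fake_fetcher)
# The second call never reaches the fetcher: one network call, two usable results
```

&lt;p&gt;With extraction and transformation decoupled like this, fixing a parsing bug means re-running only the compute step against cached artifacts, which also solves the restart-from-page-1 problem.&lt;/p&gt;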

&lt;h2&gt;
  
  
  The Solution: FFT's Cached HTTP Module
&lt;/h2&gt;

&lt;p&gt;FastFlowTransform introduces a dedicated module &lt;code&gt;fastflowtransform.api.http&lt;/code&gt;. It separates the &lt;strong&gt;fetch&lt;/strong&gt; (IO) from the &lt;strong&gt;transformation&lt;/strong&gt; (Compute) using a file-backed cache.&lt;/p&gt;

&lt;p&gt;Let's build a model that pulls "To-Do" items from JSONPlaceholder.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Setup
&lt;/h3&gt;

&lt;p&gt;First, we initialize a project. We'll use &lt;strong&gt;DuckDB&lt;/strong&gt; for local development so we don't need a cloud warehouse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastflowtransform
fft init my_api_project &lt;span class="nt"&gt;--engine&lt;/span&gt; duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Model with Pagination
&lt;/h3&gt;

&lt;p&gt;In FFT, Python models are first-class citizens. We need to define how to talk to the API.&lt;/p&gt;

&lt;p&gt;Since JSONPlaceholder paginates via query parameters (&lt;code&gt;_page&lt;/code&gt; and &lt;code&gt;_limit&lt;/code&gt;), we write a small paginator function that decides when to stop (an empty response) and how to request the next page.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;models/todos_ingest.ff.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastflowtransform&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastflowtransform.api.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_df&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the Paginator
# This function runs after every request to determine what to do next.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;offset_paginator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_json&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# If the API returns an empty list, we are done.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response_json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# Otherwise, increment the page number
&lt;/span&gt;    &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;next_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;next_params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_params&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;todos_ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_todos&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. get_df handles the HTTP calls, caching, and conversion
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jsonplaceholder.typicode.com/todos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Start at page 1
&lt;/span&gt;        &lt;span class="n"&gt;paginator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;offset_paginator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# record_path is None because the root of the JSON is the list itself
&lt;/span&gt;        &lt;span class="n"&gt;record_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt; 
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Apply transformation logic
&lt;/span&gt;    &lt;span class="c1"&gt;# If we change THIS logic later, FFT won't re-fetch the API!
&lt;/span&gt;
    &lt;span class="c1"&gt;# Example: Mark high-priority items locally
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delectus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NORMAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
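&lt;p&gt;Before running it, the paginator contract is worth sanity-checking by hand: it receives the last URL, params, and decoded JSON body, and returns either the next request or &lt;code&gt;None&lt;/code&gt; to stop. The same function (copied here so the snippet is self-contained) can be exercised with fake responses:&lt;/p&gt;

```python
def offset_paginator(url, params, response_json):
    # Same logic as the paginator in the model above.
    if not response_json:
        return None
    current_page = params.get("_page", 1)
    if current_page >= 2:
        return None
    next_params = dict(params or {})
    next_params["_page"] = current_page + 1
    return {"next_request": {"params": next_params}}

# Empty body: stop paginating entirely.
print(offset_paginator("https://example.com/todos", {"_page": 1}, []))
# → None

# Non-empty body on page 1: ask for page 2.
print(offset_paginator("https://example.com/todos", {"_page": 1}, [{"id": 1}]))
# → {'next_request': {'params': {'_page': 2}}}

# Page cap reached: stop even though the body is non-empty.
print(offset_paginator("https://example.com/todos", {"_page": 2}, [{"id": 2}]))
# → None
```

&lt;p&gt;No HTTP involved: because the paginator is a pure function of &lt;code&gt;(url, params, response_json)&lt;/code&gt;, you can unit-test the stopping conditions without touching the network.&lt;/p&gt;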



&lt;h3&gt;
  
  
  3. The First Run (Online)
&lt;/h3&gt;

&lt;p&gt;When we run this for the first time, FFT hits the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--select&lt;/span&gt; todos_ingest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fistsls4w8xvdda9yl3pi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fistsls4w8xvdda9yl3pi.png" alt="Run logs after requesting the API" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens under the hood:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; FFT calculates a fingerprint for the model.&lt;/li&gt;
&lt;li&gt; It executes the requests: &lt;code&gt;_page=1&lt;/code&gt;, then &lt;code&gt;_page=2&lt;/code&gt;, stopping when the API returns &lt;code&gt;[]&lt;/code&gt; or when the paginator's hard-coded page cap (two pages, in this example) is reached.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Crucially:&lt;/strong&gt; It saves the raw JSON to a local cache directory (&lt;code&gt;.fastflowtransform/http_cache&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; It transforms the data and materializes a table in DuckDB.&lt;/li&gt;
&lt;/ol&gt;
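&lt;p&gt;The article doesn't show FFT's internal cache layout, but the idea behind step 3 can be sketched as a content-addressed lookup: derive a stable key from the request, and reuse the stored body whenever that key already exists on disk. Everything below (function names, file layout, the default directory) is an illustrative assumption, not FFT's actual code:&lt;/p&gt;

```python
import hashlib
import json
from pathlib import Path

def cache_key(url: str, params: dict) -> str:
    # Deterministic key: the same URL + params always hash to the same name.
    payload = json.dumps({"url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_fetch(url, params, fetch, cache_dir=Path(".fastflowtransform/http_cache")):
    # Serve from the cache directory when possible; otherwise fetch and store.
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(url, params)}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(url, params)  # e.g. requests.get(url, params=params).json()
    path.write_text(json.dumps(data))
    return data
```

&lt;p&gt;On a second call with identical arguments, &lt;code&gt;fetch&lt;/code&gt; is never invoked — which is exactly the property the offline run in the next section relies on.&lt;/p&gt;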

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r71224f1nxj0nigjo5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r71224f1nxj0nigjo5k.png" alt="Preview of the resulting table - online run" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t13rfki0tw446qficr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t13rfki0tw446qficr6.png" alt="Count of the resulting table - online run" width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Second Run (Offline / "Airplane Mode")
&lt;/h3&gt;

&lt;p&gt;Now, imagine you are on a plane. You realize you made a mistake: you want to filter out any tasks that are already &lt;code&gt;completed&lt;/code&gt; (and, while you're at it, uppercase the titles).&lt;/p&gt;

&lt;p&gt;You update the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# ... inside the model function ...
&lt;/span&gt;
    &lt;span class="c1"&gt;# New Logic: Filter rows
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# New Logic: Uppercase titles
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
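&lt;p&gt;On a toy frame (the rows below are made up, mirroring the todos schema), those two lines behave like this:&lt;/p&gt;

```python
import pandas as pd

# Minimal stand-in for the cached todos payload.
df = pd.DataFrame([
    {"id": 1, "title": "delectus aut autem", "completed": False},
    {"id": 2, "title": "quis ut nam facilis", "completed": True},
])

df = df[df["completed"] == False].copy()  # drop finished tasks
df["title"] = df["title"].str.upper()     # uppercase the remaining titles

print(df["title"].tolist())  # → ['DELECTUS AUT AUTEM']
```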



&lt;p&gt;You don't have internet. But you don't need it. Run with the &lt;code&gt;--offline&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fft run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--select&lt;/span&gt; todos_ingest &lt;span class="nt"&gt;--offline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86mq0kdokrvarylegigr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86mq0kdokrvarylegigr.png" alt="Run logs for cached run" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FFT sees the &lt;code&gt;--offline&lt;/code&gt; flag.&lt;/li&gt;
&lt;li&gt;It checks the cache. It finds the JSON from the previous run.&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;skips&lt;/strong&gt; the network request entirely.&lt;/li&gt;
&lt;li&gt;It feeds the cached JSON into your new logic.&lt;/li&gt;
&lt;li&gt;The run succeeds in milliseconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38mccfyeuf9a1gipk8js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38mccfyeuf9a1gipk8js.png" alt="Preview of the resulting table using HTTP cache" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwm3h5s5gjyf9jem867t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwm3h5s5gjyf9jem867t.png" alt="Count of the resulting table using HTTP cache " width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Telemetry and Observability
&lt;/h3&gt;

&lt;p&gt;How do you know whether you hit the cache? FFT generates a &lt;code&gt;run_results.json&lt;/code&gt; artifact after every run, recording per node how many requests were made, how many were served from the cache, and the content hashes of the responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2273&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cache_hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content_hashes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sha256:110aa4d5dac630aa245ff3c3c53d7ea9bc4212df93f04d96f900ba9cb93f4622"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sha256:27ea31b0b9bb05c4feba2951d2f0a5f9dde340f0d19cc45722386e8951b794b5"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"7a8d720efd2b8afb319534d0d1f08b7937f666a14fea0952c3cbbe0c2442b6d9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"24850fd6c24df9ecd041d643331023d48d39b6c6bbf64080c76f86c95613a584"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"todos_ingest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"used_offline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
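&lt;p&gt;Because these fields are machine-readable, a CI job can assert on them. A small guard like the one below fails the build if a supposedly offline run still went to the network — the field names come from the fragment above, but where exactly the &lt;code&gt;http&lt;/code&gt; block sits inside &lt;code&gt;run_results.json&lt;/code&gt; is an assumption you should verify against your own artifact:&lt;/p&gt;

```python
def assert_fully_cached(http_stats: dict) -> None:
    # Fail loudly if the run was not served entirely from the HTTP cache.
    if not http_stats.get("used_offline"):
        raise AssertionError("run did not use offline mode")
    if http_stats.get("cache_hits", 0) < http_stats.get("requests", 0):
        raise AssertionError("some requests missed the cache")

# The values from the run_results.json fragment above pass the check.
assert_fully_cached({"cache_hits": 2, "requests": 2, "used_offline": True})
```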



&lt;p&gt;This gives you confidence that your CI/CD pipeline is deterministic. You can even commit your cache to git (for small reference datasets) to ensure your tests never flake due to external API downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Data engineering is moving toward software engineering best practices.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Reproducibility:&lt;/strong&gt; Your pipeline should produce the same result today as it did yesterday, regardless of the state of an external API.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Speed:&lt;/strong&gt; You shouldn't pay a latency penalty every time you test a logic change.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost:&lt;/strong&gt; If you are hitting a paid API, caching saves you money during development.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;FastFlowTransform brings the developer experience of "Localhost" to the messy world of Data Engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it out:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/FFTLabs/FastFlowTransform" rel="noopener noreferrer"&gt;[FastFlowTransform GitHub]&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastflowtransform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>api</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
