<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edwardvaneechoud</title>
    <description>The latest articles on Forem by Edwardvaneechoud (@edwardvaneechoud).</description>
    <link>https://forem.com/edwardvaneechoud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1565620%2F7ab3a32e-c02f-42ec-9cb1-3cf52c92432d.png</url>
      <title>Forem: Edwardvaneechoud</title>
      <link>https://forem.com/edwardvaneechoud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/edwardvaneechoud"/>
    <language>en</language>
    <item>
      <title>Flowfile v0.8.0 — Your Flows Can Run Themselves Now</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:15:21 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/flowfile-v080-your-flows-can-run-themselves-now-3im2</link>
      <guid>https://forem.com/edwardvaneechoud/flowfile-v080-your-flows-can-run-themselves-now-3im2</guid>
      <description>&lt;p&gt;&lt;em&gt;Full disclosure: I built Flowfile. This post is about a feature I just shipped and why I think it matters.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Up until this release, Flowfile was a tool you &lt;em&gt;designed&lt;/em&gt; pipelines in. You'd build a flow visually, maybe export the code, and then figure out how to actually run it on a schedule. That "figure out" part usually meant cron, Airflow, or a very optimistic shell script.&lt;/p&gt;

&lt;p&gt;v0.8.0 changes that. Flowfile now has a built-in scheduler. You can set an interval, trigger flows when data updates, or just hit "Run" from the catalog. No external orchestrator needed.&lt;/p&gt;

&lt;p&gt;I'm not going to pretend this competes with Airflow or Dagster for serious production workloads. It doesn't. But if you've ever had a lightweight data platform where you just needed a few flows to run on a timer without setting up a whole orchestration layer — this is that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtftip3w3sal022ot8hc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtftip3w3sal022ot8hc.png" alt="Schedules Overview showing scheduler status, summary cards, and schedule list with interval and table trigger types" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part I'm Most Excited About: Table Triggers
&lt;/h2&gt;

&lt;p&gt;Interval scheduling is useful but not interesting. "Run every 30 minutes" is solved. What I actually want to talk about is table triggers.&lt;/p&gt;

&lt;p&gt;The idea is simple: instead of running a flow on a timer, you run it when the data it depends on changes. A Catalog Writer node overwrites a table, and any flow watching that table starts automatically.&lt;/p&gt;

&lt;p&gt;In practice, this means you can chain flows together through data. Flow A produces a cleaned customer table. Flow B, which reads that table, runs as soon as Flow A finishes writing it. No coordination logic, no polling intervals to tune, no "let's just run everything at 3am and hope the order works out."&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Types, and Why
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Table trigger&lt;/strong&gt; — a flow runs when a single table is refreshed. The obvious case. Your flow reads &lt;code&gt;orders_raw&lt;/code&gt;, you set a trigger on &lt;code&gt;orders_raw&lt;/code&gt;, and the flow runs every time that table gets new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table set trigger&lt;/strong&gt; — a flow runs when &lt;em&gt;all&lt;/em&gt; tables in a set have been refreshed. This one would have saved me real time on past projects. Picture this: you have a reporting flow that joins three source tables — sales, inventory, and returns. You never want the report to run until all three are fresh. Without table set triggers, you solve this with retry logic, completion flags, or by scheduling everything sequentially and adding generous wait times. With table set triggers, you just list the three tables and the flow runs once all three have been updated. Declarative, no coordination code.&lt;/p&gt;
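&lt;p&gt;The firing rule is easy to sketch. This is a hypothetical illustration of the logic, not Flowfile's actual implementation: the trigger remembers which watched tables have been refreshed and fires once the set is complete.&lt;/p&gt;

```python
# Hypothetical sketch of table-set trigger logic (not Flowfile's code):
# fire once every watched table has been refreshed since the last run.

class TableSetTrigger:
    def __init__(self, flow_name, tables):
        self.flow_name = flow_name
        self.watched = set(tables)   # tables that must all be fresh
        self.refreshed = set()       # tables updated since the last fire

    def on_table_write(self, table_name):
        """Called whenever a Catalog Writer overwrites a table."""
        if table_name in self.watched:
            self.refreshed.add(table_name)
        if self.refreshed == self.watched:
            self.refreshed.clear()   # reset for the next cycle
            return True              # all inputs fresh: run the flow
        return False

trigger = TableSetTrigger("daily_report", ["sales", "inventory", "returns"])
trigger.on_table_write("sales")      # False: still waiting on two tables
trigger.on_table_write("inventory")  # False
trigger.on_table_write("returns")    # True: fire the flow
```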

&lt;p&gt;&lt;strong&gt;Unrestricted table trigger&lt;/strong&gt; — a flow can also trigger on a table it doesn't read. I debated whether to restrict triggers to only tables the flow actually depends on. Seemed cleaner. But I kept coming back to the same thought: I can't predict every use case people will have. Maybe you want a notification flow to fire whenever a table updates. Maybe you have a cleanup job that should run after any write to a particular namespace. I don't know. So I left it open.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Scheduler Works
&lt;/h2&gt;

&lt;p&gt;Three modes, depending on how you run Flowfile:&lt;/p&gt;

&lt;p&gt;If you're on the desktop app or &lt;code&gt;pip install&lt;/code&gt;, the scheduler runs inside the Flowfile process. Start and stop it from the Schedules tab. Simple.&lt;/p&gt;

&lt;p&gt;If you want it running independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flowfile run flowfile_scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a standalone background service that survives UI restarts. Only one scheduler instance runs at a time — there's an advisory database lock with a heartbeat. If a scheduler dies, another can take over after 90 seconds.&lt;/p&gt;
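&lt;p&gt;The lock-with-heartbeat pattern is worth spelling out. Here's a minimal in-memory sketch of the idea; the real scheduler keeps this state in the database, and the class and method names here are mine:&lt;/p&gt;

```python
import time

LEASE_SECONDS = 90  # takeover window from the post

class SchedulerLock:
    """Hypothetical in-memory sketch of an advisory lock with a heartbeat."""
    def __init__(self):
        self.owner = None
        self.last_heartbeat = 0.0

    def try_acquire(self, instance_id, now=None):
        now = time.time() if now is None else now
        expired = (now - self.last_heartbeat) > LEASE_SECONDS
        if self.owner is None or self.owner == instance_id or expired:
            self.owner = instance_id
            self.last_heartbeat = now
            return True
        return False  # another live scheduler holds the lock

    def heartbeat(self, instance_id, now=None):
        now = time.time() if now is None else now
        if self.owner == instance_id:
            self.last_heartbeat = now

lock = SchedulerLock()
lock.try_acquire("scheduler-a", now=0)    # True: lock is free
lock.try_acquire("scheduler-b", now=10)   # False: lease still live
lock.try_acquire("scheduler-b", now=200)  # True: lease expired, takeover
```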

&lt;p&gt;For Docker, add one environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;FLOWFILE_SCHEDULER_ENABLED=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I got right early: each flow can only have one active run at a time. If a flow is already running, new triggers are skipped. This prevents the cascade of overlapping runs that makes scheduling systems miserable to debug.&lt;/p&gt;
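&lt;p&gt;The guard itself is tiny. A hypothetical sketch of the rule, not Flowfile's code:&lt;/p&gt;

```python
# One active run per flow: a trigger is skipped if the flow already
# has a run in flight. (Illustrative sketch, not Flowfile's internals.)

active_runs = set()  # flow names currently executing

def try_start(flow_name):
    if flow_name in active_runs:
        return False   # skip the trigger: no overlapping runs
    active_runs.add(flow_name)
    return True

def finish(flow_name):
    active_runs.discard(flow_name)

try_start("orders_flow")  # True: starts
try_start("orders_flow")  # False: skipped, already running
finish("orders_flow")
try_start("orders_flow")  # True: previous run finished
```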

&lt;h2&gt;
  
  
  Running Flows Without a Schedule
&lt;/h2&gt;

&lt;p&gt;Not everything needs to be automated. Sometimes you just want to run a registered flow right now and see what happens.&lt;/p&gt;

&lt;p&gt;v0.8.0 adds a &lt;strong&gt;Run Flow&lt;/strong&gt; button directly in the catalog's flow detail panel. It spawns a subprocess, tracks the run in your history, and writes logs to &lt;code&gt;~/.flowfile/logs/&lt;/code&gt;. You can cancel a running flow at any time — it sends SIGTERM to the process.&lt;/p&gt;
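&lt;p&gt;The run-and-cancel semantics map directly onto the standard library. This sketch uses plain &lt;code&gt;subprocess&lt;/code&gt; rather than Flowfile's internal runner; &lt;code&gt;terminate()&lt;/code&gt; sends SIGTERM on POSIX, matching the cancel behaviour above:&lt;/p&gt;

```python
# Spawn a long-running child process, then cancel it with SIGTERM.
# (Stand-in for the real flow runner; the sleep stands in for a flow.)
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
)
proc.terminate()       # sends SIGTERM on POSIX
proc.wait(timeout=10)  # returncode is the negative signal number on POSIX
```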

&lt;p&gt;When flows are running, an active runs banner appears at the top of the catalog. You can see what's running, when it started, and cancel anything that's taking too long. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5s74axdy7dtqrfmw2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5s74axdy7dtqrfmw2x.png" alt="Run History showing scheduled, manual, and designer runs with status, duration, and node completion" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Being Honest About Scope
&lt;/h2&gt;

&lt;p&gt;This scheduler is not Airflow. It doesn't have retries with exponential backoff. It doesn't have DAG-level dependency management across dozens of flows. It doesn't have a distributed executor.&lt;/p&gt;

&lt;p&gt;What it does have: zero additional infrastructure. If you can &lt;code&gt;pip install flowfile&lt;/code&gt;, you have a scheduler. If you can run Docker Compose, you have a scheduler. For a team that needs three flows running on an interval and two more triggered by data updates, that's enough. For a lightweight data platform that's still figuring out what it needs, it's a proof of concept you can run before anyone decides whether Airflow is worth the setup.&lt;/p&gt;


&lt;p&gt;And if you outgrow it — that's fine. Flowfile still generates standalone Python code. The flows don't lock you in, and neither does the scheduler.&lt;/p&gt;

&lt;p&gt;Where this is heading: table triggers are a step, not the destination. The real goal is that you shouldn't be thinking about schedules at all. You should define how fresh your data needs to be, and the system figures out the rest. The catalog already tracks what each flow reads and writes, when tables were last updated, and how flows relate to each other. The pieces are there. Schedules are the scaffolding — data freshness is the building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;flowfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Edwardvaneechoud/Flowfile/releases" rel="noopener noreferrer"&gt;Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://edwardvaneechoud.github.io/Flowfile/users/visual-editor/schedules.html" rel="noopener noreferrer"&gt;Scheduling Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have questions or feedback, I'd genuinely love to hear it. Especially if you have a use case for table triggers I haven't thought of — that's exactly why I left them unrestricted.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Build Your Own DataFrame: a course based on an engine I probably shouldn't have written</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Fri, 20 Mar 2026 17:47:30 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/build-your-own-dataframe-a-course-based-on-an-engine-i-probably-shouldnt-have-written-2apd</link>
      <guid>https://forem.com/edwardvaneechoud/build-your-own-dataframe-a-course-based-on-an-engine-i-probably-shouldnt-have-written-2apd</guid>
      <description>&lt;p&gt;A few years ago, I needed a data processing engine for a visual ETL tool I was building — &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile&lt;/a&gt; — and against all sane practices, I just started writing one. Pure Python. &lt;code&gt;itertools.groupby&lt;/code&gt; for aggregation. &lt;code&gt;operator.itemgetter&lt;/code&gt; for column access, own type inference, manual memory optimization, custom everything.&lt;/p&gt;

&lt;p&gt;It handled joins, pivots, groupby, explode, filters — a working engine built entirely on the standard library. No numpy, no C extensions, no dependencies at all.&lt;/p&gt;

&lt;p&gt;Was this a good idea? Probably not the most efficient path. But it taught me something I couldn't have learned any other way: I understood exactly what a dataframe library does, because I'd built every piece of one myself.&lt;/p&gt;

&lt;p&gt;When I eventually migrated Flowfile's engine to Polars, the pure Python engine went into a drawer. That migration was driven by something I realized about focus: you can't do everything. Building a custom dataframe engine was a great way to learn, but it was a terrible way to ship a product. Flowfile needed things that actually mattered — the visual editor, the code generation, the user experience — not maintaining a homegrown query engine. Ironically, I'm now doing more for Flowfile than ever. But I believe (and hope, lol) they're the important things.&lt;/p&gt;

&lt;p&gt;Sometimes it's fun to look back at your old code — especially the code from before AI — and see how you were solving problems back then. The old engine was all me: no Copilot, no autocomplete suggestions, just reading docs and figuring it out. I kept thinking it would be fun to turn it into something people could learn from. So I cleaned it up, published it as &lt;a href="https://github.com/Edwardvaneechoud/pyfloe" rel="noopener noreferrer"&gt;pyfloe&lt;/a&gt;, and wrote a course around it — one where the why always comes before the how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with "how libraries work" content
&lt;/h2&gt;

&lt;p&gt;Most educational content about library internals falls into two buckets.&lt;/p&gt;

&lt;p&gt;The first is the conceptual walkthrough of production internals. "Let's look at how Polars implements joins!" And then you're reading about Rust iterators, SIMD intrinsics, and memory layouts — real concepts, but at a level of abstraction where it's hard to connect the ideas back to something you could build or modify yourself. You come away knowing the pieces exist without quite seeing how they fit together.&lt;/p&gt;

&lt;p&gt;The second is the toy example. Build a 50-line DataFrame class with &lt;code&gt;__init__&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, and &lt;code&gt;select&lt;/code&gt;. Wave your hands. "Real libraries do something like this, but more complicated." Clean mental model, almost no connection to how things actually work.&lt;/p&gt;

&lt;p&gt;Neither approach works for me. I've always learned the most from implementing things that needed to function — not reading about them, not building throwaway demos, but building something real enough that the architectural decisions actually matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two trees, one engine
&lt;/h2&gt;

&lt;p&gt;That's what pyfloe is. A lazy dataframe engine that implements simplified versions of the patterns behind real query engines — expression ASTs, the volcano execution model, hash joins, a rule-based optimizer — in about six files of readable Python.&lt;/p&gt;

&lt;p&gt;Here's what using it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyfloe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pf&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partition_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you know Polars, this looks familiar. That's intentional.&lt;/p&gt;

&lt;p&gt;The core insight — the thing that makes everything in pyfloe (and Polars, and Spark) click — is that the engine is built around two tree structures.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;plan tree&lt;/strong&gt; describes how data flows. Each node is an operation: scan a file, filter rows, join two tables, project columns. When you call &lt;code&gt;.collect()&lt;/code&gt;, the engine walks this tree from the root and pulls data upward.&lt;/p&gt;
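&lt;p&gt;That pull-based walk can be sketched with plain generators, each node asking its child for the next row. The node names here are illustrative, not pyfloe's exact classes:&lt;/p&gt;

```python
# Minimal volcano (pull) model: each plan node is a generator that
# pulls rows from its child on demand.
def scan(rows):
    for row in rows:
        yield row

def filter_node(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

data = [{"amount": 50, "id": 1}, {"amount": 150, "id": 2}]
plan = project(filter_node(scan(data), lambda r: r["amount"] > 100), ["id"])
print(list(plan))  # [{'id': 2}]
```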

&lt;p&gt;The &lt;strong&gt;expression tree&lt;/strong&gt; describes what to compute on each row. When you write &lt;code&gt;pf.col("amount") &amp;gt; 100&lt;/code&gt;, Python's &lt;code&gt;__gt__&lt;/code&gt; method doesn't compare anything — it builds a &lt;code&gt;BinaryExpr&lt;/code&gt; node with a &lt;code&gt;Col("amount")&lt;/code&gt; on the left and a &lt;code&gt;Lit(100)&lt;/code&gt; on the right.&lt;/p&gt;

&lt;p&gt;Here's the trick. You override &lt;code&gt;__gt__&lt;/code&gt; so it returns a new object instead of a boolean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__gt__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;BinaryExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_ensure_expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;col("amount") &amp;gt; 100&lt;/code&gt; doesn't evaluate — it builds a tree. And because that tree is data, you can walk it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# → {"price", "quantity"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two columns out of fifty. The optimizer now knows it can tell the scan node to skip the other forty-eight.&lt;/p&gt;

&lt;p&gt;Expressions live inside plan nodes. A &lt;code&gt;FilterNode&lt;/code&gt; doesn't hold a lambda — it holds an expression tree. And because that expression is inspectable, the optimizer can figure out which columns it needs and rewrite the plan to eliminate unnecessary work.&lt;/p&gt;
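&lt;p&gt;A minimal version of that inspection looks like this. The class shapes are a sketch of the pattern, not pyfloe's exact source:&lt;/p&gt;

```python
# Sketch of required_columns(): walk the expression tree and union
# the column names each node needs. (Illustrative, not pyfloe's code.)
import operator

class Col:
    def __init__(self, name):
        self.name = name
    def required_columns(self):
        return {self.name}
    def __gt__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.gt)
    def __mul__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.mul)

class Lit:
    def __init__(self, value):
        self.value = value
    def required_columns(self):
        return set()  # literals need no columns

class BinaryExpr:
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def required_columns(self):
        # union of whatever each side needs
        return self.left.required_columns() | self.right.required_columns()
    def __gt__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.gt)

def _ensure_expr(x):
    return x if isinstance(x, (Col, Lit, BinaryExpr)) else Lit(x)

expr = (Col("price") * Col("quantity")) > 1000
print(expr.required_columns())  # {'price', 'quantity'}
```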

&lt;p&gt;This is the same general approach Polars uses — though Polars goes much further with cost-based optimization, parallelism, and columnar memory. The difference is that in pyfloe, you can open &lt;code&gt;plan.py&lt;/code&gt; and read the whole thing in twenty minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the course covers
&lt;/h2&gt;

&lt;p&gt;I wrote a free course around pyfloe: &lt;a href="https://edwardvaneechoud.github.io/pyfloe-tutorial/introduction/" rel="noopener noreferrer"&gt;Build Your Own DataFrame&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The whole thing runs on Pyodide — Python in the browser via WebAssembly. Every module has interactive code blocks where you write actual Python, run it, and see the results. When "trying it" is just a click, more people actually try it.&lt;/p&gt;

&lt;p&gt;Five modules, each building on the last. You start with eager vs. lazy execution and the volcano model. Then you hijack Python's dunder methods so that &lt;code&gt;col("x") + col("y")&lt;/code&gt; builds an inspectable tree instead of adding anything. Then you plug expressions into plan nodes, implement hash joins and aggregation, and end with the optimizer — filter pushdown and column pruning, where the plan tree and expression trees finally converge.&lt;/p&gt;

&lt;p&gt;You build simplified versions of each layer, then compare to the real pyfloe source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest boundaries
&lt;/h2&gt;

&lt;p&gt;If it needs saying: pyfloe is not competing with Polars on performance. It can't — Python is orders of magnitude slower for this kind of work. Polars is one of those libraries that changed the way people interact with data in Python, and it's at the forefront of how a dataframe API should work. pyfloe implements the same structures in pure Python — so you can read them, break them, and understand them.&lt;/p&gt;

&lt;p&gt;Where it makes sense, we simplify. The optimizer is rule-based, not cost-based. Real engines like Polars estimate row counts and pick join strategies dynamically. pyfloe applies fixed rules — push filters down, prune unused columns — which is enough to show you &lt;em&gt;why&lt;/em&gt; optimizers exist and how they rewrite plan trees.&lt;/p&gt;
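&lt;p&gt;One such rule, sketched over hypothetical dict-shaped plan nodes (not pyfloe's real classes): if every column a filter needs survives the projection below it, swap the two so the filter runs first and fewer rows flow through the projection.&lt;/p&gt;

```python
# Rule-based filter pushdown over a toy plan tree of dicts.
def push_filters_down(node):
    if node["op"] == "filter" and node["child"]["op"] == "project":
        project = node["child"]
        if node["needs"].issubset(set(project["columns"])):
            # rewrite filter(project(x)) into project(filter(x))
            new_filter = dict(node, child=project["child"])
            return dict(project, child=push_filters_down(new_filter))
    if "child" in node:
        return dict(node, child=push_filters_down(node["child"]))
    return node

plan = {
    "op": "filter", "needs": {"amount"},
    "child": {
        "op": "project", "columns": ["amount", "region"],
        "child": {"op": "scan", "source": "orders.csv"},
    },
}
optimized = push_filters_down(plan)
print(optimized["op"])           # 'project'
print(optimized["child"]["op"])  # 'filter'
```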

&lt;p&gt;The course also doesn't try to cover everything. Streaming I/O is in a bonus section, not the main path. Five modules, focused on the core architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters to me
&lt;/h2&gt;

&lt;p&gt;I like understanding how things work. That's why I built the engine from scratch in the first place, and it's why I turned it into a course instead of just deleting it. Flowfile ended up being the tool I actually use — a visual ETL platform where you can build pipelines, inspect data at every step, and export clean Polars code. pyfloe is the original engine that made Flowfile possible, stripped down to where you can see every moving part.&lt;/p&gt;

&lt;p&gt;If you're the kind of person who wants to know what a dataframe library actually does when you call &lt;code&gt;.filter()&lt;/code&gt; or &lt;code&gt;.group_by()&lt;/code&gt;, the course is free and the code is all Python you can read. That's the whole pitch.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Course:&lt;/strong&gt; &lt;a href="https://edwardvaneechoud.github.io/pyfloe-tutorial/introduction/" rel="noopener noreferrer"&gt;Build Your Own DataFrame&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Library:&lt;/strong&gt; &lt;a href="https://github.com/Edwardvaneechoud/pyfloe" rel="noopener noreferrer"&gt;pyfloe on GitHub&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;The tool that started it:&lt;/strong&gt; &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>learning</category>
      <category>dataframe</category>
    </item>
    <item>
      <title>Stop Drawing ETL Diagrams — Your Python Code Visualizes Itself</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Thu, 29 May 2025 16:26:48 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/stop-drawing-etl-diagrams-your-python-code-visualizes-itself-16gn</link>
      <guid>https://forem.com/edwardvaneechoud/stop-drawing-etl-diagrams-your-python-code-visualizes-itself-16gn</guid>
      <description>&lt;p&gt;Ever wished you could write Python code and get the clarity of a visual data flow? That's exactly what Flowfile offers with FlowFrame — a Polars-like API that silently builds a visual ETL graph as you code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem we're solving
&lt;/h2&gt;

&lt;p&gt;As data engineers and scientists, we often find ourselves in one of two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing complex Python code that's powerful but hard to explain&lt;/li&gt;
&lt;li&gt;Drawing diagrams that are clear but disconnected from actual implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if your code could be both powerful AND self-documenting?&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter FlowFrame: code that visualizes itself
&lt;/h2&gt;

&lt;p&gt;FlowFrame, part of the open-source &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile project&lt;/a&gt;, bridges this gap by combining the precision of Python coding with the clarity of visual ETL pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key benefits:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write familiar, Polars-style Python code&lt;/li&gt;
&lt;li&gt;Automatically generate visual pipelines you can view, edit, and share&lt;/li&gt;
&lt;li&gt;Seamlessly switch between code and visual interfaces&lt;/li&gt;
&lt;li&gt;Full LazyFrame API support (new in v0.3.3!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install Flowfile with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;flowfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  See it in action
&lt;/h2&gt;

&lt;p&gt;Let's explore with a practical example using a sales dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;open_graph_in_editor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;

&lt;span class="c1"&gt;# Read the sales data
&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files/Superstore Sales Dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Helper function to clean column names
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;correct_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Column names to lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sales_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;correct_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Advanced transformations with the FlowFrame API
&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sales_clean&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate shipping time in days
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ship_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;total_days&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract year from order date for trend analysis
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Create sales tiers based on order value
&lt;/span&gt;        &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Clean up state codes (handle nulls)
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state_clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create order characteristics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove invalid shipping times&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate customer concentration ratio
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_concentration_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create nice readable percentages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Filter on significant segments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The magic happens here — visualize your pipeline!
&lt;/span&gt;&lt;span class="nf"&gt;open_graph_in_editor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flow_graph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launches the Flowfile Designer, showing your entire pipeline visually. Each node represents an operation, allowing you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualize data flow&lt;/li&gt;
&lt;li&gt;Debug by inspecting results at each step&lt;/li&gt;
&lt;li&gt;See data previews&lt;/li&gt;
&lt;li&gt;Make visual edits that sync back to code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpptf8bwd08wl820nkeap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpptf8bwd08wl820nkeap.png" alt="Flowfile Designer showing a visual ETL pipeline with connected nodes representing data transformations, filters, and aggregations from the sales analysis example" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in v0.3.3: full Polars LazyFrame support
&lt;/h2&gt;

&lt;p&gt;The latest release brings massive improvements to the FlowFrame API:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Type safety
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now with full type hints and autocompletion!
&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_concentration_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cum_sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# IDE knows all available methods
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Dynamic expression methods
&lt;/h3&gt;

&lt;p&gt;All Polars expression methods are now available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# String operations  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Datetime operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Statistical operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly_avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
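&lt;p&gt;For intuition on the last example: &lt;code&gt;rolling_mean(7)&lt;/code&gt; averages each value with its six predecessors, and positions without a full window come back null. A plain-Python sketch of those semantics (an illustration, not the Flowfile or Polars implementation):&lt;/p&gt;

```python
def rolling_mean(values, window):
    # Trailing window: position i averages the last `window` values ending at i;
    # positions without a full window yield None (like the leading nulls Polars
    # emits by default)
    out = []
    for i in range(len(values)):
        if i + 1 >= window:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
        else:
            out.append(None)
    return out

print(rolling_mean([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```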



&lt;h3&gt;
  
  
  3. User-defined functions support
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Custom functions are now tracked in the graph!
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;custom_transform&lt;/span&gt; 

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_transform&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhomqggur4k0drj5p21qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhomqggur4k0drj5p21qz.png" alt="Flowfile Designer interface displaying a custom function node integrated within the visual pipeline, demonstrating how user-defined functions appear as trackable components in the flow graph" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Explaining complex transformations
&lt;/h3&gt;

&lt;p&gt;Ever tried explaining a complex data pipeline to non-technical stakeholders? With FlowFrame, build in Python, then share a visual flow everyone understands.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Debugging made visual
&lt;/h3&gt;

&lt;p&gt;When something doesn't look right, seeing the transformation flow and inspecting data at each step makes troubleshooting much faster than combing through lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Team collaboration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Data scientists can start in code
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_complex_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quarterly_analysis.flowfile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Analysts can continue visually in Flowfile Designer
# No code required!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Flowfile is constantly evolving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tighter code-visual mapping&lt;/strong&gt;: Every transformation accurately reflected in both directions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual-to-code generation&lt;/strong&gt;: Generate clean Python code from visual designs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific workflows&lt;/strong&gt;: Specialized support for ML pipelines, time series analysis, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud integrations&lt;/strong&gt;: Direct connections to cloud data warehouses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;FlowFrame bridges the gap between code and visual ETL tools, offering a powerful way to build data pipelines that are both efficient and understandable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your turn! Install and try it:
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt;

&lt;span class="c1"&gt;# Then run the example above or try your own data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project is fully open-source on GitHub. We'd love your feedback, contributions, or ideas for new features!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming data with FastAPI &amp; Vue made easy</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Fri, 25 Apr 2025 17:36:51 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/streaming-data-with-fastapi-vue-made-easy-2926</link>
      <guid>https://forem.com/edwardvaneechoud/streaming-data-with-fastapi-vue-made-easy-2926</guid>
      <description>&lt;h2&gt;
  
  
  Streaming data with FastAPI &amp;amp; Vue made easy
&lt;/h2&gt;

&lt;p&gt;Want effortless live updates — stock prices, notifications, logs — in your web app without complex setup? Server-Sent Events (SSE) simplify this using standard HTTP. They provide an efficient server-to-client push mechanism, hitting a sweet spot between clunky polling and potentially overkill WebSockets for one-way data. This guide demonstrates how to implement SSE with FastAPI and Vue.js for responsive results.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events (SSE) provide a simple way to stream real-time data from your server to web clients using standard HTTP. This guide demonstrates a minimal implementation with FastAPI and Vue.js, showing system RAM usage monitored in real-time. Get the code and try it yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Edwardvaneechoud/stream-logs-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;stream-logs-demo
git checkout feature/simple_example
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn psutil
python stream_logs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser to &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt; and click "Start Monitoring RAM" to see it in action. And if you want all the details… read on!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtcz3lxxxkoq5k255q4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtcz3lxxxkoq5k255q4x.png" alt="RAM monitor demo" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Server-Sent Events (SSE)?
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events (SSE) is a standard web technology designed for exactly this kind of one-way push: a simple, efficient way to stream data over HTTP. SSE keeps a single HTTP connection open between client and server after the initial request, letting the server push updates whenever new information is available.&lt;/p&gt;

&lt;p&gt;Compared to alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Polling&lt;/strong&gt;: With polling, the client repeatedly asks the server for updates at regular intervals. This can lead to unnecessary network traffic and delayed data. SSE is more efficient because the server pushes updates only when there's new data — no need for the client to keep checking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSockets&lt;/strong&gt;: WebSockets are built for full-duplex communication — both the client and server can send messages independently at any time. This makes them ideal for real-time, interactive applications like live chat, multiplayer games, or collaborative tools where two-way communication is essential. But if your use case only requires the server to push updates — for example, live logs, real-time monitoring, or status feeds — WebSockets can be unnecessarily complex. That's where SSE shines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because SSE runs over regular HTTP, there's no need to introduce a new protocol — it just works like any other HTTP request. That means you can plug it directly into most existing setups with zero friction. If your app already talks over HTTP, you're ready to stream.&lt;/p&gt;
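&lt;p&gt;The wire format itself is just text: each message is one or more &lt;code&gt;data:&lt;/code&gt; lines terminated by a blank line. Framing a message is plain string code, sketched here independently of FastAPI:&lt;/p&gt;

```python
def sse_frame(payload: str) -> str:
    # One SSE message: prefix each line with "data: ", end with a blank line
    lines = payload.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"

print(sse_frame("ram: 42%"))
```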

&lt;h2&gt;
  
  
  Implementing SSE in FastAPI
&lt;/h2&gt;

&lt;p&gt;In FastAPI, the implementation relies on &lt;code&gt;StreamingResponse&lt;/code&gt; combined with an asynchronous generator. Here's the core of &lt;code&gt;stream_logs.py&lt;/code&gt; that sets up the SSE endpoint (the full file also includes app setup, static file serving, and so on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From stream_logs.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="c1"&gt;# Assume 'app = FastAPI()' is defined elsewhere
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. SSE Message Formatter
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Ensures data adheres to the 'data: ...\n\n' SSE format
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Asynchronous Data Generator
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_logs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Initial log entry
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting system RAM monitoring...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Send 20 RAM usage log entries
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Get current timestamp
&lt;/span&gt;        &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;localtime&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Get RAM usage
&lt;/span&gt;        &lt;span class="n"&gt;ram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;virtual_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ram_percent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percent&lt;/span&gt;
        &lt;span class="n"&gt;ram_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert to GB
&lt;/span&gt;        &lt;span class="n"&gt;ram_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram_total&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ram_percent&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;

        &lt;span class="c1"&gt;# Format log entry
&lt;/span&gt;        &lt;span class="n"&gt;log_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - RAM usage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_percent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_used&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_total&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# One second delay between logs
&lt;/span&gt;
    &lt;span class="c1"&gt;# Final log entry
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAM monitoring completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. FastAPI Endpoint using StreamingResponse
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_logs_endpoint&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Return the StreamingResponse, feeding it the generator
&lt;/span&gt;    &lt;span class="c1"&gt;# and setting the essential media_type for SSE
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;stream_logs&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# The generator provides the data stream
&lt;/span&gt;        &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Identifies the stream as SSE
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;format_sse_message&lt;/code&gt;&lt;/strong&gt;: This utility function ensures each piece of data sent conforms to the SSE specification. The &lt;code&gt;data:&lt;/code&gt; prefix followed by the message payload and two newline characters (&lt;code&gt;\n\n&lt;/code&gt;) is the standard way to delimit messages in an event stream. Using &lt;code&gt;json.dumps&lt;/code&gt; ensures the data is reliably encoded as a JSON string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;stream_logs&lt;/code&gt;&lt;/strong&gt;: This async function is a generator because it uses the &lt;code&gt;yield&lt;/code&gt; keyword. Instead of running to completion and returning a single result, it can pause its execution state with &lt;code&gt;yield&lt;/code&gt;, sending back the specified value (here, the formatted message). When iterated over (by &lt;code&gt;StreamingResponse&lt;/code&gt;), its execution resumes from where it left off until the next yield or the function completes. The &lt;code&gt;await asyncio.sleep(1)&lt;/code&gt; simulates an asynchronous operation, like waiting for new data to become available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;@app.get("/api/stream")&lt;/code&gt;&lt;/strong&gt;: This decorator defines the HTTP GET endpoint that clients will connect to initiate the SSE stream.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;StreamingResponse(...)&lt;/code&gt;&lt;/strong&gt;: This FastAPI class is key for sending data incrementally. It's initialized with the result of calling the generator (&lt;code&gt;stream_logs()&lt;/code&gt;), which gives it an iterator object. &lt;code&gt;StreamingResponse&lt;/code&gt; pulls values from this iterator (triggering the generator's execution up to the next yield) and sends each value immediately over the network connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;media_type="text/event-stream"&lt;/code&gt;&lt;/strong&gt;: Setting this specific IANA media type in the &lt;code&gt;StreamingResponse&lt;/code&gt; is crucial. It sends the correct Content-Type header, signaling to the browser that this response is a Server-Sent Event stream, enabling the browser to use its built-in EventSource API to listen for and process the incoming messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
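&lt;p&gt;To make the wire format concrete, here is a small standalone sketch of round-tripping one SSE message. The &lt;code&gt;parse_sse_message&lt;/code&gt; helper is hypothetical, added only to illustrate what a client has to undo:&lt;/p&gt;

```python
import json

def format_sse_message(data: str) -> str:
    # Each SSE message is a "data:" line terminated by a blank line
    return f"data: {json.dumps(data)}\n\n"

def parse_sse_message(raw: str) -> str:
    # Hypothetical helper: strip the "data: " prefix and decode the JSON payload
    payload = raw.strip()[len("data: "):]
    return json.loads(payload)

message = format_sse_message("Starting system RAM monitoring...")
decoded = parse_sse_message(message)
```

&lt;p&gt;That decoding step is exactly what &lt;code&gt;JSON.parse(event.data)&lt;/code&gt; does on the browser side.&lt;/p&gt;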

&lt;p&gt;Okay, now that we have the backend sending the stream, let's look at how the frontend receives and displays these updates using Vue.js and the browser's built-in EventSource API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend: Receiving Events with EventSource
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend Setup Note&lt;/strong&gt;: For this demonstration, the frontend is kept intentionally simple. We use a single &lt;code&gt;index.html&lt;/code&gt; file which includes Vue.js directly from a CDN. Basic styling is provided in &lt;code&gt;style.css&lt;/code&gt;. The core logic for connecting to the SSE stream and updating the display resides in &lt;code&gt;app.js&lt;/code&gt;. You can explore the full frontend code in the repository.&lt;/p&gt;

&lt;p&gt;On the client side, the browser's built-in EventSource API is used to connect to the SSE stream and process incoming messages. While the full &lt;code&gt;static/app.js&lt;/code&gt; example uses Vue.js for display, let's isolate the essential EventSource interactions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Set Up the Reactive State
&lt;/h3&gt;

&lt;p&gt;Before we initiate any streaming, we define our reactive state and prepare variables in the &lt;code&gt;setup()&lt;/code&gt; function. This is where Vue's Composition API comes into play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From static/app.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logLines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;([]);&lt;/span&gt;        &lt;span class="c1"&gt;// Reactive array bound to the template&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isStreaming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Tracks whether we're currently streaming&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Will hold the SSE connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;logLines&lt;/code&gt; is a reactive array. Each time we push new data into it, Vue will automatically update the DOM—no manual re-rendering needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isStreaming&lt;/code&gt; tracks whether a stream is currently active. It's used to toggle UI state (like disabling the button while data is being streamed).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eventSource&lt;/code&gt; is declared here so it can be shared across functions (&lt;code&gt;startStreaming&lt;/code&gt;, &lt;code&gt;closeEventSource&lt;/code&gt;). It's initially null and gets initialized when we start the stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Define the Start Streaming Logic
&lt;/h3&gt;

&lt;p&gt;This function runs when the user clicks the "Start Monitoring RAM" button. It clears existing logs, updates state, and opens the SSE connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startStreaming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;logLines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;           &lt;span class="c1"&gt;// Clear previous logs&lt;/span&gt;
  &lt;span class="nx"&gt;isStreaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// Update UI state&lt;/span&gt;
  &lt;span class="c1"&gt;// Create the SSE connection&lt;/span&gt;
  &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/stream&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Set up message and error handlers (see next section)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;: This function sets up the streaming session. Creating the &lt;code&gt;EventSource&lt;/code&gt; immediately starts connecting to the endpoint; the handlers we attach next determine how incoming events are processed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Handle Incoming Events (onmessage)
&lt;/h3&gt;

&lt;p&gt;The logic for handling messages from the server is defined inside the &lt;code&gt;startStreaming&lt;/code&gt; function, immediately after the SSE connection is created. Here's what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Still inside startStreaming()&lt;/span&gt;

&lt;span class="nx"&gt;eventSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;logLines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Add new log entry to the reactive array&lt;/span&gt;

    &lt;span class="c1"&gt;// If the server signals that monitoring has completed, stop streaming&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;RAM monitoring completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;closeEventSource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// cleanup the connection and reset the UI components&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error parsing data:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;eventSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EventSource error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;closeEventSource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;: Once the server begins sending messages, &lt;code&gt;onmessage&lt;/code&gt; is triggered for each new piece of data. We parse the event, push the result to &lt;code&gt;logLines&lt;/code&gt;, and let Vue's reactivity do the rest—your UI updates automatically. If the message is "RAM monitoring completed", we assume the stream is done and clean up by calling &lt;code&gt;closeEventSource()&lt;/code&gt;. We also define &lt;code&gt;onerror&lt;/code&gt; here to catch any disconnections or issues. This ensures we always close the stream gracefully and reset the UI state if something goes wrong.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Reminder&lt;/strong&gt;: This entire block lives inside the &lt;code&gt;startStreaming()&lt;/code&gt; function, which is returned from &lt;code&gt;setup()&lt;/code&gt; and triggered when the user clicks the "Start Monitoring RAM" button.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Rendering the Data in HTML
&lt;/h3&gt;

&lt;p&gt;Finally, we connect the reactive Vue properties directly to our HTML using standard Vue template syntax. Take a look at the relevant snippet from &lt;code&gt;index.html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="na"&gt;click=&lt;/span&gt;&lt;span class="s"&gt;"startStreaming"&lt;/span&gt; &lt;span class="na"&gt;:disabled=&lt;/span&gt;&lt;span class="s"&gt;"isStreaming"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  Start Monitoring RAM
&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"log-container"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;v-for=&lt;/span&gt;&lt;span class="s"&gt;"(line, index) in logLines"&lt;/span&gt; &lt;span class="na"&gt;:key=&lt;/span&gt;&lt;span class="s"&gt;"index"&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"log-line"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    {{ line }}
  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;@click&lt;/code&gt; directive on the button binds the &lt;code&gt;startStreaming&lt;/code&gt; function to the click event.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;:disabled="isStreaming"&lt;/code&gt; disables the button while the stream is active.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;v-for&lt;/code&gt; loop dynamically renders each message in &lt;code&gt;logLines&lt;/code&gt; as a new row in the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📁 You can view the complete HTML structure in &lt;code&gt;index.html&lt;/code&gt; in the repo if you'd like to see how the full layout is styled and structured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Demo Locally
&lt;/h2&gt;

&lt;p&gt;Now that we've walked through the backend setup and the frontend logic, it's time to see the demo in action.&lt;/p&gt;

&lt;p&gt;Clone the repository and move into the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Edwardvaneechoud/stream-logs-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;stream-logs-demo
git checkout feature/simple_example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the required dependencies and start the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn psutil
python stream_logs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open your browser and navigate to &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. Click the "Start Monitoring RAM" button and watch the RAM usage update in real time. If you check the terminal where FastAPI is running, you'll see only a single request in the logs (GET /api/stream) for the entire stream, yet the page keeps updating.&lt;/p&gt;
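&lt;p&gt;If you want to check the endpoint without a browser, the stream can also be consumed from Python. This is a minimal sketch using only the standard library; the &lt;code&gt;read_sse_events&lt;/code&gt; helper is something I'm adding here for illustration, it is not part of the demo repo:&lt;/p&gt;

```python
import json
import urllib.request

def read_sse_events(stream):
    # Yield decoded payloads from any iterable of SSE-formatted byte lines
    for raw_line in stream:
        line = raw_line.decode("utf-8").strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# Example usage against the demo server (assumes it is running on localhost:8000):
# with urllib.request.urlopen("http://localhost:8000/api/stream") as response:
#     for event in read_sse_events(response):
#         print(event)
```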

&lt;p&gt;You can explore the complete implementation of this demo at &lt;a href="https://github.com/Edwardvaneechoud/stream-logs-demo/tree/feature/simple_example" rel="noopener noreferrer"&gt;github.com/Edwardvaneechoud/stream-logs-demo/tree/feature/simple_example&lt;/a&gt; or check out the more advanced example at &lt;a href="https://github.com/Edwardvaneechoud/stream-logs-demo/" rel="noopener noreferrer"&gt;github.com/Edwardvaneechoud/stream-logs-demo/&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay Tuned!
&lt;/h2&gt;

&lt;p&gt;This simple example demonstrates the core principles of Server-Sent Events. In my next post, I'll show how you can use SSE to stream real-time log information directly from your Python applications by implementing a custom logging handler. You'll learn how to capture and stream logs from any Python process without writing to files, making it easier to monitor and debug your applications in real-time. You'll see exactly how these principles scale to monitor complex data flows without polling.&lt;/p&gt;
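&lt;p&gt;As a teaser, the rough idea behind such a handler looks something like this. This is a minimal sketch of the general technique, not the implementation from the upcoming post, and the &lt;code&gt;QueueLogHandler&lt;/code&gt; name is my own:&lt;/p&gt;

```python
import logging
import queue

class QueueLogHandler(logging.Handler):
    # Push formatted log records onto an in-memory queue instead of a file;
    # an SSE generator can then drain the queue and yield entries to clients.
    def __init__(self, log_queue: queue.Queue):
        super().__init__()
        self.log_queue = log_queue

    def emit(self, record: logging.LogRecord) -> None:
        self.log_queue.put(self.format(record))

log_queue = queue.Queue()
logger = logging.getLogger("pipeline-demo")
logger.setLevel(logging.INFO)
logger.addHandler(QueueLogHandler(log_queue))

logger.info("pipeline started")
captured = log_queue.get_nowait()  # "pipeline started"
```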

&lt;p&gt;If you found this helpful, follow me to catch the upcoming deep-dive where I'll break down the monitoring implementation from the main branch, complete with code examples and performance insights from a real production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extend the monitor to track CPU usage and disk I/O&lt;/li&gt;
&lt;li&gt;Build a real-time dashboard with multiple system metrics&lt;/li&gt;
&lt;li&gt;Handle different log levels (warning, error, info, debug)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy streaming!&lt;/p&gt;

&lt;p&gt;If you're interested in seeing Server-Sent Events applied in a more complex, real-world scenario, you can explore my personal project, Flowfile. I utilize SSE extensively in that project, specifically for efficiently streaming live logs from running data pipeline processes directly into the UI console, providing immediate feedback much like the examples discussed here. You can check out the implementation and the project on GitHub: &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;https://github.com/Edwardvaneechoud/Flowfile&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>vue</category>
      <category>python</category>
      <category>sse</category>
    </item>
    <item>
      <title>Building Flowfile: Architecting a Visual ETL Tool with Polars</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Sat, 19 Apr 2025 13:32:40 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c</link>
      <guid>https://forem.com/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile&lt;/a&gt; is my open-source project, born from trying to build an Alteryx alternative that combined two things I like: the straightforward visual style of drag-and-drop tools, and the challenge plus creative freedom you get coding with powerful libraries like &lt;a href="https://pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt;. Building it taught me a &lt;em&gt;ton&lt;/em&gt;, particularly about getting the architecture right. Since it was such an educational ride, I wanted to break down some of the technical bits here.&lt;/p&gt;

&lt;p&gt;So, for my first article here on dev.to, I'm excited to do just that – dive into those technical bits! This piece goes behind the scenes of Flowfile's architecture to reveal how data actually flows through the system, from the moment you drag a node onto the canvas right through to final execution. You'll learn about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How nodes and connections form a directed acyclic graph (DAG)&lt;/li&gt;
&lt;li&gt;The schema prediction system that provides real-time feedback without execution&lt;/li&gt;
&lt;li&gt;How execution modes leverage Polars' lazy evaluation for performance&lt;/li&gt;
&lt;li&gt;The worker-based execution model that optimizes memory usage&lt;/li&gt;
&lt;li&gt;Efficient inter-process communication using Apache Arrow IPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding these components, you'll gain insight into the design decisions made within Flowfile to balance user experience with performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: Three interconnected services
&lt;/h2&gt;

&lt;p&gt;Flowfile operates as three separate but interconnected components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Designer&lt;/strong&gt; (Electron + Vue): The visual interface users interact with.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Core&lt;/strong&gt; (FastAPI): The ETL engine that manages workflows, schema predictions, and orchestrates data transformations.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Worker&lt;/strong&gt; (FastAPI): Handles heavy computations and data caching.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcqccd7c28o89wcfzlvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcqccd7c28o89wcfzlvj.png" alt="Flowfile's architecture" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through what happens at each stage of building and executing a data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a data flow: What happens when you drag a node?
&lt;/h2&gt;

&lt;p&gt;When you drag a node onto the canvas in the Designer, Flowfile initiates a series of coordinated actions between the front-end and back-end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Node creation&lt;/strong&gt;: The Designer sends a request to the Core service to create a new node of the specified type.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Graph update&lt;/strong&gt;: The Core service registers this node by creating a &lt;code&gt;NodeStep&lt;/code&gt; instance in the workflow graph with placeholder functionality.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Connection setup&lt;/strong&gt;: As the user connects the new node to existing nodes, the Designer communicates these connections to the Core service.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Input schema inference&lt;/strong&gt;: Based on the upstream nodes, the Core service infers the input data schema and sends it back to the Designer, allowing the user to immediately see what data is available for the selected action.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Node configuration&lt;/strong&gt;: The user configures the node’s settings in the Designer (e.g., selecting columns, specifying transformations).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Schema prediction&lt;/strong&gt;: These settings are sent to the Core service, which uses them to calculate the predicted output schema.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Schema feedback&lt;/strong&gt;: The predicted schema is returned to the Designer and displayed in the interface, giving users real-time feedback on how their data will be shaped—before any execution occurs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Iteration&lt;/strong&gt;: When the user adds the next node, the cycle starts again from step 1, allowing complex workflows to be built step-by-step with continuous feedback.&lt;/li&gt;
&lt;/ol&gt;
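&lt;p&gt;Conceptually, the feedback loop in the steps above boils down to propagating predicted schemas through the graph, node by node. Here is a deliberately simplified, hypothetical sketch of that idea; none of these names are Flowfile's actual API:&lt;/p&gt;

```python
# Hypothetical sketch: propagate predicted schemas through a chain of nodes
def predict_output_schema(node, input_schema):
    # e.g. a "select columns" node keeps only its configured columns
    if node["type"] == "select":
        return [col for col in input_schema if col in node["columns"]]
    return input_schema  # pass-through for nodes that don't reshape data

def propagate(nodes, source_schema):
    # Walk nodes in topological order, feeding each its upstream schema
    schema = source_schema
    schemas = {}
    for node in nodes:
        schema = predict_output_schema(node, schema)
        schemas[node["id"]] = schema
    return schemas

pipeline = [
    {"id": "read", "type": "read"},
    {"id": "select", "type": "select", "columns": ["city", "sales"]},
]
predicted = propagate(pipeline, ["city", "sales", "region"])
```

&lt;p&gt;No data is ever touched: every node only needs its configuration and the schema of the node before it.&lt;/p&gt;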

&lt;p&gt;The constant interaction between the Core service and the Designer reduces the chance of user error and makes results more predictable. This schema prediction is one of Flowfile's most powerful features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core schema prediction function (simplified)
# Determines output structure without executing transformations
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_predicted_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the predicted schema for this node without executing the transformation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema_callback&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Use custom schema prediction if available
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Otherwise, infer from lazy evaluation by getting the schema from the plan
&lt;/span&gt;        &lt;span class="n"&gt;predicted_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_predicted_data_getter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Gets the Polars LazyFrame plan
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="c1"&gt;# Polars can get schema from a LazyFrame
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema prediction: How Flowfile knows your data structure in advance
&lt;/h2&gt;

&lt;p&gt;Schema prediction occurs through one of two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Schema Callbacks&lt;/strong&gt;: Custom functions, defined per node type, that calculate the expected output schema based on node configuration and input schemas.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Utilizing Polars' ability to determine the schema of a planned transformation (&lt;code&gt;LazyFrame&lt;/code&gt;) without processing the full dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach provides immediate feedback to users about the structure of their data pipeline without requiring expensive computation.&lt;/p&gt;

&lt;p&gt;For example, when you set up a "Group By" operation, a schema callback can tell you exactly what the output columns will be based on your aggregation choices, before processing any data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of a schema callback for a Group By node
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;schema_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate the output schema for a group by operation based on aggregation choices.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract column info from settings (simplified representation)
&lt;/span&gt;    &lt;span class="n"&gt;output_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;group_by_settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg_cols&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Get input schema info from the upstream node ('depends_on')
&lt;/span&gt;    &lt;span class="n"&gt;input_schema_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;depends_on&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Construct the output schema based on settings and input types
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Infer data type if not explicitly set by the aggregation
&lt;/span&gt;        &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_schema_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;
        &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output_schema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The directed acyclic graph (DAG): The foundation of workflows
&lt;/h2&gt;

&lt;p&gt;As you add and connect nodes, Flowfile builds a Directed Acyclic Graph (DAG) where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt; represent data operations (read file, filter, join, write to database, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt; represent the flow of data between operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DAG is managed by the &lt;code&gt;EtlGraph&lt;/code&gt; class in the Core service, which orchestrates the entire workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EtlGraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Manages the ETL workflow as a DAG. Stores nodes, dependencies,
    and settings, and handles the execution order.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;_node_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Internal storage for all node steps
&lt;/span&gt;    &lt;span class="n"&gt;_flow_starts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;               &lt;span class="c1"&gt;# Nodes that initiate data flow (e.g., readers)
&lt;/span&gt;    &lt;span class="n"&gt;_node_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;           &lt;span class="c1"&gt;# Tracking node identifiers
&lt;/span&gt;    &lt;span class="n"&gt;flow_settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowSettings&lt;/span&gt;        &lt;span class="c1"&gt;# Global configuration for the flow
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_node_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Adds a new processing node (NodeStep) to the graph.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;node_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_node_db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_step&lt;/span&gt;
        &lt;span class="c1"&gt;# Additional logic to manage dependencies and flow starts...
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RunInformation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Executes the entire flow in the correct topological order.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;execution_order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Determine correct sequence
&lt;/span&gt;        &lt;span class="n"&gt;run_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunInformation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;execution_order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Execute node based on mode (Development/Performance)
&lt;/span&gt;            &lt;span class="n"&gt;node_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_node&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Simplified representation
&lt;/span&gt;            &lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;run_info&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determines the correct order to execute nodes based on dependencies.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Standard DAG topological sort algorithm...
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;NodeStep&lt;/code&gt; in the graph encapsulates information about its dependencies, transformation logic, and output schema. This structure allows Flowfile to determine execution order, track data lineage, optimize performance, and provide schema predictions throughout the pipeline.&lt;/p&gt;
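&lt;p&gt;The &lt;code&gt;topological_sort&lt;/code&gt; body is elided above. Flowfile's actual implementation isn't shown here, but the standard approach is Kahn's algorithm; a self-contained sketch over plain node ids (the graph literal below is illustrative, not Flowfile's data model):&lt;/p&gt;

```python
from collections import deque
from typing import Dict, Hashable, List

def topological_sort(downstream: Dict[Hashable, List[Hashable]]) -> List[Hashable]:
    """Order nodes so every node runs after all nodes it depends on.

    `downstream` maps each node id to the ids of nodes consuming its output.
    """
    in_degree = {node: 0 for node in downstream}
    for consumers in downstream.values():
        for node in consumers:
            in_degree[node] = in_degree.get(node, 0) + 1

    # Start with nodes that have no upstream dependencies (e.g. file readers)
    ready = deque(node for node, deg in in_degree.items() if deg == 0)
    order: List[Hashable] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in downstream.get(node, []):
            in_degree[consumer] -= 1
            if in_degree[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(in_degree):
        raise ValueError("Graph contains a cycle; not a valid DAG")
    return order

# read_csv -> filter -> join, read_db -> join
order = topological_sort({"read_csv": ["filter"], "read_db": ["join"], "filter": ["join"]})
print(order)  # every node appears after its dependencies
```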

&lt;h2&gt;
  
  
  Leveraging lazy evaluation with Polars
&lt;/h2&gt;

&lt;p&gt;The real power of Flowfile comes from leveraging Polars' &lt;strong&gt;lazy evaluation&lt;/strong&gt;. Instead of processing data immediately at each step, Polars builds an optimized &lt;em&gt;execution plan&lt;/em&gt;. The actual computation is deferred until the result is explicitly requested (e.g., writing to a file or displaying data).&lt;/p&gt;

&lt;p&gt;This provides several significant benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Reduced Memory Usage&lt;/strong&gt;: Data is loaded and processed only when necessary, often streaming through operations without loading entire datasets into memory at once.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query Optimization&lt;/strong&gt;: Polars analyzes the entire plan and can reorder, combine, or eliminate operations for maximum efficiency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Parallel Execution&lt;/strong&gt;: Polars automatically parallelizes operations across available CPU cores during execution.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Predicate Pushdown&lt;/strong&gt;: Filters and selections are applied as early as possible in the plan, often directly at the data source level (like during file reading), minimizing the amount of data that needs to be processed downstream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider reading, filtering, and writing data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traditional eager approach (less efficient):
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Reads entire file into memory immediately
&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Filters the in-memory dataframe
&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Writes the filtered result
&lt;/span&gt;
&lt;span class="c1"&gt;# Polars' lazy evaluation approach (efficient):
# 1. Build the plan (no data loaded yet)
&lt;/span&gt;&lt;span class="n"&gt;lazy_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Creates a plan to read the CSV
&lt;/span&gt;&lt;span class="n"&gt;filtered_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lazy_plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Adds filtering to the plan
&lt;/span&gt;
&lt;span class="c1"&gt;# 2. Execute the optimized plan (only happens here)
&lt;/span&gt;&lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_plan&lt;/span&gt;
&lt;span class="n"&gt;filtering&lt;/span&gt; &lt;span class="n"&gt;efficiently&lt;/span&gt;
&lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sink_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Writes the final result
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flowfile uses this lazy approach extensively, especially in Performance Mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution modes: Development vs. Performance
&lt;/h2&gt;

&lt;p&gt;Flowfile offers two execution modes tailored for different needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Development Mode&lt;/th&gt;
&lt;th&gt;Performance Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interactive debugging, step inspection&lt;/td&gt;
&lt;td&gt;Optimized execution for production/speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes node-by-node&lt;/td&gt;
&lt;td&gt;Builds full plan, executes minimally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Caches intermediate results per step&lt;/td&gt;
&lt;td&gt;Minimal caching (only if specified/needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Preview Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available for all nodes&lt;/td&gt;
&lt;td&gt;Only for final/cached nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Potentially higher&lt;/td&gt;
&lt;td&gt;Generally lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Faster for complex flows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Development Mode
&lt;/h3&gt;

&lt;p&gt;In Development mode, each node's transformation is triggered sequentially. After a node executes (within the Worker service), its intermediate result is typically serialized using &lt;strong&gt;Apache Arrow IPC format&lt;/strong&gt; and cached to disk. This allows you to inspect the data at each step in the Designer via small samples fetched from the cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Mode
&lt;/h3&gt;

&lt;p&gt;In Performance mode, Flowfile fully embraces Polars' lazy evaluation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The Core service constructs the &lt;em&gt;entire&lt;/em&gt; Polars execution plan based on the DAG.&lt;/li&gt;
&lt;li&gt; This plan (&lt;code&gt;LazyFrame&lt;/code&gt;) is passed to the Worker service.&lt;/li&gt;
&lt;li&gt; The Worker only &lt;em&gt;materializes&lt;/em&gt; (executes &lt;code&gt;.collect()&lt;/code&gt; or &lt;code&gt;.sink_*()&lt;/code&gt;) the plan when:

&lt;ul&gt;
&lt;li&gt;An output node (like writing to a file) requires the final result.&lt;/li&gt;
&lt;li&gt;A node is explicitly configured to cache its results (&lt;code&gt;node.cache_results&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This minimizes computation and memory usage by avoiding unnecessary intermediate materializations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Execution logic in Performance Mode (simplified)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_performance_mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_output_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handles execution in performance mode, leveraging lazy evaluation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_output_node&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If result is needed (output or caching), trigger execution in Worker
&lt;/span&gt;        &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExternalDfFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Pass the LazyFrame plan
&lt;/span&gt;            &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Unique reference for caching
&lt;/span&gt;            &lt;span class="n"&gt;wait_on_completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt; &lt;span class="c1"&gt;# Usually async
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Worker executes .collect() or .sink_*() and caches if needed
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# May return LazyFrame or trigger compute
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="c1"&gt;# Or potentially just confirmation if sinking
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If not output/cached, just pass the LazyFrame plan along
&lt;/span&gt;        &lt;span class="c1"&gt;# No computation happens here for intermediate nodes
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crucially, &lt;strong&gt;all actual data processing and materialization of Polars DataFrames/LazyFrames happens in the Worker service&lt;/strong&gt;. This separation prevents large datasets from overwhelming the Core service, ensuring the UI remains responsive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficient inter-process communication (IPC) between Core and Worker
&lt;/h2&gt;

&lt;p&gt;Since Core orchestrates and Worker computes, efficient communication is vital. Flowfile uses &lt;strong&gt;Apache Arrow IPC format&lt;/strong&gt; for data exchange:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Worker Processing&lt;/strong&gt;: When the Worker needs to materialize data (either intermediate cache in Dev mode or final results), it computes the Polars DataFrame.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Serialization &amp;amp; Caching&lt;/strong&gt;: The resulting DataFrame is serialized into the efficient Arrow IPC binary format and saved to a temporary file on disk. This file acts as a cache, identified by a unique hash (&lt;code&gt;file_ref&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reference Passing&lt;/strong&gt;: The Worker informs the Core that the result is ready at the specified &lt;code&gt;file_ref&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Core Fetching&lt;/strong&gt;: If the Core (or subsequently, another Worker task) needs this data, it uses the &lt;code&gt;file_ref&lt;/code&gt; to access the cached Arrow file directly. This avoids sending large datasets over network sockets between processes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;UI Sampling&lt;/strong&gt;: For UI previews, the Core requests a small sample (e.g., the first 100 rows) from the Worker. The Worker reads just the sample from the Arrow IPC file and sends only that lightweight data back to the Core, which forwards it to the Designer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the Core might offload computation to the Worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core side - Initiating remote execution in the Worker
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;performance_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Offloads the execution of a node&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s LazyFrame to the Worker service.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a fetcher instance to manage communication with the Worker
&lt;/span&gt;    &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExternalDfFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;lf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The Polars LazyFrame plan
&lt;/span&gt;        &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Unique identifier for the result/cache
&lt;/span&gt;        &lt;span class="n"&gt;wait_on_completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Operate asynchronously
&lt;/span&gt;        &lt;span class="c1"&gt;# Pass other necessary context like flow_id, node_id, operation type...
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Store the fetcher to potentially retrieve results later
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_fetch_cached_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;

    &lt;span class="c1"&gt;# Request the Worker to start processing (this returns quickly)
&lt;/span&gt;    &lt;span class="c1"&gt;# The actual computation happens asynchronously in the Worker
&lt;/span&gt;    &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_processing_in_worker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Hypothetical method name
&lt;/span&gt;
    &lt;span class="c1"&gt;# If we immediately need the result (e.g., for a subsequent synchronous step):
&lt;/span&gt;    &lt;span class="c1"&gt;# lf = external_df_fetcher.get_result() # This would block until Worker is done
&lt;/span&gt;    &lt;span class="c1"&gt;# self.results.resulting_data = FlowfileTable(lf)
&lt;/span&gt;
    &lt;span class="c1"&gt;# For UI updates, request a sample separately
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_example_data_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Fetches sample async
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And how the Worker might manage the separate process execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Worker side - Managing computation in a separate process
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Serialized LazyFrame plan
&lt;/span&gt;    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Path for cached output (Arrow IPC file)
&lt;/span&gt;    &lt;span class="c1"&gt;# ... other args like operation type
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Launches a separate OS process to handle the heavy computation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Use multiprocessing context for safety
&lt;/span&gt;    &lt;span class="n"&gt;mp_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spawn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# or 'fork' depending on OS/needs
&lt;/span&gt;
    &lt;span class="c1"&gt;# Shared memory/queue for progress tracking and results/errors
&lt;/span&gt;    &lt;span class="n"&gt;progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Shared integer for progress %
&lt;/span&gt;    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Shared buffer for error messages
&lt;/span&gt;    &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# For potentially passing back results (or file ref)
&lt;/span&gt;
    &lt;span class="c1"&gt;# Define the target function and arguments for the new process
&lt;/span&gt;    &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;process_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The function that runs Polars .collect()/.sink()
&lt;/span&gt;        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;progress&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error_message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Where to save the Arrow IPC output
&lt;/span&gt;            &lt;span class="c1"&gt;# ... other necessary kwargs
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Launch the independent process
&lt;/span&gt;
    &lt;span class="c1"&gt;# Monitor the task (e.g., update status in a database, check progress)
&lt;/span&gt;    &lt;span class="nf"&gt;handle_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
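&lt;p&gt;The &lt;code&gt;process_task&lt;/code&gt; function handed to &lt;code&gt;Process(target=...)&lt;/code&gt; isn't shown above. Here's a hedged, stdlib-only sketch of the shape such a child-process entry point could take — the real version would deserialize the LazyFrame and sink its result to &lt;code&gt;file_path&lt;/code&gt; as Arrow IPC, which is stubbed out here:&lt;/p&gt;

```python
import multiprocessing


def process_task(polars_serializable_object, progress, error_message, queue, file_path):
    """Child-process entry point: execute the plan and report status back.

    Sketch only: a real implementation would deserialize the LazyFrame plan
    from polars_serializable_object and write Arrow IPC output to file_path.
    """
    try:
        # ... deserialize the plan and collect()/sink it here (stubbed) ...
        progress.value = 100              # completion signalled via shared int
        queue.put(file_path)              # hand the output file reference back
    except Exception as exc:
        msg = str(exc).encode()[:1024]    # truncate to the shared buffer size
        error_message[: len(msg)] = msg   # parent polls this for failures
```

&lt;p&gt;The parent side (the &lt;code&gt;handle_task&lt;/code&gt; call above) can then poll &lt;code&gt;progress.value&lt;/code&gt; and &lt;code&gt;error_message&lt;/code&gt; while the child runs, and read the output file reference from the queue once it appears.&lt;/p&gt;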



&lt;p&gt;This architecture ensures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Responsiveness&lt;/strong&gt;: The Core service remains light and responsive to UI interactions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Memory Isolation&lt;/strong&gt;: Heavy memory usage is contained within transient Worker processes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Efficiency&lt;/strong&gt;: Arrow IPC provides fast serialization/deserialization and efficient disk caching.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalability&lt;/strong&gt;: Datasets larger than any single service's memory can be handled by processing data in chunks or by relying on Polars' streaming capabilities where applicable.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  Conclusion: Combining Visual Simplicity with Polars' Power
&lt;/h2&gt;

&lt;p&gt;Flowfile tries to bridge the gap between the intuitive, visual workflow building familiar from tools like Alteryx or KNIME, and the raw performance and flexibility you get from modern data libraries like Polars.&lt;/p&gt;

&lt;p&gt;By architecting the system around Polars' lazy evaluation, providing real-time schema feedback, and employing a robust multi-process model with efficient Arrow IPC communication, Flowfile delivers a user-friendly experience without sacrificing the speed needed for demanding data tasks. It's designed as a powerful tool for data analysts who prefer visual interfaces and data engineers who need performance and control.&lt;/p&gt;

&lt;p&gt;Flowfile is intended as a performant, intuitive, and open-source solution for modern data transformation challenges.&lt;/p&gt;




&lt;p&gt;If you're interested in trying Flowfile, exploring its capabilities, or contributing to its development, check out the &lt;a href="https://github.com/edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;&lt;strong&gt;Flowfile GitHub repository&lt;/strong&gt;&lt;/a&gt;. Feedback and contributions are highly encouraged!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What are your favorite visual ETL tools, or how do you combine visual and code-based approaches in your data workflows? Share your thoughts in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>polars</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
