<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edwardvaneechoud</title>
    <description>The latest articles on Forem by Edwardvaneechoud (@edwardvaneechoud).</description>
    <link>https://forem.com/edwardvaneechoud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1565620%2F7ab3a32e-c02f-42ec-9cb1-3cf52c92432d.png</url>
      <title>Forem: Edwardvaneechoud</title>
      <link>https://forem.com/edwardvaneechoud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/edwardvaneechoud"/>
    <language>en</language>
    <item>
      <title>Flowfile v0.8.0 — Your Flows Can Run Themselves Now</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:15:21 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/flowfile-v080-your-flows-can-run-themselves-now-3im2</link>
      <guid>https://forem.com/edwardvaneechoud/flowfile-v080-your-flows-can-run-themselves-now-3im2</guid>
      <description>&lt;p&gt;&lt;em&gt;Full disclosure: I built Flowfile. This post is about a feature I just shipped and why I think it matters.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Up until this release, Flowfile was a tool you &lt;em&gt;designed&lt;/em&gt; pipelines in. You'd build a flow visually, maybe export the code, and then figure out how to actually run it on a schedule. That "figure out" part usually meant cron, Airflow, or a very optimistic shell script.&lt;/p&gt;

&lt;p&gt;v0.8.0 changes that. Flowfile now has a built-in scheduler. You can set an interval, trigger flows when data updates, or just hit "Run" from the catalog. No external orchestrator needed.&lt;/p&gt;

&lt;p&gt;I'm not going to pretend this competes with Airflow or Dagster for serious production workloads. It doesn't. But if you've ever had a lightweight data platform where you just needed a few flows to run on a timer without setting up a whole orchestration layer — this is that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtftip3w3sal022ot8hc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtftip3w3sal022ot8hc.png" alt="Schedules Overview showing scheduler status, summary cards, and schedule list with interval and table trigger types" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part I'm Most Excited About: Table Triggers
&lt;/h2&gt;

&lt;p&gt;Interval scheduling is useful but not interesting. "Run every 30 minutes" is solved. What I actually want to talk about is table triggers.&lt;/p&gt;

&lt;p&gt;The idea is simple: instead of running a flow on a timer, you run it when the data it depends on changes. A Catalog Writer node overwrites a table, and any flow watching that table starts automatically.&lt;/p&gt;

&lt;p&gt;In practice, this means you can chain flows together through data. Flow A produces a cleaned customer table. Flow B, which reads that table, runs as soon as Flow A finishes writing it. No coordination logic, no polling intervals to tune, no "let's just run everything at 3am and hope the order works out."&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Types, and Why
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Table trigger&lt;/strong&gt; — a flow runs when a single table is refreshed. The obvious case. Your flow reads &lt;code&gt;orders_raw&lt;/code&gt;, you set a trigger on &lt;code&gt;orders_raw&lt;/code&gt;, and the flow runs every time that table gets new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table set trigger&lt;/strong&gt; — a flow runs when &lt;em&gt;all&lt;/em&gt; tables in a set have been refreshed. This one would have saved me real time on past projects. Picture this: you have a reporting flow that joins three source tables — sales, inventory, and returns. You never want the report to run until all three are fresh. Without table set triggers, you solve this with retry logic, completion flags, or by scheduling everything sequentially and adding generous wait times. With table set triggers, you just list the three tables and the flow runs once all three have been updated. Declarative, no coordination code.&lt;/p&gt;
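&lt;p&gt;The firing rule is easy to sketch. This is a hypothetical illustration of the logic, not Flowfile's actual implementation: the trigger remembers which watched tables have been refreshed and fires once the set is complete.&lt;/p&gt;

```python
# Hypothetical sketch of table-set trigger logic (not Flowfile's code):
# fire once every watched table has been refreshed since the last run.

class TableSetTrigger:
    def __init__(self, flow_name, tables):
        self.flow_name = flow_name
        self.watched = set(tables)   # tables that must all be fresh
        self.refreshed = set()       # tables updated since the last fire

    def on_table_write(self, table_name):
        """Called whenever a Catalog Writer overwrites a table."""
        if table_name in self.watched:
            self.refreshed.add(table_name)
        if self.refreshed == self.watched:
            self.refreshed.clear()   # reset for the next cycle
            return True              # all inputs fresh: run the flow
        return False

trigger = TableSetTrigger("daily_report", ["sales", "inventory", "returns"])
trigger.on_table_write("sales")      # False: still waiting on two tables
trigger.on_table_write("inventory")  # False
trigger.on_table_write("returns")    # True: fire the flow
```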

&lt;p&gt;&lt;strong&gt;Unrestricted table trigger&lt;/strong&gt; — a flow can also trigger on a table it doesn't read. I debated whether to restrict triggers to only tables the flow actually depends on. Seemed cleaner. But I kept coming back to the same thought: I can't predict every use case people will have. Maybe you want a notification flow to fire whenever a table updates. Maybe you have a cleanup job that should run after any write to a particular namespace. I don't know. So I left it open.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Scheduler Works
&lt;/h2&gt;

&lt;p&gt;Three modes, depending on how you run Flowfile:&lt;/p&gt;

&lt;p&gt;If you're on the desktop app or &lt;code&gt;pip install&lt;/code&gt;, the scheduler runs inside the Flowfile process. Start and stop it from the Schedules tab. Simple.&lt;/p&gt;

&lt;p&gt;If you want it running independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flowfile run flowfile_scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a standalone background service that survives UI restarts. Only one scheduler instance runs at a time — there's an advisory database lock with a heartbeat. If a scheduler dies, another can take over after 90 seconds.&lt;/p&gt;
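&lt;p&gt;The lock-with-heartbeat pattern is worth spelling out. Here's a minimal in-memory sketch of the idea; the real scheduler keeps this state in the database, and the class and method names here are mine:&lt;/p&gt;

```python
import time

LEASE_SECONDS = 90  # takeover window from the post

class SchedulerLock:
    """Hypothetical in-memory sketch of an advisory lock with a heartbeat."""
    def __init__(self):
        self.owner = None
        self.last_heartbeat = 0.0

    def try_acquire(self, instance_id, now=None):
        now = time.time() if now is None else now
        expired = (now - self.last_heartbeat) > LEASE_SECONDS
        if self.owner is None or self.owner == instance_id or expired:
            self.owner = instance_id
            self.last_heartbeat = now
            return True
        return False  # another live scheduler holds the lock

    def heartbeat(self, instance_id, now=None):
        now = time.time() if now is None else now
        if self.owner == instance_id:
            self.last_heartbeat = now

lock = SchedulerLock()
lock.try_acquire("scheduler-a", now=0)    # True: lock is free
lock.try_acquire("scheduler-b", now=10)   # False: lease still live
lock.try_acquire("scheduler-b", now=200)  # True: lease expired, takeover
```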

&lt;p&gt;For Docker, add one environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;FLOWFILE_SCHEDULER_ENABLED=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I got right early: each flow can only have one active run at a time. If a flow is already running, new triggers are skipped. This prevents the cascade of overlapping runs that makes scheduling systems miserable to debug.&lt;/p&gt;
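&lt;p&gt;The guard itself is tiny. A hypothetical sketch of the rule, not Flowfile's code:&lt;/p&gt;

```python
# One active run per flow: a trigger is skipped if the flow already
# has a run in flight. (Illustrative sketch, not Flowfile's internals.)

active_runs = set()  # flow names currently executing

def try_start(flow_name):
    if flow_name in active_runs:
        return False   # skip the trigger: no overlapping runs
    active_runs.add(flow_name)
    return True

def finish(flow_name):
    active_runs.discard(flow_name)

try_start("orders_flow")  # True: starts
try_start("orders_flow")  # False: skipped, already running
finish("orders_flow")
try_start("orders_flow")  # True: previous run finished
```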

&lt;h2&gt;
  
  
  Running Flows Without a Schedule
&lt;/h2&gt;

&lt;p&gt;Not everything needs to be automated. Sometimes you just want to run a registered flow right now and see what happens.&lt;/p&gt;

&lt;p&gt;v0.8.0 adds a &lt;strong&gt;Run Flow&lt;/strong&gt; button directly in the catalog's flow detail panel. It spawns a subprocess, tracks the run in your history, and writes logs to &lt;code&gt;~/.flowfile/logs/&lt;/code&gt;. You can cancel a running flow at any time — it sends SIGTERM to the process.&lt;/p&gt;
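&lt;p&gt;The run-and-cancel semantics map directly onto the standard library. This sketch uses plain &lt;code&gt;subprocess&lt;/code&gt; rather than Flowfile's internal runner; &lt;code&gt;terminate()&lt;/code&gt; sends SIGTERM on POSIX, matching the cancel behaviour above:&lt;/p&gt;

```python
# Spawn a long-running child process, then cancel it with SIGTERM.
# (Stand-in for the real flow runner; the sleep stands in for a flow.)
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
)
proc.terminate()       # sends SIGTERM on POSIX
proc.wait(timeout=10)  # returncode is the negative signal number on POSIX
```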

&lt;p&gt;When flows are running, an active runs banner appears at the top of the catalog. You can see what's running, when it started, and cancel anything that's taking too long. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5s74axdy7dtqrfmw2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5s74axdy7dtqrfmw2x.png" alt="Run History showing scheduled, manual, and designer runs with status, duration, and node completion" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Being Honest About Scope
&lt;/h2&gt;

&lt;p&gt;This scheduler is not Airflow. It doesn't have retries with exponential backoff. It doesn't have DAG-level dependency management across dozens of flows. It doesn't have a distributed executor.&lt;/p&gt;

&lt;p&gt;What it does have: zero additional infrastructure. If you can &lt;code&gt;pip install flowfile&lt;/code&gt;, you have a scheduler. If you can run Docker Compose, you have a scheduler. For a team that needs three flows running on an interval and two more triggered by data updates, that's enough. For a lightweight data platform that's still figuring out what it needs, it's a proof of concept you can run before anyone decides whether Airflow is worth the setup.&lt;/p&gt;


&lt;p&gt;And if you outgrow it — that's fine. Flowfile still generates standalone Python code. The flows don't lock you in, and neither does the scheduler.&lt;/p&gt;

&lt;p&gt;Where this is heading: table triggers are a step, not the destination. The real goal is that you shouldn't be thinking about schedules at all. You should define how fresh your data needs to be, and the system figures out the rest. The catalog already tracks what each flow reads and writes, when tables were last updated, and how flows relate to each other. The pieces are there. Schedules are the scaffolding — data freshness is the building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;flowfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Edwardvaneechoud/Flowfile/releases" rel="noopener noreferrer"&gt;Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://edwardvaneechoud.github.io/Flowfile/users/visual-editor/schedules.html" rel="noopener noreferrer"&gt;Scheduling Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have questions or feedback, I'd genuinely love to hear it. Especially if you have a use case for table triggers I haven't thought of — that's exactly why I left them unrestricted.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Build Your Own DataFrame: a course based on an engine I probably shouldn't have written</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Fri, 20 Mar 2026 17:47:30 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/build-your-own-dataframe-a-course-based-on-an-engine-i-probably-shouldnt-have-written-2apd</link>
      <guid>https://forem.com/edwardvaneechoud/build-your-own-dataframe-a-course-based-on-an-engine-i-probably-shouldnt-have-written-2apd</guid>
      <description>&lt;p&gt;A few years ago, I needed a data processing engine for a visual ETL tool I was building — &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile&lt;/a&gt; — and against all sane practices, I just started writing one. Pure Python. &lt;code&gt;itertools.groupby&lt;/code&gt; for aggregation. &lt;code&gt;operator.itemgetter&lt;/code&gt; for column access, own type inference, manual memory optimization, custom everything.&lt;/p&gt;

&lt;p&gt;It handled joins, pivots, groupby, explode, filters — a working engine built entirely on the standard library. No numpy, no C extensions, no dependencies at all.&lt;/p&gt;

&lt;p&gt;Was this a good idea? Probably not the most efficient path. But it taught me something I couldn't have learned any other way: I understood exactly what a dataframe library does, because I'd built every piece of one myself.&lt;/p&gt;

&lt;p&gt;When I eventually migrated Flowfile's engine to Polars, the pure Python engine went into a drawer. That migration was driven by something I realized about focus: you can't do everything. Building a custom dataframe engine was a great way to learn, but it was a terrible way to ship a product. Flowfile needed things that actually mattered — the visual editor, the code generation, the user experience — not maintaining a homegrown query engine. Ironically, I'm now doing more for Flowfile than ever. But I believe (and hope, lol) they're the important things.&lt;/p&gt;

&lt;p&gt;Sometimes it's fun to look back at your old code — especially the code from before AI — and see how you were solving problems back then. The old engine was all me: no Copilot, no autocomplete suggestions, just reading docs and figuring it out. I kept thinking it would be fun to turn it into something people could learn from. So I cleaned it up, published it as &lt;a href="https://github.com/Edwardvaneechoud/pyfloe" rel="noopener noreferrer"&gt;pyfloe&lt;/a&gt;, and wrote a course around it — one where the why always comes before the how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with "how libraries work" content
&lt;/h2&gt;

&lt;p&gt;Most educational content about library internals falls into two buckets.&lt;/p&gt;

&lt;p&gt;The first is the conceptual walkthrough of production internals. "Let's look at how Polars implements joins!" And then you're reading about Rust iterators, SIMD intrinsics, and memory layouts — real concepts, but at a level of abstraction where it's hard to connect the ideas back to something you could build or modify yourself. You come away knowing the pieces exist without quite seeing how they fit together.&lt;/p&gt;

&lt;p&gt;The second is the toy example. Build a 50-line DataFrame class with &lt;code&gt;__init__&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, and &lt;code&gt;select&lt;/code&gt;. Wave your hands. "Real libraries do something like this, but more complicated." Clean mental model, almost no connection to how things actually work.&lt;/p&gt;

&lt;p&gt;Neither approach works for me. I've always learned the most from implementing things that needed to function — not reading about them, not building throwaway demos, but building something real enough that the architectural decisions actually matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two trees, one engine
&lt;/h2&gt;

&lt;p&gt;That's what pyfloe is. A lazy dataframe engine that implements simplified versions of the patterns behind real query engines — expression ASTs, the volcano execution model, hash joins, a rule-based optimizer — in about six files of readable Python.&lt;/p&gt;

&lt;p&gt;Here's what using it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyfloe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pf&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partition_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you know Polars, this looks familiar. That's intentional.&lt;/p&gt;

&lt;p&gt;The core insight — the thing that makes everything in pyfloe (and Polars, and Spark) click — is that the engine is built around two tree structures.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;plan tree&lt;/strong&gt; describes how data flows. Each node is an operation: scan a file, filter rows, join two tables, project columns. When you call &lt;code&gt;.collect()&lt;/code&gt;, the engine walks this tree from the root and pulls data upward.&lt;/p&gt;
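&lt;p&gt;That pull-based walk can be sketched with plain generators, each node asking its child for the next row. The node names here are illustrative, not pyfloe's exact classes:&lt;/p&gt;

```python
# Minimal volcano (pull) model: each plan node is a generator that
# pulls rows from its child on demand.
def scan(rows):
    for row in rows:
        yield row

def filter_node(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

data = [{"amount": 50, "id": 1}, {"amount": 150, "id": 2}]
plan = project(filter_node(scan(data), lambda r: r["amount"] > 100), ["id"])
print(list(plan))  # [{'id': 2}]
```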

&lt;p&gt;The &lt;strong&gt;expression tree&lt;/strong&gt; describes what to compute on each row. When you write &lt;code&gt;pf.col("amount") &amp;gt; 100&lt;/code&gt;, Python's &lt;code&gt;__gt__&lt;/code&gt; method doesn't compare anything — it builds a &lt;code&gt;BinaryExpr&lt;/code&gt; node with a &lt;code&gt;Col("amount")&lt;/code&gt; on the left and a &lt;code&gt;Lit(100)&lt;/code&gt; on the right.&lt;/p&gt;

&lt;p&gt;Here's the trick. You override &lt;code&gt;__gt__&lt;/code&gt; so it returns a new object instead of a boolean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__gt__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;BinaryExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_ensure_expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;col("amount") &amp;gt; 100&lt;/code&gt; doesn't evaluate — it builds a tree. And because that tree is data, you can walk it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# → {"price", "quantity"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two columns out of fifty. The optimizer now knows it can tell the scan node to skip the other forty-eight.&lt;/p&gt;

&lt;p&gt;Expressions live inside plan nodes. A &lt;code&gt;FilterNode&lt;/code&gt; doesn't hold a lambda — it holds an expression tree. And because that expression is inspectable, the optimizer can figure out which columns it needs and rewrite the plan to eliminate unnecessary work.&lt;/p&gt;
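&lt;p&gt;A minimal version of that inspection looks like this. The class shapes are a sketch of the pattern, not pyfloe's exact source:&lt;/p&gt;

```python
# Sketch of required_columns(): walk the expression tree and union
# the column names each node needs. (Illustrative, not pyfloe's code.)
import operator

class Col:
    def __init__(self, name):
        self.name = name
    def required_columns(self):
        return {self.name}
    def __gt__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.gt)
    def __mul__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.mul)

class Lit:
    def __init__(self, value):
        self.value = value
    def required_columns(self):
        return set()  # literals need no columns

class BinaryExpr:
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def required_columns(self):
        # union of whatever each side needs
        return self.left.required_columns() | self.right.required_columns()
    def __gt__(self, other):
        return BinaryExpr(self, _ensure_expr(other), operator.gt)

def _ensure_expr(x):
    return x if isinstance(x, (Col, Lit, BinaryExpr)) else Lit(x)

expr = (Col("price") * Col("quantity")) > 1000
print(expr.required_columns())  # {'price', 'quantity'}
```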

&lt;p&gt;This is the same general approach Polars uses — though Polars goes much further with cost-based optimization, parallelism, and columnar memory. The difference is that in pyfloe, you can open &lt;code&gt;plan.py&lt;/code&gt; and read the whole thing in twenty minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the course covers
&lt;/h2&gt;

&lt;p&gt;I wrote a free course around pyfloe: &lt;a href="https://edwardvaneechoud.github.io/pyfloe-tutorial/introduction/" rel="noopener noreferrer"&gt;Build Your Own DataFrame&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The whole thing runs on Pyodide — Python in the browser via WebAssembly. Every module has interactive code blocks where you write actual Python, run it, and see the results. When "trying it" is just a click, more people actually try it.&lt;/p&gt;

&lt;p&gt;Five modules, each building on the last. You start with eager vs. lazy execution and the volcano model. Then you hijack Python's dunder methods so that &lt;code&gt;col("x") + col("y")&lt;/code&gt; builds an inspectable tree instead of adding anything. Then you plug expressions into plan nodes, implement hash joins and aggregation, and end with the optimizer — filter pushdown and column pruning, where the plan tree and expression trees finally converge.&lt;/p&gt;

&lt;p&gt;You build simplified versions of each layer, then compare to the real pyfloe source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest boundaries
&lt;/h2&gt;

&lt;p&gt;If it needs saying: pyfloe is not competing with Polars on performance. It can't — Python is orders of magnitude slower for this kind of work. Polars is one of those libraries that changed the way people interact with data in Python, and it's at the forefront of how a dataframe API should work. pyfloe implements the same structures in pure Python — so you can read them, break them, and understand them.&lt;/p&gt;

&lt;p&gt;Where it makes sense, we simplify. The optimizer is rule-based, not cost-based. Real engines like Polars estimate row counts and pick join strategies dynamically. pyfloe applies fixed rules — push filters down, prune unused columns — which is enough to show you &lt;em&gt;why&lt;/em&gt; optimizers exist and how they rewrite plan trees.&lt;/p&gt;
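&lt;p&gt;One such rule, sketched over hypothetical dict-shaped plan nodes (not pyfloe's real classes): if every column a filter needs survives the projection below it, swap the two so the filter runs first and fewer rows flow through the projection.&lt;/p&gt;

```python
# Rule-based filter pushdown over a toy plan tree of dicts.
def push_filters_down(node):
    if node["op"] == "filter" and node["child"]["op"] == "project":
        project = node["child"]
        if node["needs"].issubset(set(project["columns"])):
            # rewrite filter(project(x)) into project(filter(x))
            new_filter = dict(node, child=project["child"])
            return dict(project, child=push_filters_down(new_filter))
    if "child" in node:
        return dict(node, child=push_filters_down(node["child"]))
    return node

plan = {
    "op": "filter", "needs": {"amount"},
    "child": {
        "op": "project", "columns": ["amount", "region"],
        "child": {"op": "scan", "source": "orders.csv"},
    },
}
optimized = push_filters_down(plan)
print(optimized["op"])           # 'project'
print(optimized["child"]["op"])  # 'filter'
```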

&lt;p&gt;The course also doesn't try to cover everything. Streaming I/O is in a bonus section, not the main path. Five modules, focused on the core architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters to me
&lt;/h2&gt;

&lt;p&gt;I like understanding how things work. That's why I built the engine from scratch in the first place, and it's why I turned it into a course instead of just deleting it. Flowfile ended up being the tool I actually use — a visual ETL platform where you can build pipelines, inspect data at every step, and export clean Polars code. pyfloe is the original engine that made Flowfile possible, stripped down to where you can see every moving part.&lt;/p&gt;

&lt;p&gt;If you're the kind of person who wants to know what a dataframe library actually does when you call &lt;code&gt;.filter()&lt;/code&gt; or &lt;code&gt;.group_by()&lt;/code&gt;, the course is free and the code is all Python you can read. That's the whole pitch.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Course:&lt;/strong&gt; &lt;a href="https://edwardvaneechoud.github.io/pyfloe-tutorial/introduction/" rel="noopener noreferrer"&gt;Build Your Own DataFrame&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Library:&lt;/strong&gt; &lt;a href="https://github.com/Edwardvaneechoud/pyfloe" rel="noopener noreferrer"&gt;pyfloe on GitHub&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;The tool that started it:&lt;/strong&gt; &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>learning</category>
      <category>dataframe</category>
    </item>
    <item>
      <title>Stop Drawing ETL Diagrams — Your Python Code Visualizes Itself</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Thu, 29 May 2025 16:26:48 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/stop-drawing-etl-diagrams-your-python-code-visualizes-itself-16gn</link>
      <guid>https://forem.com/edwardvaneechoud/stop-drawing-etl-diagrams-your-python-code-visualizes-itself-16gn</guid>
      <description>&lt;p&gt;Ever wished you could write Python code and get the clarity of a visual data flow? That's exactly what Flowfile offers with FlowFrame — a Polars-like API that silently builds a visual ETL graph as you code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem we're solving
&lt;/h2&gt;

&lt;p&gt;As data engineers and scientists, we often find ourselves in one of two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing complex Python code that's powerful but hard to explain&lt;/li&gt;
&lt;li&gt;Drawing diagrams that are clear but disconnected from actual implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if your code could be both powerful AND self-documenting?&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter FlowFrame: code that visualizes itself
&lt;/h2&gt;

&lt;p&gt;FlowFrame, part of the open-source &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile project&lt;/a&gt;, bridges this gap by combining the precision of Python coding with the clarity of visual ETL pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key benefits:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write familiar, Polars-style Python code&lt;/li&gt;
&lt;li&gt;Automatically generate visual pipelines you can view, edit, and share&lt;/li&gt;
&lt;li&gt;Seamlessly switch between code and visual interfaces&lt;/li&gt;
&lt;li&gt;Full LazyFrame API support (new in v0.3.3!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install Flowfile with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;flowfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  See it in action
&lt;/h2&gt;

&lt;p&gt;Let's explore with a practical example using a sales dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;open_graph_in_editor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;

&lt;span class="c1"&gt;# Read the sales data
&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files/Superstore Sales Dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Helper function to clean column names
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;correct_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Column names to lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sales_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;correct_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Advanced transformations with the FlowFrame API
&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sales_clean&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate shipping time in days
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ship_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;total_days&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract year from order date for trend analysis
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Create sales tiers based on order value
&lt;/span&gt;        &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Clean up state codes (handle nulls)
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state_clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create order characteristics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove invalid shipping times&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_shipping_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate customer concentration ratio
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_concentration_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create nice readable percentages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Filter on significant segments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The magic happens here — visualize your pipeline!
&lt;/span&gt;&lt;span class="nf"&gt;open_graph_in_editor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flow_graph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launches the Flowfile Designer, showing your entire pipeline visually. Each node represents an operation, allowing you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualize data flow&lt;/li&gt;
&lt;li&gt;Debug by inspecting results at each step&lt;/li&gt;
&lt;li&gt;See data previews&lt;/li&gt;
&lt;li&gt;Make visual edits that sync back to code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpptf8bwd08wl820nkeap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpptf8bwd08wl820nkeap.png" alt="Flowfile Designer showing a visual ETL pipeline with connected nodes representing data transformations, filters, and aggregations from the sales analysis example" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in v0.3.3: full Polars LazyFrame support
&lt;/h2&gt;

&lt;p&gt;The latest release brings massive improvements to the FlowFrame API:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Type safety
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now with full type hints and autocompletion!
&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_concentration_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cum_sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# IDE knows all available methods
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Dynamic expression methods
&lt;/h3&gt;

&lt;p&gt;All Polars expression methods are now available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# String operations  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Datetime operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Statistical operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly_avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
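&lt;p&gt;For intuition on the last example: &lt;code&gt;rolling_mean(7)&lt;/code&gt; averages each value with its six predecessors, and positions without a full window come back null. A plain-Python sketch of those semantics (an illustration, not the Flowfile or Polars implementation):&lt;/p&gt;

```python
def rolling_mean(values, window):
    # Trailing window: position i averages the last `window` values ending at i;
    # positions without a full window yield None (like the leading nulls Polars
    # emits by default)
    out = []
    for i in range(len(values)):
        if i + 1 >= window:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
        else:
            out.append(None)
    return out

print(rolling_mean([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```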



&lt;h3&gt;
  
  
  3. User-defined functions support
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Custom functions are now tracked in the graph!
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;custom_transform&lt;/span&gt; 

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_transform&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhomqggur4k0drj5p21qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhomqggur4k0drj5p21qz.png" alt="Flowfile Designer interface displaying a custom function node integrated within the visual pipeline, demonstrating how user-defined functions appear as trackable components in the flow graph" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Explaining complex transformations
&lt;/h3&gt;

&lt;p&gt;Ever tried explaining a complex data pipeline to non-technical stakeholders? With FlowFrame, build in Python, then share a visual flow everyone understands.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Debugging made visual
&lt;/h3&gt;

&lt;p&gt;When something doesn't look right, seeing the transformation flow and inspecting data at each step makes troubleshooting much faster than combing through lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Team collaboration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Data scientists can start in code
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_complex_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quarterly_analysis.flowfile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Analysts can continue visually in Flowfile Designer
# No code required!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Flowfile is constantly evolving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tighter code-visual mapping&lt;/strong&gt;: Every transformation accurately reflected in both directions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual-to-code generation&lt;/strong&gt;: Generate clean Python code from visual designs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific workflows&lt;/strong&gt;: Specialized support for ML pipelines, time series analysis, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud integrations&lt;/strong&gt;: Direct connections to cloud data warehouses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;FlowFrame bridges the gap between code and visual ETL tools, offering a powerful way to build data pipelines that are both efficient and understandable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your turn! Install and try it:
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;flowfile&lt;/span&gt;

&lt;span class="c1"&gt;# Then run the example above or try your own data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project is fully open-source on GitHub. We'd love your feedback, contributions, or ideas for new features!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming data with FastAPI &amp; Vue made easy</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Fri, 25 Apr 2025 17:36:51 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/streaming-data-with-fastapi-vue-made-easy-2926</link>
      <guid>https://forem.com/edwardvaneechoud/streaming-data-with-fastapi-vue-made-easy-2926</guid>
      <description>&lt;h2&gt;
  
  
  Streaming data with FastAPI &amp;amp; Vue made easy
&lt;/h2&gt;

&lt;p&gt;Want effortless live updates — stock prices, notifications, logs — in your web app without complex setup? Server-Sent Events (SSE) simplify this using standard HTTP. They provide an efficient server-to-client push mechanism, hitting a sweet spot between clunky polling and potentially overkill WebSockets for one-way data. This guide demonstrates how to implement SSE with FastAPI and Vue.js for responsive results.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events (SSE) provide a simple way to stream real-time data from your server to web clients using standard HTTP. This guide demonstrates a minimal implementation with FastAPI and Vue.js, showing system RAM usage monitored in real-time. Get the code and try it yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Edwardvaneechoud/stream-logs-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;stream-logs-demo
git checkout feature/simple_example
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn psutil
python stream_logs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser to &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt; and click "Start Monitoring RAM" to see it in action. And if you want all the details… read on!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtcz3lxxxkoq5k255q4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtcz3lxxxkoq5k255q4x.png" alt="RAM monitor demo" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Server-Sent Events (SSE)?
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events (SSE) is a standard web technology designed for exactly this kind of one-way push: a simple, efficient way to stream data over HTTP. SSE keeps a single HTTP connection open between client and server after the initial request, letting the server push updates whenever new information is available.&lt;/p&gt;

&lt;p&gt;Compared to alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Polling&lt;/strong&gt;: With polling, the client repeatedly asks the server for updates at regular intervals. This can lead to unnecessary network traffic and delayed data. SSE is more efficient because the server pushes updates only when there's new data — no need for the client to keep checking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSockets&lt;/strong&gt;: WebSockets are built for full-duplex communication — both the client and server can send messages independently at any time. This makes them ideal for real-time, interactive applications like live chat, multiplayer games, or collaborative tools where two-way communication is essential. But if your use case only requires the server to push updates — for example, live logs, real-time monitoring, or status feeds — WebSockets can be unnecessarily complex. That's where SSE shines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because SSE runs over regular HTTP, there's no need to introduce a new protocol — it just works like any other HTTP request. That means you can plug it directly into most existing setups with zero friction. If your app already talks over HTTP, you're ready to stream.&lt;/p&gt;
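&lt;p&gt;The wire format itself is just text: each message is one or more &lt;code&gt;data:&lt;/code&gt; lines terminated by a blank line. Framing a message is plain string code, sketched here independently of FastAPI:&lt;/p&gt;

```python
def sse_frame(payload: str) -> str:
    # One SSE message: prefix each line with "data: ", end with a blank line
    lines = payload.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"

print(sse_frame("ram: 42%"))
```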

&lt;h2&gt;
  
  
  Implementing SSE in FastAPI
&lt;/h2&gt;

&lt;p&gt;In FastAPI, the implementation relies on &lt;code&gt;StreamingResponse&lt;/code&gt; combined with an asynchronous generator. Here's the core of &lt;code&gt;stream_logs.py&lt;/code&gt; that sets up the SSE endpoint (the full file also includes app setup, static file serving, and so on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From stream_logs.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="c1"&gt;# Assume 'app = FastAPI()' is defined elsewhere
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. SSE Message Formatter
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Ensures data adheres to the 'data: ...\n\n' SSE format
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Asynchronous Data Generator
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_logs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Initial log entry
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting system RAM monitoring...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Send 20 RAM usage log entries
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Get current timestamp
&lt;/span&gt;        &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;localtime&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Get RAM usage
&lt;/span&gt;        &lt;span class="n"&gt;ram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;virtual_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ram_percent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percent&lt;/span&gt;
        &lt;span class="n"&gt;ram_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert to GB
&lt;/span&gt;        &lt;span class="n"&gt;ram_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ram_total&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ram_percent&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;

        &lt;span class="c1"&gt;# Format log entry
&lt;/span&gt;        &lt;span class="n"&gt;log_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - RAM usage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_percent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_used&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ram_total&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# One second delay between logs
&lt;/span&gt;
    &lt;span class="c1"&gt;# Final log entry
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;format_sse_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAM monitoring completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. FastAPI Endpoint using StreamingResponse
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_logs_endpoint&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Return the StreamingResponse, feeding it the generator
&lt;/span&gt;    &lt;span class="c1"&gt;# and setting the essential media_type for SSE
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;stream_logs&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# The generator provides the data stream
&lt;/span&gt;        &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Identifies the stream as SSE
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;format_sse_message&lt;/code&gt;&lt;/strong&gt;: This utility function ensures each piece of data sent conforms to the SSE specification. The &lt;code&gt;data:&lt;/code&gt; prefix followed by the message payload and two newline characters (&lt;code&gt;\n\n&lt;/code&gt;) is the standard way to delimit messages in an event stream. Using &lt;code&gt;json.dumps&lt;/code&gt; ensures the data is reliably encoded as a JSON string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;stream_logs&lt;/code&gt;&lt;/strong&gt;: This async function is a generator because it uses the &lt;code&gt;yield&lt;/code&gt; keyword. Instead of running to completion and returning a single result, it can pause its execution state with &lt;code&gt;yield&lt;/code&gt;, sending back the specified value (here, the formatted message). When iterated over (by &lt;code&gt;StreamingResponse&lt;/code&gt;), its execution resumes from where it left off until the next yield or the function completes. The &lt;code&gt;await asyncio.sleep(1)&lt;/code&gt; simulates an asynchronous operation, like waiting for new data to become available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;@app.get("/api/stream")&lt;/code&gt;&lt;/strong&gt;: This decorator defines the HTTP GET endpoint that clients will connect to initiate the SSE stream.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;StreamingResponse(...)&lt;/code&gt;&lt;/strong&gt;: This FastAPI class is key for sending data incrementally. It's initialized with the result of calling the generator (&lt;code&gt;stream_logs()&lt;/code&gt;), which gives it an iterator object. &lt;code&gt;StreamingResponse&lt;/code&gt; pulls values from this iterator (triggering the generator's execution up to the next yield) and sends each value immediately over the network connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;media_type="text/event-stream"&lt;/code&gt;&lt;/strong&gt;: Setting this specific IANA media type in the &lt;code&gt;StreamingResponse&lt;/code&gt; is crucial. It sends the correct Content-Type header, signaling to the browser that this response is a Server-Sent Event stream, enabling the browser to use its built-in EventSource API to listen for and process the incoming messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
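&lt;p&gt;To make the wire format concrete, here is a small standalone sketch of round-tripping one SSE message. The &lt;code&gt;parse_sse_message&lt;/code&gt; helper is hypothetical, added only to illustrate what a client has to undo:&lt;/p&gt;

```python
import json

def format_sse_message(data: str) -> str:
    # Each SSE message is a "data:" line terminated by a blank line
    return f"data: {json.dumps(data)}\n\n"

def parse_sse_message(raw: str) -> str:
    # Hypothetical helper: strip the "data: " prefix and decode the JSON payload
    payload = raw.strip()[len("data: "):]
    return json.loads(payload)

message = format_sse_message("Starting system RAM monitoring...")
decoded = parse_sse_message(message)
```

&lt;p&gt;That decoding step is exactly what &lt;code&gt;JSON.parse(event.data)&lt;/code&gt; does on the browser side.&lt;/p&gt;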

&lt;p&gt;Okay, now that we have the backend sending the stream, let's look at how the frontend receives and displays these updates using Vue.js and the browser's built-in EventSource API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend: Receiving Events with EventSource
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend Setup Note&lt;/strong&gt;: For this demonstration, the frontend is kept intentionally simple. We use a single &lt;code&gt;index.html&lt;/code&gt; file which includes Vue.js directly from a CDN. Basic styling is provided in &lt;code&gt;style.css&lt;/code&gt;. The core logic for connecting to the SSE stream and updating the display resides in &lt;code&gt;app.js&lt;/code&gt;. You can explore the full frontend code in the repository.&lt;/p&gt;

&lt;p&gt;On the client side, the browser's built-in EventSource API is used to connect to the SSE stream and process incoming messages. While the full &lt;code&gt;static/app.js&lt;/code&gt; example uses Vue.js for display, let's isolate the essential EventSource interactions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Set Up the Reactive State
&lt;/h3&gt;

&lt;p&gt;Before we initiate any streaming, we define our reactive state and prepare variables in the &lt;code&gt;setup()&lt;/code&gt; function. This is where Vue's Composition API comes into play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From static/app.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logLines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;([]);&lt;/span&gt;        &lt;span class="c1"&gt;// Reactive array bound to the template&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isStreaming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Tracks whether we're currently streaming&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Will hold the SSE connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;logLines&lt;/code&gt; is a reactive array. Each time we push new data into it, Vue will automatically update the DOM—no manual re-rendering needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isStreaming&lt;/code&gt; tracks whether a stream is currently active. It's used to toggle UI state (like disabling the button while data is being streamed).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eventSource&lt;/code&gt; is declared here so it can be shared across functions (&lt;code&gt;startStreaming&lt;/code&gt;, &lt;code&gt;closeEventSource&lt;/code&gt;). It's initially null and gets initialized when we start the stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Define the Start Streaming Logic
&lt;/h3&gt;

&lt;p&gt;This function runs when the user clicks the "Start Monitoring RAM" button. It clears existing logs, updates state, and opens the SSE connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startStreaming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;logLines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;           &lt;span class="c1"&gt;// Clear previous logs&lt;/span&gt;
  &lt;span class="nx"&gt;isStreaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// Update UI state&lt;/span&gt;
  &lt;span class="c1"&gt;// Create the SSE connection&lt;/span&gt;
  &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/stream&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Set up message and error handlers (see next section)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;: This function sets up the streaming session. Creating the &lt;code&gt;EventSource&lt;/code&gt; immediately starts connecting to the endpoint; the handlers we attach next determine how incoming events are processed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Handle Incoming Events (onmessage)
&lt;/h3&gt;

&lt;p&gt;The logic for handling messages from the server is defined inside the &lt;code&gt;startStreaming&lt;/code&gt; function, immediately after the SSE connection is created. Here's what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Still inside startStreaming()&lt;/span&gt;

&lt;span class="nx"&gt;eventSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;logLines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Add new log entry to the reactive array&lt;/span&gt;

    &lt;span class="c1"&gt;// If the server signals that monitoring has completed, stop streaming&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;RAM monitoring completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;closeEventSource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// cleanup the connection and reset the UI components&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error parsing data:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;eventSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EventSource error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;closeEventSource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;: Once the server begins sending messages, &lt;code&gt;onmessage&lt;/code&gt; is triggered for each new piece of data. We parse the event, push the result to &lt;code&gt;logLines&lt;/code&gt;, and let Vue's reactivity do the rest—your UI updates automatically. If the message is "RAM monitoring completed", we assume the stream is done and clean up by calling &lt;code&gt;closeEventSource()&lt;/code&gt;. We also define &lt;code&gt;onerror&lt;/code&gt; here to catch any disconnections or issues. This ensures we always close the stream gracefully and reset the UI state if something goes wrong.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Reminder&lt;/strong&gt;: This entire block lives inside the &lt;code&gt;startStreaming()&lt;/code&gt; function, which is returned from &lt;code&gt;setup()&lt;/code&gt; and triggered when the user clicks the "Start Monitoring RAM" button.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Rendering the Data in HTML
&lt;/h3&gt;

&lt;p&gt;Finally, we connect the reactive Vue properties directly to our HTML using standard Vue template syntax. Take a look at the relevant snippet from &lt;code&gt;index.html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="na"&gt;click=&lt;/span&gt;&lt;span class="s"&gt;"startStreaming"&lt;/span&gt; &lt;span class="na"&gt;:disabled=&lt;/span&gt;&lt;span class="s"&gt;"isStreaming"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  Start Monitoring RAM
&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"log-container"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;v-for=&lt;/span&gt;&lt;span class="s"&gt;"(line, index) in logLines"&lt;/span&gt; &lt;span class="na"&gt;:key=&lt;/span&gt;&lt;span class="s"&gt;"index"&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"log-line"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    {{ line }}
  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;@click&lt;/code&gt; directive on the button binds the &lt;code&gt;startStreaming&lt;/code&gt; function to the click event.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;:disabled="isStreaming"&lt;/code&gt; disables the button while the stream is active.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;v-for&lt;/code&gt; loop dynamically renders each message in &lt;code&gt;logLines&lt;/code&gt; as a new row in the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📁 You can view the complete HTML structure in &lt;code&gt;index.html&lt;/code&gt; in the repo if you'd like to see how the full layout is styled and structured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Demo Locally
&lt;/h2&gt;

&lt;p&gt;Now that we've walked through the backend setup and the frontend logic, it's time to see the demo in action.&lt;/p&gt;

&lt;p&gt;Clone the repository and move into the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Edwardvaneechoud/stream-logs-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;stream-logs-demo
git checkout feature/simple_example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the required dependencies and start the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn psutil
python stream_logs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open your browser and navigate to &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. Click the "Start Monitoring RAM" button and watch the RAM usage update in real time. If you check the terminal where FastAPI is running, you'll see only a single request in the logs (GET /api/stream) for the entire stream, yet the page keeps updating.&lt;/p&gt;
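&lt;p&gt;If you want to check the endpoint without a browser, the stream can also be consumed from Python. This is a minimal sketch using only the standard library; the &lt;code&gt;read_sse_events&lt;/code&gt; helper is something I'm adding here for illustration, it is not part of the demo repo:&lt;/p&gt;

```python
import json
import urllib.request

def read_sse_events(stream):
    # Yield decoded payloads from any iterable of SSE-formatted byte lines
    for raw_line in stream:
        line = raw_line.decode("utf-8").strip()
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# Example usage against the demo server (assumes it is running on localhost:8000):
# with urllib.request.urlopen("http://localhost:8000/api/stream") as response:
#     for event in read_sse_events(response):
#         print(event)
```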

&lt;p&gt;You can explore the complete implementation of this demo at &lt;a href="https://github.com/Edwardvaneechoud/stream-logs-demo/tree/feature/simple_example" rel="noopener noreferrer"&gt;github.com/Edwardvaneechoud/stream-logs-demo/tree/feature/simple_example&lt;/a&gt; or check out the more advanced example at &lt;a href="https://github.com/Edwardvaneechoud/stream-logs-demo/" rel="noopener noreferrer"&gt;github.com/Edwardvaneechoud/stream-logs-demo/&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay Tuned!
&lt;/h2&gt;

&lt;p&gt;This simple example demonstrates the core principles of Server-Sent Events. In my next post, I'll show how you can use SSE to stream real-time log information directly from your Python applications by implementing a custom logging handler. You'll learn how to capture and stream logs from any Python process without writing to files, making it easier to monitor and debug your applications in real-time. You'll see exactly how these principles scale to monitor complex data flows without polling.&lt;/p&gt;
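&lt;p&gt;As a teaser, the rough idea behind such a handler looks something like this. This is a minimal sketch of the general technique, not the implementation from the upcoming post, and the &lt;code&gt;QueueLogHandler&lt;/code&gt; name is my own:&lt;/p&gt;

```python
import logging
import queue

class QueueLogHandler(logging.Handler):
    # Push formatted log records onto an in-memory queue instead of a file;
    # an SSE generator can then drain the queue and yield entries to clients.
    def __init__(self, log_queue: queue.Queue):
        super().__init__()
        self.log_queue = log_queue

    def emit(self, record: logging.LogRecord) -> None:
        self.log_queue.put(self.format(record))

log_queue = queue.Queue()
logger = logging.getLogger("pipeline-demo")
logger.setLevel(logging.INFO)
logger.addHandler(QueueLogHandler(log_queue))

logger.info("pipeline started")
captured = log_queue.get_nowait()  # "pipeline started"
```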

&lt;p&gt;If you found this helpful, follow me to catch the upcoming deep-dive where I'll break down the monitoring implementation from the main branch, complete with code examples and performance insights from a real production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extend the monitor to track CPU usage and disk I/O&lt;/li&gt;
&lt;li&gt;Build a real-time dashboard with multiple system metrics&lt;/li&gt;
&lt;li&gt;Handle different log levels (warning, error, info, debug)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy streaming!&lt;/p&gt;

&lt;p&gt;If you're interested in seeing Server-Sent Events applied in a more complex, real-world scenario, you can explore my personal project, Flowfile. I utilize SSE extensively in that project, specifically for efficiently streaming live logs from running data pipeline processes directly into the UI console, providing immediate feedback much like the examples discussed here. You can check out the implementation and the project on GitHub: &lt;a href="https://github.com/Edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;https://github.com/Edwardvaneechoud/Flowfile&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>vue</category>
      <category>python</category>
      <category>sse</category>
    </item>
    <item>
      <title>Building Flowfile: Architecting a Visual ETL Tool with Polars</title>
      <dc:creator>Edwardvaneechoud</dc:creator>
      <pubDate>Sat, 19 Apr 2025 13:32:40 +0000</pubDate>
      <link>https://forem.com/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c</link>
      <guid>https://forem.com/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;Flowfile&lt;/a&gt; is my open-source project, born from trying to build an Alteryx alternative that combined two things I like: the straightforward visual style of drag-and-drop tools, and the challenge plus creative freedom you get coding with powerful libraries like &lt;a href="https://pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt;. Building it taught me a &lt;em&gt;ton&lt;/em&gt;, particularly about getting the architecture right. Since it was such an educational ride, I wanted to break down some of the technical bits here.&lt;/p&gt;

&lt;p&gt;So, for my first article here on dev.to, I'm excited to do just that – dive into those technical bits! This piece goes behind the scenes of Flowfile's architecture to reveal how data actually flows through the system, from the moment you drag a node onto the canvas right through to final execution. You'll learn about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How nodes and connections form a directed acyclic graph (DAG)&lt;/li&gt;
&lt;li&gt;The schema prediction system that provides real-time feedback without execution&lt;/li&gt;
&lt;li&gt;How execution modes leverage Polars' lazy evaluation for performance&lt;/li&gt;
&lt;li&gt;The worker-based execution model that optimizes memory usage&lt;/li&gt;
&lt;li&gt;Efficient inter-process communication using Apache Arrow IPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding these components, you'll gain insight into the design decisions made within Flowfile to balance user experience with performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: Three interconnected services
&lt;/h2&gt;

&lt;p&gt;Flowfile operates as three separate but interconnected components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Designer&lt;/strong&gt; (Electron + Vue): The visual interface users interact with.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Core&lt;/strong&gt; (FastAPI): The ETL engine that manages workflows, schema predictions, and orchestrates data transformations.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Worker&lt;/strong&gt; (FastAPI): Handles heavy computations and data caching.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcqccd7c28o89wcfzlvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcqccd7c28o89wcfzlvj.png" alt="Flowfile's architecture" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through what happens at each stage of building and executing a data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a data flow: What happens when you drag a node?
&lt;/h2&gt;

&lt;p&gt;When you drag a node onto the canvas in the Designer, Flowfile initiates a series of coordinated actions between the front-end and back-end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Node creation&lt;/strong&gt;: The Designer sends a request to the Core service to create a new node of the specified type.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Graph update&lt;/strong&gt;: The Core service registers this node by creating a &lt;code&gt;NodeStep&lt;/code&gt; instance in the workflow graph with placeholder functionality.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Connection setup&lt;/strong&gt;: As the user connects the new node to existing nodes, the Designer communicates these connections to the Core service.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Input schema inference&lt;/strong&gt;: Based on the upstream nodes, the Core service infers the input data schema and sends it back to the Designer, allowing the user to immediately see what data is available for the selected action.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Node configuration&lt;/strong&gt;: The user configures the node’s settings in the Designer (e.g., selecting columns, specifying transformations).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Schema prediction&lt;/strong&gt;: These settings are sent to the Core service, which uses them to calculate the predicted output schema.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Schema feedback&lt;/strong&gt;: The predicted schema is returned to the Designer and displayed in the interface, giving users real-time feedback on how their data will be shaped—before any execution occurs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Iteration&lt;/strong&gt;: When the user adds the next node, the cycle starts again from step 1, allowing complex workflows to be built step-by-step with continuous feedback.&lt;/li&gt;
&lt;/ol&gt;
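&lt;p&gt;Conceptually, the feedback loop in the steps above boils down to propagating predicted schemas through the graph, node by node. Here is a deliberately simplified, hypothetical sketch of that idea; none of these names are Flowfile's actual API:&lt;/p&gt;

```python
# Hypothetical sketch: propagate predicted schemas through a chain of nodes
def predict_output_schema(node, input_schema):
    # e.g. a "select columns" node keeps only its configured columns
    if node["type"] == "select":
        return [col for col in input_schema if col in node["columns"]]
    return input_schema  # pass-through for nodes that don't reshape data

def propagate(nodes, source_schema):
    # Walk nodes in topological order, feeding each its upstream schema
    schema = source_schema
    schemas = {}
    for node in nodes:
        schema = predict_output_schema(node, schema)
        schemas[node["id"]] = schema
    return schemas

pipeline = [
    {"id": "read", "type": "read"},
    {"id": "select", "type": "select", "columns": ["city", "sales"]},
]
predicted = propagate(pipeline, ["city", "sales", "region"])
```

&lt;p&gt;No data is ever touched: every node only needs its configuration and the schema of the node before it.&lt;/p&gt;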

&lt;p&gt;The constant interaction between the Core service and the Designer reduces the chance of user error and makes results more predictable. This schema prediction is one of Flowfile's most powerful features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core schema prediction function (simplified)
# Determines output structure without executing transformations
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_predicted_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the predicted schema for this node without executing the transformation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema_callback&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Use custom schema prediction if available
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Otherwise, infer from lazy evaluation by getting the schema from the plan
&lt;/span&gt;        &lt;span class="n"&gt;predicted_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_predicted_data_getter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Gets the Polars LazyFrame plan
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="c1"&gt;# Polars can get schema from a LazyFrame
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema prediction: How Flowfile knows your data structure in advance
&lt;/h2&gt;

&lt;p&gt;Schema prediction occurs through one of two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Schema Callbacks&lt;/strong&gt;: Custom functions, defined per node type, that calculate the expected output schema based on node configuration and input schemas.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Utilizing Polars' ability to determine the schema of a planned transformation (&lt;code&gt;LazyFrame&lt;/code&gt;) without processing the full dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach provides immediate feedback to users about the structure of their data pipeline without requiring expensive computation.&lt;/p&gt;

&lt;p&gt;For example, when you set up a "Group By" operation, a schema callback can tell you exactly what the output columns will be based on your aggregation choices, before processing any data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of a schema callback for a Group By node
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;schema_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate the output schema for a group by operation based on aggregation choices.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract column info from settings (simplified representation)
&lt;/span&gt;    &lt;span class="n"&gt;output_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;group_by_settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg_cols&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Get input schema info from the upstream node ('depends_on')
&lt;/span&gt;    &lt;span class="n"&gt;input_schema_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;depends_on&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Construct the output schema based on settings and input types
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Infer data type if not explicitly set by the aggregation
&lt;/span&gt;        &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_schema_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;
        &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FlowfileColumn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output_schema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The directed acyclic graph (DAG): The foundation of workflows
&lt;/h2&gt;

&lt;p&gt;As you add and connect nodes, Flowfile builds a Directed Acyclic Graph (DAG) where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt; represent data operations (read file, filter, join, write to database, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt; represent the flow of data between operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DAG is managed by the &lt;code&gt;EtlGraph&lt;/code&gt; class in the Core service, which orchestrates the entire workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EtlGraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Manages the ETL workflow as a DAG. Stores nodes, dependencies,
    and settings, and handles the execution order.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;_node_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Internal storage for all node steps
&lt;/span&gt;    &lt;span class="n"&gt;_flow_starts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;               &lt;span class="c1"&gt;# Nodes that initiate data flow (e.g., readers)
&lt;/span&gt;    &lt;span class="n"&gt;_node_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;           &lt;span class="c1"&gt;# Tracking node identifiers
&lt;/span&gt;    &lt;span class="n"&gt;flow_settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FlowSettings&lt;/span&gt;        &lt;span class="c1"&gt;# Global configuration for the flow
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_node_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Adds a new processing node (NodeStep) to the graph.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;node_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_node_db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_step&lt;/span&gt;
        &lt;span class="c1"&gt;# Additional logic to manage dependencies and flow starts...
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RunInformation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Executes the entire flow in the correct topological order.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;execution_order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Determine correct sequence
&lt;/span&gt;        &lt;span class="n"&gt;run_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunInformation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;execution_order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Execute node based on mode (Development/Performance)
&lt;/span&gt;            &lt;span class="n"&gt;node_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_node&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Simplified representation
&lt;/span&gt;            &lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;run_info&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determines the correct order to execute nodes based on dependencies.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Standard DAG topological sort algorithm...
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;NodeStep&lt;/code&gt; in the graph encapsulates information about its dependencies, transformation logic, and output schema. This structure allows Flowfile to determine execution order, track data lineage, optimize performance, and provide schema predictions throughout the pipeline.&lt;/p&gt;
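&lt;p&gt;The &lt;code&gt;topological_sort&lt;/code&gt; body is elided above. Flowfile's actual implementation isn't shown here, but the standard approach is Kahn's algorithm; a self-contained sketch over plain node ids (the graph literal below is illustrative, not Flowfile's data model):&lt;/p&gt;

```python
from collections import deque
from typing import Dict, Hashable, List

def topological_sort(downstream: Dict[Hashable, List[Hashable]]) -> List[Hashable]:
    """Order nodes so every node runs after all nodes it depends on.

    `downstream` maps each node id to the ids of nodes consuming its output.
    """
    in_degree = {node: 0 for node in downstream}
    for consumers in downstream.values():
        for node in consumers:
            in_degree[node] = in_degree.get(node, 0) + 1

    # Start with nodes that have no upstream dependencies (e.g. file readers)
    ready = deque(node for node, deg in in_degree.items() if deg == 0)
    order: List[Hashable] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in downstream.get(node, []):
            in_degree[consumer] -= 1
            if in_degree[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(in_degree):
        raise ValueError("Graph contains a cycle; not a valid DAG")
    return order

# read_csv -> filter -> join, read_db -> join
order = topological_sort({"read_csv": ["filter"], "read_db": ["join"], "filter": ["join"]})
print(order)  # every node appears after its dependencies
```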

&lt;h2&gt;
  
  
  Leveraging lazy evaluation with Polars
&lt;/h2&gt;

&lt;p&gt;The real power of Flowfile comes from leveraging Polars' &lt;strong&gt;lazy evaluation&lt;/strong&gt;. Instead of processing data immediately at each step, Polars builds an optimized &lt;em&gt;execution plan&lt;/em&gt;. The actual computation is deferred until the result is explicitly requested (e.g., writing to a file or displaying data).&lt;/p&gt;

&lt;p&gt;This provides several significant benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Reduced Memory Usage&lt;/strong&gt;: Data is loaded and processed only when necessary, often streaming through operations without loading entire datasets into memory at once.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query Optimization&lt;/strong&gt;: Polars analyzes the entire plan and can reorder, combine, or eliminate operations for maximum efficiency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Parallel Execution&lt;/strong&gt;: Polars automatically parallelizes operations across available CPU cores during execution.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Predicate Pushdown&lt;/strong&gt;: Filters and selections are applied as early as possible in the plan, often directly at the data source level (like during file reading), minimizing the amount of data that needs to be processed downstream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider reading, filtering, and writing data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traditional eager approach (less efficient):
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Reads entire file into memory immediately
&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Filters the in-memory dataframe
&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Writes the filtered result
&lt;/span&gt;
&lt;span class="c1"&gt;# Polars' lazy evaluation approach (efficient):
# 1. Build the plan (no data loaded yet)
&lt;/span&gt;&lt;span class="n"&gt;lazy_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Creates a plan to read the CSV
&lt;/span&gt;&lt;span class="n"&gt;filtered_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lazy_plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Adds filtering to the plan
&lt;/span&gt;
&lt;span class="c1"&gt;# 2. Execute the optimized plan (only happens here)
&lt;/span&gt;&lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_plan&lt;/span&gt;
&lt;span class="n"&gt;filtering&lt;/span&gt; &lt;span class="n"&gt;efficiently&lt;/span&gt;
&lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sink_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Writes the final result
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flowfile uses this lazy approach extensively, especially in Performance Mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution modes: Development vs. Performance
&lt;/h2&gt;

&lt;p&gt;Flowfile offers two execution modes tailored for different needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Development Mode&lt;/th&gt;
&lt;th&gt;Performance Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interactive debugging, step inspection&lt;/td&gt;
&lt;td&gt;Optimized execution for production/speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes node-by-node&lt;/td&gt;
&lt;td&gt;Builds full plan, executes minimally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Caches intermediate results per step&lt;/td&gt;
&lt;td&gt;Minimal caching (only if specified/needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Preview Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available for all nodes&lt;/td&gt;
&lt;td&gt;Only for final/cached nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Potentially higher&lt;/td&gt;
&lt;td&gt;Generally lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Faster for complex flows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Development Mode
&lt;/h3&gt;

&lt;p&gt;In Development mode, each node's transformation is triggered sequentially. After a node executes (within the Worker service), its intermediate result is typically serialized using &lt;strong&gt;Apache Arrow IPC format&lt;/strong&gt; and cached to disk. This allows you to inspect the data at each step in the Designer via small samples fetched from the cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Mode
&lt;/h3&gt;

&lt;p&gt;In Performance mode, Flowfile fully embraces Polars' lazy evaluation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The Core service constructs the &lt;em&gt;entire&lt;/em&gt; Polars execution plan based on the DAG.&lt;/li&gt;
&lt;li&gt; This plan (&lt;code&gt;LazyFrame&lt;/code&gt;) is passed to the Worker service.&lt;/li&gt;
&lt;li&gt; The Worker only &lt;em&gt;materializes&lt;/em&gt; (executes &lt;code&gt;.collect()&lt;/code&gt; or &lt;code&gt;.sink_*()&lt;/code&gt;) the plan when:

&lt;ul&gt;
&lt;li&gt;An output node (like writing to a file) requires the final result.&lt;/li&gt;
&lt;li&gt;A node is explicitly configured to cache its results (&lt;code&gt;node.cache_results&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This minimizes computation and memory usage by avoiding unnecessary intermediate materializations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Execution logic in Performance Mode (simplified)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_performance_mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;NodeStep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_output_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handles execution in performance mode, leveraging lazy evaluation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_output_node&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If result is needed (output or caching), trigger execution in Worker
&lt;/span&gt;        &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExternalDfFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Pass the LazyFrame plan
&lt;/span&gt;            &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Unique reference for caching
&lt;/span&gt;            &lt;span class="n"&gt;wait_on_completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt; &lt;span class="c1"&gt;# Usually async
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Worker executes .collect() or .sink_*() and caches if needed
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# May return LazyFrame or trigger compute
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="c1"&gt;# Or potentially just confirmation if sinking
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If not output/cached, just pass the LazyFrame plan along
&lt;/span&gt;        &lt;span class="c1"&gt;# No computation happens here for intermediate nodes
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crucially, &lt;strong&gt;all actual data processing and materialization of Polars DataFrames/LazyFrames happens in the Worker service&lt;/strong&gt;. This separation prevents large datasets from overwhelming the Core service, ensuring the UI remains responsive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficient inter-process communication (IPC) between Core and Worker
&lt;/h2&gt;

&lt;p&gt;Since Core orchestrates and Worker computes, efficient communication is vital. Flowfile uses &lt;strong&gt;Apache Arrow IPC format&lt;/strong&gt; for data exchange:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Worker Processing&lt;/strong&gt;: When the Worker needs to materialize data (either intermediate cache in Dev mode or final results), it computes the Polars DataFrame.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Serialization &amp;amp; Caching&lt;/strong&gt;: The resulting DataFrame is serialized into the efficient Arrow IPC binary format and saved to a temporary file on disk. This file acts as a cache, identified by a unique hash (&lt;code&gt;file_ref&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reference Passing&lt;/strong&gt;: The Worker informs the Core that the result is ready at the specified &lt;code&gt;file_ref&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Core Fetching&lt;/strong&gt;: If the Core (or subsequently, another Worker task) needs this data, it uses the &lt;code&gt;file_ref&lt;/code&gt; to access the cached Arrow file directly. This avoids sending large datasets over network sockets between processes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;UI Sampling&lt;/strong&gt;: For UI previews, the Core requests a small sample (e.g., the first 100 rows) from the Worker. The Worker reads just the sample from the Arrow IPC file and sends only that lightweight data back to the Core, which forwards it to the Designer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the Core might offload computation to the Worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core side - Initiating remote execution in the Worker
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;performance_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Offloads the execution of a node&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s LazyFrame to the Worker service.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a fetcher instance to manage communication with the Worker
&lt;/span&gt;    &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExternalDfFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;lf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_resulting_data&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The Polars LazyFrame plan
&lt;/span&gt;        &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Unique identifier for the result/cache
&lt;/span&gt;        &lt;span class="n"&gt;wait_on_completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Operate asynchronously
&lt;/span&gt;        &lt;span class="c1"&gt;# Pass other necessary context like flow_id, node_id, operation type...
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Store the fetcher to potentially retrieve results later
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_fetch_cached_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;

    &lt;span class="c1"&gt;# Request the Worker to start processing (this returns quickly)
&lt;/span&gt;    &lt;span class="c1"&gt;# The actual computation happens asynchronously in the Worker
&lt;/span&gt;    &lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_processing_in_worker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Hypothetical method name
&lt;/span&gt;
    &lt;span class="c1"&gt;# If we immediately need the result (e.g., for a subsequent synchronous step):
&lt;/span&gt;    &lt;span class="c1"&gt;# lf = external_df_fetcher.get_result() # This would block until Worker is done
&lt;/span&gt;    &lt;span class="c1"&gt;# self.results.resulting_data = FlowfileTable(lf)
&lt;/span&gt;
    &lt;span class="c1"&gt;# For UI updates, request a sample separately
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_example_data_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;external_df_fetcher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Fetches sample async
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And how the Worker might manage the separate process execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Worker side - Managing computation in a separate process
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Serialized LazyFrame plan
&lt;/span&gt;    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Path for cached output (Arrow IPC file)
&lt;/span&gt;    &lt;span class="c1"&gt;# ... other args like operation type
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Launches a separate OS process to handle the heavy computation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Use multiprocessing context for safety
&lt;/span&gt;    &lt;span class="n"&gt;mp_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spawn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# or 'fork' depending on OS/needs
&lt;/span&gt;
    &lt;span class="c1"&gt;# Shared memory/queue for progress tracking and results/errors
&lt;/span&gt;    &lt;span class="n"&gt;progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Shared integer for progress %
&lt;/span&gt;    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Shared buffer for error messages
&lt;/span&gt;    &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# For potentially passing back results (or file ref)
&lt;/span&gt;
    &lt;span class="c1"&gt;# Define the target function and arguments for the new process
&lt;/span&gt;    &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mp_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;process_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The function that runs Polars .collect()/.sink()
&lt;/span&gt;        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;polars_serializable_object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;progress&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error_message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Where to save the Arrow IPC output
&lt;/span&gt;            &lt;span class="c1"&gt;# ... other necessary kwargs
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Launch the independent process
&lt;/span&gt;
    &lt;span class="c1"&gt;# Monitor the task (e.g., update status in a database, check progress)
&lt;/span&gt;    &lt;span class="nf"&gt;handle_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
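&lt;p&gt;The &lt;code&gt;process_task&lt;/code&gt; function handed to &lt;code&gt;Process(target=...)&lt;/code&gt; isn't shown above. Here's a hedged, stdlib-only sketch of the shape such a child-process entry point could take — the real version would deserialize the LazyFrame and sink its result to &lt;code&gt;file_path&lt;/code&gt; as Arrow IPC, which is stubbed out here:&lt;/p&gt;

```python
import multiprocessing


def process_task(polars_serializable_object, progress, error_message, queue, file_path):
    """Child-process entry point: execute the plan and report status back.

    Sketch only: a real implementation would deserialize the LazyFrame plan
    from polars_serializable_object and write Arrow IPC output to file_path.
    """
    try:
        # ... deserialize the plan and collect()/sink it here (stubbed) ...
        progress.value = 100              # completion signalled via shared int
        queue.put(file_path)              # hand the output file reference back
    except Exception as exc:
        msg = str(exc).encode()[:1024]    # truncate to the shared buffer size
        error_message[: len(msg)] = msg   # parent polls this for failures
```

&lt;p&gt;The parent side (the &lt;code&gt;handle_task&lt;/code&gt; call above) can then poll &lt;code&gt;progress.value&lt;/code&gt; and &lt;code&gt;error_message&lt;/code&gt; while the child runs, and read the output file reference from the queue once it appears.&lt;/p&gt;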



&lt;p&gt;This architecture ensures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Responsiveness&lt;/strong&gt;: The Core service remains light and responsive to UI interactions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Memory Isolation&lt;/strong&gt;: Heavy memory usage is contained within transient Worker processes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Efficiency&lt;/strong&gt;: Arrow IPC provides fast serialization/deserialization and efficient disk caching.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalability&lt;/strong&gt;: Datasets larger than any single service's memory can be handled by processing data in chunks or by relying on Polars' streaming capabilities where applicable.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  Conclusion: Combining Visual Simplicity with Polars' Power
&lt;/h2&gt;

&lt;p&gt;Flowfile tries to bridge the gap between the intuitive, visual workflow building familiar from tools like Alteryx or KNIME, and the raw performance and flexibility you get from modern data libraries like Polars.&lt;/p&gt;

&lt;p&gt;By architecting the system around Polars' lazy evaluation, providing real-time schema feedback, and employing a robust multi-process model with efficient Arrow IPC communication, Flowfile delivers a user-friendly experience without sacrificing the speed needed for demanding data tasks. It's designed as a powerful tool for data analysts who prefer visual interfaces and data engineers who need performance and control.&lt;/p&gt;

&lt;p&gt;Flowfile is intended as a performant, intuitive, and open-source solution for modern data transformation challenges.&lt;/p&gt;




&lt;p&gt;If you're interested in trying Flowfile, exploring its capabilities, or contributing to its development, check out the &lt;a href="https://github.com/edwardvaneechoud/Flowfile" rel="noopener noreferrer"&gt;&lt;strong&gt;Flowfile GitHub repository&lt;/strong&gt;&lt;/a&gt;. Feedback and contributions are highly encouraged!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What are your favorite visual ETL tools, or how do you combine visual and code-based approaches in your data workflows? Share your thoughts in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>polars</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
