<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gabriel</title>
    <description>The latest articles on Forem by Gabriel (@gabriel-p).</description>
    <link>https://forem.com/gabriel-p</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3600628%2Faf9885de-0e69-46d3-b0cb-df8add6cb826.jpeg</url>
      <title>Forem: Gabriel</title>
      <link>https://forem.com/gabriel-p</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gabriel-p"/>
    <language>en</language>
    <item>
      <title>We Stopped Reaching for PySpark by Habit. Polars Made Our Small Jobs Boringly Fast.</title>
      <dc:creator>Gabriel</dc:creator>
      <pubDate>Fri, 07 Nov 2025 08:54:16 +0000</pubDate>
      <link>https://forem.com/gabriel-p/we-stopped-reaching-for-pyspark-by-habit-polars-made-our-small-jobs-boringly-fast-2d93</link>
      <guid>https://forem.com/gabriel-p/we-stopped-reaching-for-pyspark-by-habit-polars-made-our-small-jobs-boringly-fast-2d93</guid>
      <description>&lt;p&gt;You know those “we migrated and everything is 10x faster” posts that leave out the messy bits? This isn’t one of them.&lt;/p&gt;

&lt;p&gt;I'm a data engineer working in financial services, partnering with Palantir on one of our in-house strategic platforms*. Big, distributed data is part of the day job, so PySpark is the comfortable hoodie we’ve worn for years. But here’s the plot twist: for our small to mid-sized datasets (think: tens of MBs to a few GBs, not petabytes), we started swapping PySpark pipelines for Polars. And the dev loop went from coffee-break to “wait, it’s done?”&lt;/p&gt;

&lt;p&gt;Let me tell you how that happened, where Polars shines, where Spark still wins, and exactly how to translate those “Spark-isms” you’ve internalized into Polars without wanting to throw your laptop.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;*Disclaimer: The code and project shown here are my personal work and are not affiliated with, endorsed by, or the property of my employer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  TL;DR (for the colleague skimming this on a train)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Polars vs PySpark for small to mid-sized data:&lt;/strong&gt; Polars’ Rust engine, lazy evaluation, and expression API deliver faster runtimes and quicker iteration on a single machine. Keep Spark for cluster-scale workloads, governed lakehouse writes, and production streaming. For migration, replace UDFs with expressions, lean on &lt;code&gt;join_asof&lt;/code&gt;, &lt;code&gt;group_by_dynamic&lt;/code&gt;, &lt;code&gt;struct/list&lt;/code&gt; ops, and &lt;code&gt;sink_parquet()&lt;/code&gt; for outputs.&lt;/p&gt;




&lt;h4&gt;
  
  
  The short version (so you can send this to your team chat)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If your data fits on a single beefy machine, Polars will often run circles around PySpark and make your codebase smaller.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When you actually need a cluster, a metastore, ACID lakehouse writes, or battle-hardened streaming semantics, Spark is still your friend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The switch is less painful than you think. Expressions &amp;gt; UDFs, embrace lazy queries, and learn three or four Polars-specific tricks (as-of joins, dynamic windows, struct/list ops).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  A quick story from the trenches
&lt;/h4&gt;

&lt;p&gt;We had a workforce modeling DAG that joined dimension tables, applied a thicket of conditional business rules, and rolled up time-based metrics. In PySpark, the job was fine, just… ritualistic: spin a session, serialize UDFs, wait out the cluster shuffle tax, and go stare out the window while Catalyst did its thing.&lt;br&gt;
One day we rewrote the same pipeline in Polars with lazy scans and expressions. No UDFs. No session boot. Same logic, fewer lines. The run finished before the “I’ll grab a coffee” instinct completed. The drama? The cluster didn’t even get invited to the meeting.&lt;/p&gt;


&lt;h4&gt;
  
  
  Why Polars works so well on “normal-sized” data
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It’s Rust under the hood.&lt;/strong&gt; Vectorized, multi-threaded, cache-friendly. Python just orchestrates expressions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy by default (when you choose it).&lt;/strong&gt; You build a plan (&lt;code&gt;scan_parquet&lt;/code&gt; -&amp;gt; &lt;code&gt;transforms&lt;/code&gt; -&amp;gt; &lt;code&gt;collect()&lt;/code&gt;), and Polars fuses steps to avoid unnecessary passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expression API beats UDFs.&lt;/strong&gt; The minute your logic is expressible, you stop paying Python &amp;lt;-&amp;gt; JVM costs and serialization overhead.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  The non-obvious differences that actually matter
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ac88h07qyndxp2remw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ac88h07qyndxp2remw5.png" alt="polars vs pyspark" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  “But Spark can do X and I don’t see it in Polars…” (Here’s how to do it)
&lt;/h3&gt;

&lt;p&gt;Below are the patterns we actually used while migrating pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Time-range join (point-in-time lookup)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# PySpark
from pyspark.sql import functions as F, Window as W

# For each event, attach the most recent dimension row effective &amp;lt;= event.ts
dim_r = dim.withColumnRenamed("eff_ts", "dim_eff_ts")  # avoid ambiguous column refs after the join
cond = [events.abc == dim_r.abc, events.ts &amp;gt;= dim_r.dim_eff_ts]
joined = (events.join(dim_r, cond, "left")
          .withColumn("rn", F.row_number().over(
              W.partitionBy("event_id").orderBy(F.col("dim_eff_ts").desc())
          ))
          .filter("rn = 1").drop("rn"))

# --- same idea in Polars ---
# Polars
import polars as pl

events = pl.scan_parquet("events.parquet")
dim    = pl.scan_parquet("dim.parquet")

# Natural fit: as-of join on timestamp with a key.
# Both sides must be sorted by their "on" column first.
out = events.sort("ts").join_asof(
    dim.sort("eff_ts"), left_on="ts", right_on="eff_ts",
    by="abc", strategy="backward", tolerance="5m"
).collect()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When &lt;code&gt;join_asof&lt;/code&gt; isn’t enough&lt;/strong&gt; (e.g., need &lt;code&gt;ts BETWEEN start AND end&lt;/code&gt;): pre-expand dimension rows into &lt;code&gt;(start, end)&lt;/code&gt; boundaries and filter after a cheap join:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Small dim? Join on the key, then filter the range (fine for small data),
# or prebucket by day for bigger dims.
out = (events.join(dim, on="abc", how="left")
             .filter((pl.col("ts") &amp;gt;= pl.col("start")) &amp;amp; (pl.col("ts") &amp;lt; pl.col("end")))
             # if several dim rows can match, keep the most recent one per event
             .sort(["event_id", "start"], descending=[False, True])
             .unique(subset="event_id", keep="first")
      ).collect()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Complex conditional with many CASE WHENs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# PySpark
from pyspark.sql import functions as F

df = df.withColumn(
    "final_score",
    F.when(F.col("score") &amp;gt;= 90, "A")
     .when(F.col("score") &amp;gt;= 75, "B")
     .when((F.col("score") &amp;gt;= 60) &amp;amp; (F.col("region") == "Lake Region"), "C+")
     .otherwise("C")
)

# --- same idea in Polars ---
# Polars
import polars as pl

df = df.with_columns(
    pl.when(pl.col("score") &amp;gt;= 90).then(pl.lit("A"))
     .when(pl.col("score") &amp;gt;= 75).then(pl.lit("B"))
     .when((pl.col("score") &amp;gt;= 60) &amp;amp; (pl.col("region") == "Lake Region")).then(pl.lit("C+"))
     .otherwise(pl.lit("C"))  # use pl.lit: bare strings in then()/otherwise() are read as column names
     .alias("final_score")
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The nested &lt;code&gt;when/then/otherwise&lt;/code&gt; in Polars is ergonomic and compiles into a single expression tree.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. “Collect a struct, then explode it later”&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# PySpark
aggd = (df.groupBy("forest")
          .agg(F.collect_list(F.struct("region","place")).alias("full_place")))
flat = aggd.select("forest", F.explode_outer("full_place").alias("p")).select("forest","p.*")

# --- same idea in Polars ---
# Polars
aggd = (df.group_by("forest")
          .agg(pl.struct("region", "place").alias("full_place")))  # agg already collects into a list

flat = (aggd
        .explode("full_place")
        .unnest("full_place"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;unnest&lt;/code&gt; on &lt;code&gt;struct&lt;/code&gt; is a great trick: no manual field projection necessary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;4. Rolling window over time with business calendars&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# PySpark
from pyspark.sql import functions as F, Window as W

w = W.partitionBy("abc").orderBy("ts").rowsBetween(-6, 0)
out = df.withColumn("w7", F.avg("value").over(w))

# --- same idea in Polars ---
# Polars
out = (df.sort(["abc", "ts"])
         .with_columns(
             pl.col("value")
               .rolling_mean(window_size=7, min_periods=1)  # match Spark's partial leading windows
               .over("abc")
               .alias("w7")
         ))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If your index is a timestamp with irregular spacing, reach for &lt;code&gt;group_by_dynamic&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;roll = (df.sort("ts")  # group_by_dynamic needs a sorted index column
          .group_by_dynamic(index_column="ts", every="1w", period="1w", group_by="abc")
          .agg(pl.col("value").mean().alias("w7")))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Broadcast a tiny dimension without… a hint&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# PySpark
from pyspark.sql.functions import broadcast

small_dim = spark.table("dim_small")
df = df.join(broadcast(small_dim), "key", "left")  # the hint goes on the small side

# --- same idea in Polars ---
# Polars
# Just join. If the right side is tiny, Polars keeps it in memory.
df = df.join(small_dim, on="key", how="left")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;No hint needed on a single machine; Polars will do the sensible thing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  Where we still reach for Spark
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data too big for one box.&lt;/strong&gt; If your joins or shuffles exceed a single host’s RAM/IO comfort zone, you want a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse governance.&lt;/strong&gt; Need native &lt;em&gt;Delta/Iceberg&lt;/em&gt; writes with compaction, vacuuming, and a metastore? Spark’s the grown-up in the room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production streaming semantics.&lt;/strong&gt; Exactly-once sinks, watermarking across stateful operators — Spark’s Structured Streaming is still the benchmark.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  A tiny migration playbook (what we actually did)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start lazy.&lt;/strong&gt; Prefer &lt;code&gt;pl.scan_parquet(...)&lt;/code&gt; -&amp;gt; &lt;code&gt;transform&lt;/code&gt; -&amp;gt; &lt;code&gt;collect()/sink_parquet()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete UDFs.&lt;/strong&gt; If you’re writing Python UDFs in Spark, translate them into expressions in Polars. It’s almost always possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explode early, unnest often.&lt;/strong&gt; Nested data becomes readable with &lt;code&gt;unnest&lt;/code&gt;, &lt;code&gt;explode&lt;/code&gt;, and the &lt;code&gt;list&lt;/code&gt; namespace ops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with golden data.&lt;/strong&gt; Same sample through both engines; compare row counts, sums, and a few canonical hashes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep Spark for the big guns.&lt;/strong&gt; Don’t martyr yourself trying to force a 2-TB join into one box.&lt;/li&gt;
&lt;/ol&gt;




&lt;h4&gt;
  
  
  Two-step migration (no drama version)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Like-for-like swap.&lt;/strong&gt; We replaced PySpark with Polars while keeping the business logic identical. Built the plan with &lt;code&gt;scan_*&lt;/code&gt; + expressions, then checked parity with row counts, hashes, and a few KPI spot checks. Goal: prove equivalence, not chase speed.&lt;br&gt;
&lt;strong&gt;Step 2: Optimize for Polars.&lt;/strong&gt; We refactored to lean on lazy evaluation and vectorized expressions, trimmed intermediates, and ditched Python UDFs.&lt;br&gt;
The payoff came in two clean drops: first from shedding distributed overhead (think ~5 -&amp;gt; ~2 minutes), then to consistently sub-minute after the Polars-native rewrite. Fewer lines, steadier memory, snappier UI.&lt;/p&gt;




&lt;h4&gt;
  
  
  Gotchas we hit (so you don’t)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Null propagation:&lt;/strong&gt; Polars is strict. Use &lt;code&gt;fill_null&lt;/code&gt;, &lt;code&gt;coalesce&lt;/code&gt; (&lt;code&gt;pl.coalesce([colA, colB])&lt;/code&gt;) intentionally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types are not vibes.&lt;/strong&gt; Specify &lt;code&gt;dtype&lt;/code&gt; on read if your CSVs are “creative.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-equi joins:&lt;/strong&gt; Reach for &lt;code&gt;join_asof&lt;/code&gt; or filter-after-join patterns. If it looks like a cross join… it is a cross join—use it sparingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don’t &lt;code&gt;apply&lt;/code&gt; unless you must.&lt;/strong&gt; If you write a Python lambda over rows, you’re leaving performance on the floor. Stick to expressions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  What I’d do on Monday
&lt;/h4&gt;

&lt;p&gt;Pick a pipeline under a few GB, port the transforms you can &lt;strong&gt;without UDFs&lt;/strong&gt;, wire up &lt;code&gt;scan_*&lt;/code&gt; -&amp;gt; &lt;code&gt;expressions&lt;/code&gt; -&amp;gt; &lt;code&gt;collect()/sink_parquet()&lt;/code&gt;, and compare outputs. If your run time drops and your code shrinks, congratulations! You just freed up time for the hard problems.&lt;br&gt;
If it doesn’t? That probably means you genuinely need Spark’s distributed muscle. Use the right tool and move on with your life.&lt;/p&gt;

</description>
      <category>data</category>
      <category>pyspark</category>
      <category>polars</category>
    </item>
  </channel>
</rss>
