<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ian Cowley</title>
    <description>The latest articles on Forem by Ian Cowley (@iancowley).</description>
    <link>https://forem.com/iancowley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928889%2F2cfd8346-cffe-47c8-8e57-0aef2a0a4abc.jpeg</url>
      <title>Forem: Ian Cowley</title>
      <link>https://forem.com/iancowley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iancowley"/>
    <language>en</language>
    <item>
      <title>I built a native C# DataFrame engine to rival Python Polars (it's actually faster on some things)</title>
      <dc:creator>Ian Cowley</dc:creator>
      <pubDate>Wed, 13 May 2026 10:32:44 +0000</pubDate>
      <link>https://forem.com/iancowley/i-built-a-native-c-dataframe-engine-to-rival-python-polars-its-actually-faster-on-some-things-opg</link>
      <guid>https://forem.com/iancowley/i-built-a-native-c-dataframe-engine-to-rival-python-polars-its-actually-faster-on-some-things-opg</guid>
      <description>&lt;h1&gt;
  
  
  I built a native C# DataFrame engine to rival Python Polars (it's actually faster on some things)
&lt;/h1&gt;

&lt;p&gt;If you work in data science or heavy data engineering, you already know about &lt;a href="https://pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt;. It’s the Rust-backed powerhouse that took the Python ecosystem by storm, leaving Pandas in the dust.&lt;/p&gt;

&lt;p&gt;But if you’re a .NET developer, the data manipulation story has always been a bit… frustrating. We have &lt;code&gt;Microsoft.Data.Analysis&lt;/code&gt;, but it lacks the expressive lazy API and raw speed we crave. We often end up exporting data to Python just to process it, only to bring it back to C#.&lt;/p&gt;

&lt;p&gt;I got tired of waiting for a native .NET solution. So, I decided to build one from scratch.&lt;/p&gt;

&lt;p&gt;Meet &lt;strong&gt;&lt;a href="https://github.com/ian-cowley/Glacier.Polaris" rel="noopener noreferrer"&gt;Glacier.Polaris&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is a high-performance, strongly-typed DataFrame library for C# (.NET 10). It features SIMD-accelerated compute kernels, a lazy execution engine, native nullability (Kleene logic), and it currently passes 135/135 golden-file parity tests against Python Polars.&lt;/p&gt;

&lt;p&gt;And after weeks of fighting the .NET JIT compiler and CPU caches, it is actually beating Polars in several key benchmarks.&lt;/p&gt;

&lt;p&gt;Here is how I pushed C# to its physical limits to pull this off.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Zero-Allocation and SIMD String Filtering
&lt;/h2&gt;

&lt;p&gt;In standard C#, string operations are heavy. If you filter a DataFrame with &lt;code&gt;df.Filter(Expr.Col("Status") == "Completed")&lt;/code&gt;, checking materialized &lt;code&gt;.NET&lt;/code&gt; string objects one by one will instantly ruin your performance due to pointer-chasing and heap allocations.&lt;/p&gt;

&lt;p&gt;To beat Polars (which uses Arrow's contiguous memory format), I couldn't use C# strings.&lt;/p&gt;

&lt;p&gt;Instead, Glacier.Polaris stores strings as flat UTF-8 byte arrays. When you execute an equality filter, the engine loads your target string into a &lt;code&gt;Vector256&amp;lt;byte&amp;gt;&lt;/code&gt; register. As it scans the column, each &lt;code&gt;Vector256.Equals&lt;/code&gt; call (a single AVX2 byte-compare instruction on supporting hardware) checks 32 bytes at a time against the target bytes.&lt;/p&gt;
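
&lt;p&gt;Here is a minimal sketch of that idea, not the actual Glacier.Polaris kernel. It assumes an Arrow-style layout (one flat byte buffer plus an offsets array), and the names &lt;code&gt;Utf8EqualityFilter&lt;/code&gt;, &lt;code&gt;EqualsMask&lt;/code&gt; and &lt;code&gt;BytesEqual&lt;/code&gt; are hypothetical. It needs .NET 7 or later for the cross-platform &lt;code&gt;Vector256&lt;/code&gt; helpers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Runtime.Intrinsics;
using System.Text;

public static class Utf8EqualityFilter
{
    // Assumed layout: `data` holds every value back to back, and row i
    // occupies the bytes from offsets[i] up to (not including) offsets[i + 1].
    public static bool[] EqualsMask(byte[] data, int[] offsets, string target)
    {
        byte[] needle = Encoding.UTF8.GetBytes(target);
        var mask = new bool[offsets.Length - 1];
        for (int row = 0; row != mask.Length; row++)
        {
            int start = offsets[row];
            int length = offsets[row + 1] - start;
            // A length mismatch rejects the row without touching its bytes.
            if (length == needle.Length)
                mask[row] = BytesEqual(data, start, needle);
        }
        return mask;
    }

    private static bool BytesEqual(byte[] data, int start, byte[] needle)
    {
        int i = 0;
        int vectorEnd = needle.Length - needle.Length % 32;
        // Compare 32 bytes per step while a full vector remains.
        for (; i != vectorEnd; i += 32)
        {
            var a = Vector256.Create(data, start + i);
            var b = Vector256.Create(needle, i);
            // Equals sets every bit of a matching byte lane; collapsing the
            // lanes to one bit each gives all ones only if all 32 bytes match.
            if (Vector256.Equals(a, b).ExtractMostSignificantBits() != uint.MaxValue)
                return false;
        }
        // Scalar tail for the last bytes that do not fill a vector.
        for (; i != needle.Length; i++)
            if (data[start + i] != needle[i]) return false;
        return true;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, with the UTF-8 bytes of &lt;code&gt;"CompletedPendingCompleted"&lt;/code&gt; as &lt;code&gt;data&lt;/code&gt; and offsets &lt;code&gt;{ 0, 9, 16, 25 }&lt;/code&gt;, filtering on &lt;code&gt;"Completed"&lt;/code&gt; marks rows 0 and 2. No &lt;code&gt;string&lt;/code&gt; is allocated per row; the only allocations are the encoded target and the result mask.&lt;/p&gt;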

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt; String exact-match filtering in Glacier runs in &lt;strong&gt;3.62 ms&lt;/strong&gt; for 1 million rows, beating Polars (&lt;strong&gt;~4.2 ms&lt;/strong&gt;) with zero string allocations.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Breaking the 4ms Barrier: The Float64 Sorting War
&lt;/h2&gt;

&lt;p&gt;The hardest fight I had was with &lt;code&gt;ArgSort&lt;/code&gt; on Float64 data.&lt;/p&gt;

&lt;p&gt;Initially, I wrote a highly optimized, single-threaded Radix sort. I managed to drop the sorting time for 1 million floats to &lt;strong&gt;13.11 ms&lt;/strong&gt;, a massive 5.4x speedup over the standard .NET &lt;code&gt;Array.Sort&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But Polars was doing it in &lt;strong&gt;4.21 ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At 13 ms, my C# code had officially maxed out the physical capabilities of a single CPU core. The sort moves roughly 200 MB of data (keys and indices) over 8 radix passes. Doing that in 13 ms already works out to about 15 GB/s, which is right where a single core physically taps out (around 15-20 GB/s). Matching Polars' 4.21 ms would require about 47 GB/s of memory bandwidth.&lt;/p&gt;

&lt;p&gt;I needed to parallelize it. But .NET's &lt;code&gt;Parallel.For&lt;/code&gt; has too much overhead; spinning up the ThreadPool state machine takes 1-2ms alone, which is a death sentence when your target is 4ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix: The Parallel Block Tournament Merge&lt;/strong&gt;&lt;br&gt;
Instead of using standard .NET parallel loops, I built a custom generic parallel block merge engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The engine slices the 1M array into isolated chunks and hands them to raw &lt;code&gt;Task&lt;/code&gt; objects.&lt;/li&gt;
&lt;li&gt;Each core executes a single-threaded Radix sort on a chunk sized to stay inside its &lt;strong&gt;L2 Cache&lt;/strong&gt;, so after the initial load the radix passes hit cache instead of system RAM, avoiding Translation Lookaside Buffer (TLB) thrashing.&lt;/li&gt;
&lt;li&gt;The engine merges the sorted chunks using a stable, parallel pairwise tournament merge.&lt;/li&gt;
&lt;/ol&gt;
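
&lt;p&gt;The three steps above can be sketched roughly like this. It is a simplified illustration, not the engine's code: it sorts plain values rather than producing an ArgSort index, &lt;code&gt;Array.Sort&lt;/code&gt; stands in for the per-chunk Radix sort, and the names &lt;code&gt;ChunkedSort&lt;/code&gt;, &lt;code&gt;Sort&lt;/code&gt; and &lt;code&gt;Merge&lt;/code&gt; are hypothetical. It assumes &lt;code&gt;chunkCount&lt;/code&gt; is at least 1.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Threading.Tasks;

public static class ChunkedSort
{
    // Returns a sorted copy of `values`: sort about `chunkCount` slices in
    // parallel, then merge neighbouring sorted runs pairwise until one remains.
    public static double[] Sort(double[] values, int chunkCount)
    {
        var src = (double[])values.Clone();
        var dst = new double[src.Length];
        int chunk = Math.Max(1, (src.Length + chunkCount - 1) / chunkCount);
        int runs = (src.Length + chunk - 1) / chunk;

        // Phase 1: each task sorts its own slice; slices never overlap.
        // Array.Sort is a stand-in here for a per-chunk radix sort.
        var sortTasks = new Task[runs];
        for (int r = 0; r != runs; r++)
        {
            int start = r * chunk;
            int length = Math.Min(chunk, src.Length - start);
            sortTasks[r] = Task.Run(() =&amp;gt; Array.Sort(src, start, length));
        }
        Task.WaitAll(sortTasks);

        // Phase 2: merge neighbouring runs; the run width doubles each round.
        for (int width = chunk; width &amp;lt; src.Length; width *= 2)
        {
            int pairs = (src.Length + 2 * width - 1) / (2 * width);
            var mergeTasks = new Task[pairs];
            for (int p = 0; p != pairs; p++)
            {
                int lo = p * 2 * width;
                int mid = Math.Min(lo + width, src.Length);
                int hi = Math.Min(lo + 2 * width, src.Length);
                double[] input = src, output = dst;
                mergeTasks[p] = Task.Run(() =&amp;gt; Merge(input, output, lo, mid, hi));
            }
            Task.WaitAll(mergeTasks);
            (src, dst) = (dst, src); // the merged output feeds the next round
        }
        return src;
    }

    // Stable two-way merge of input[lo..mid) and input[mid..hi) into output[lo..hi).
    private static void Merge(double[] input, double[] output, int lo, int mid, int hi)
    {
        int left = lo, right = mid;
        for (int k = lo; k != hi; k++)
        {
            bool takeLeft;
            if (right == hi) takeLeft = true;        // right run exhausted
            else if (left == mid) takeLeft = false;  // left run exhausted
            else takeLeft = input[left] &amp;lt;= input[right]; // ties go left (stable)
            output[k] = takeLeft ? input[left++] : input[right++];
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;ChunkedSort.Sort(values, Environment.ProcessorCount)&lt;/code&gt; gives one chunk per logical core. Each merge round works on disjoint ranges, so the tasks inside a round need no locking.&lt;/p&gt;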

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt; Float64 sorting dropped to &lt;strong&gt;12.05 ms&lt;/strong&gt; for 1M rows, and successfully scaled to sort 10 Million rows in just &lt;strong&gt;84.71 ms&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. The Benchmarks (C# vs Polars)
&lt;/h2&gt;

&lt;p&gt;I ran these benchmarks on the same machine, comparing Glacier.Polaris (.NET 10 Release build) against Polars 1.40.1.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Times are in milliseconds. Lower is better).&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation (1M Rows)&lt;/th&gt;
&lt;th&gt;Glacier.Polaris (C#)&lt;/th&gt;
&lt;th&gt;Python Polars&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataFrame Creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.02 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5.33 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (~266x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sum (Int32)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.14 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.45 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (3.2x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard Deviation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.33 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.55 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (1.7x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GroupBy Sum (Int32)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.56 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5.20 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (3.3x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inner Join (Small Right)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.29 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4.61 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (2.0x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rolling StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.15 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;12.92 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;🟢 C# (4.1x)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By utilizing single-pass Welford algorithms for variance, contiguous memory, and a custom Fibonacci-hashing hash map for joins, C# absolutely flies.&lt;/p&gt;
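
&lt;p&gt;For readers who have not met those two techniques, here is a small sketch of each. These are textbook versions, not the Glacier.Polaris kernels, and the names &lt;code&gt;KernelSketches&lt;/code&gt;, &lt;code&gt;SampleStdDev&lt;/code&gt; and &lt;code&gt;FibonacciSlot&lt;/code&gt; are hypothetical. The standard deviation uses the sample form (dividing by n - 1), and &lt;code&gt;FibonacciSlot&lt;/code&gt; assumes a power-of-two table with &lt;code&gt;tableBits&lt;/code&gt; between 1 and 30.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public static class KernelSketches
{
    // Welford's algorithm: one pass over the data, updating the running mean
    // and the running sum of squared deviations (m2) for every value.
    public static double SampleStdDev(double[] values)
    {
        long count = 0;
        double mean = 0.0, m2 = 0.0;
        foreach (double x in values)
        {
            count++;
            double delta = x - mean;   // distance from the old mean
            mean += delta / count;     // move the mean towards x
            m2 += delta * (x - mean);  // old delta times new delta
        }
        // The sample variance is undefined for fewer than two values.
        if (count &amp;lt; 2) return double.NaN;
        return Math.Sqrt(m2 / (count - 1));
    }

    // Fibonacci hashing: multiply by 2^64 divided by the golden ratio and
    // keep the top `tableBits` bits as the slot in a power-of-two table.
    public static int FibonacciSlot(long key, int tableBits)
    {
        ulong mixed = unchecked((ulong)key * 0x9E3779B97F4A7C15UL);
        return (int)(mixed &amp;gt;&amp;gt; (64 - tableBits));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Welford avoids a second pass over the column and is more numerically stable than the naive sum-of-squares formula. The multiplicative hash spreads sequential integer keys across the table, which is a common situation for join keys.&lt;/p&gt;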
&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;Glacier.Polaris covers ~98% of the Python Polars core surface area, including LazyFrames, query optimization (predicate/projection pushdowns), and full temporal operations.&lt;/p&gt;

&lt;p&gt;If you are building high-performance data pipelines, backtesting financial algorithms, or doing ML preprocessing in .NET, I’d love for you to try it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ian-cowley/Glacier.Polaris" rel="noopener noreferrer"&gt;https://github.com/ian-cowley/Glacier.Polaris&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NuGet:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet add package Glacier.Polaris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Star the repo, try to break the lazy execution engine, and let me know what features you want to see next! Let's bring world-class data engineering to .NET.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>polars</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
