<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Avik</title>
    <description>The latest articles on Forem by Avik (@avik12345678).</description>
    <link>https://forem.com/avik12345678</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3606201%2F1df24695-1d08-41c4-9677-8b7d4635e8a1.jpg</url>
      <title>Forem: Avik</title>
      <link>https://forem.com/avik12345678</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/avik12345678"/>
    <language>en</language>
    <item>
      <title>Starting Dusty — A Tiny DSL for ETL &amp; Research Data Cleaning</title>
      <dc:creator>Avik</dc:creator>
      <pubDate>Thu, 11 Dec 2025 11:55:17 +0000</pubDate>
      <link>https://forem.com/avik12345678/starting-dusty-a-tiny-dsl-for-etl-research-data-cleaning-29g5</link>
      <guid>https://forem.com/avik12345678/starting-dusty-a-tiny-dsl-for-etl-research-data-cleaning-29g5</guid>
      <description>&lt;p&gt;For the last few weeks I’ve been thinking seriously about building my own programming language. Not a big general-purpose language, not a Python replacement, and definitely not something with heavy ambitions. I just wanted to create something small, useful, and focused.&lt;/p&gt;

&lt;p&gt;That’s where Dusty comes in.&lt;/p&gt;

&lt;p&gt;Dusty is a lightweight DSL (domain-specific language) designed only for ETL tasks and research data cleaning. Nothing more. No huge ecosystem, no package manager, no frameworks. The entire goal is simple:&lt;/p&gt;

&lt;p&gt;turn messy CSV/JSON cleaning work into short, readable scripts.&lt;/p&gt;

&lt;p&gt;I’m starting with problems I’ve personally faced. Whenever I work on research data or hackathon datasets, I end up writing the same pattern again and again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;load CSV&lt;/li&gt;
&lt;li&gt;filter rows&lt;/li&gt;
&lt;li&gt;fix missing values&lt;/li&gt;
&lt;li&gt;rename some fields&lt;/li&gt;
&lt;li&gt;join with another file&lt;/li&gt;
&lt;li&gt;export the cleaned result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python works, but the scripts get ugly fast. Pandas is powerful, but not great for small tasks. SQL is good for structured tables but not for irregular CSVs. Most ETL tools are built for companies, not students or indie developers.&lt;/p&gt;

&lt;p&gt;So Dusty focuses on the middle ground:&lt;br&gt;
simple data transformations without the overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Dusty will look like (early prototype idea)
&lt;/h2&gt;

&lt;p&gt;A Dusty script looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source users = csv("users.csv")

transform adults = users
  | filter(r -&amp;gt; int(r.age) &amp;gt;= 18)
  | map(r -&amp;gt; { id: r.id, name: r.name })

save adults to csv("clean_adults.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Readable.&lt;br&gt;
No imports.&lt;br&gt;
No boilerplate.&lt;br&gt;
Just the data flow.&lt;/p&gt;

&lt;p&gt;Dusty will support the essential ETL operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;source&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;filter&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;map&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;select / rename&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;join&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aggregate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;save&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s enough to clean real datasets used in labs, projects, and university research.&lt;/p&gt;
&lt;h2&gt;
  
  
  How I’m building it
&lt;/h2&gt;

&lt;p&gt;This is my first language project, so I’m keeping things practical:&lt;/p&gt;

&lt;p&gt;The Dusty interpreter is written in Python (the host language has no bearing on Dusty’s own syntax).&lt;/p&gt;

&lt;p&gt;Dusty code will live in &lt;code&gt;.dusty&lt;/code&gt; files.&lt;/p&gt;
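&lt;p&gt;To make the “interpreter in Python” part concrete, here’s a minimal sketch of the core idea: rows as dicts, and a pipeline as an ordered list of steps. None of this is Dusty’s real implementation — the step names and helpers are illustrative only:&lt;/p&gt;

```python
import csv
import io

def source_csv(text):
    # Parse CSV text into a list of row dicts (a hypothetical `source` step).
    return list(csv.DictReader(io.StringIO(text)))

def run_pipeline(rows, steps):
    # Apply each (op, fn) step in order, mirroring Dusty's `|` chaining.
    for op, fn in steps:
        if op == "filter":
            rows = [r for r in rows if fn(r)]
        elif op == "map":
            rows = [fn(r) for r in rows]
    return rows

data = "id,name,age\n1,Ana,21\n2,Bo,15\n3,Cy,34\n"
adults = run_pipeline(
    source_csv(data),
    [
        ("filter", lambda r: int(r["age"]) >= 18),
        ("map", lambda r: {"id": r["id"], "name": r["name"]}),
    ],
)
print(adults)
```

A real interpreter would add a parser for the `.dusty` syntax on top; the evaluation core can stay about this small.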

&lt;p&gt;Users run it with a simple CLI like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dusty run main.dsty

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My plan is to finish Dusty v0.1 with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a working parser&lt;/li&gt;
&lt;li&gt;CSV support&lt;/li&gt;
&lt;li&gt;filter/map&lt;/li&gt;
&lt;li&gt;save&lt;/li&gt;
&lt;li&gt;a couple of example pipelines&lt;/li&gt;
&lt;li&gt;basic documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m not adding a package manager, modules, or big features yet. Dusty v0.1 should be small enough that anyone can understand the whole project in one sitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I’m writing this publicly
&lt;/h2&gt;

&lt;p&gt;I’ve noticed something: when you build in silence, you get lost. When you build in public, even quietly, you naturally stay accountable. So this weekly blog is just a way to share the progress, mistakes, and insights along the journey of creating a tiny DSL from scratch.&lt;/p&gt;

&lt;p&gt;No big promises.&lt;br&gt;
No hype.&lt;br&gt;
Just consistent work.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>computervision</category>
      <category>dsl</category>
    </item>
    <item>
      <title>The Real Cost of LLM Inference: Memory Bandwidth, Not FLOPs</title>
      <dc:creator>Avik</dc:creator>
      <pubDate>Fri, 21 Nov 2025 16:08:32 +0000</pubDate>
      <link>https://forem.com/avik12345678/the-real-cost-of-llm-inference-memory-bandwidth-not-flops-3855</link>
      <guid>https://forem.com/avik12345678/the-real-cost-of-llm-inference-memory-bandwidth-not-flops-3855</guid>
      <description>&lt;p&gt;For years, AI performance discussions focused on a single metric: &lt;strong&gt;FLOPs&lt;/strong&gt; — floating-point operations per second.&lt;br&gt;&lt;br&gt;
But in 2025, FLOPs are no longer the real bottleneck for LLM inference.&lt;/p&gt;

&lt;p&gt;If you run any modern model (Llama 3, Qwen2.5, Mistral, Gemma, DeepSeek), you’ll notice something strange:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your GPU is &lt;em&gt;idle&lt;/em&gt;, but your VRAM is &lt;em&gt;choking&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not a software issue.&lt;br&gt;&lt;br&gt;
It’s a fundamental hardware constraint.&lt;/p&gt;

&lt;p&gt;This post explains why.&lt;/p&gt;




&lt;h1&gt;
  
  
  1. LLMs Don’t Compute — They &lt;em&gt;Fetch&lt;/em&gt;
&lt;/h1&gt;

&lt;p&gt;During inference, an LLM does almost no “heavy math.”&lt;br&gt;&lt;br&gt;
Each token only requires a small number of matrix multiplies.&lt;/p&gt;

&lt;p&gt;The real work is:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Loading billions of parameters from memory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;over and over again&lt;/em&gt;&lt;br&gt;&lt;br&gt;
into the GPU compute cores.&lt;/p&gt;

&lt;p&gt;If those parameters sit in VRAM or system RAM, the GPU must continuously &lt;strong&gt;stream&lt;/strong&gt; them into the tensor cores.&lt;/p&gt;

&lt;p&gt;And memory bandwidth is finite.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A100 GPU memory bandwidth: &lt;strong&gt;2 TB/s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;RTX 4090 memory bandwidth: &lt;strong&gt;1 TB/s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Llama-3-70B FP16 weights: &lt;strong&gt;140 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Llama-3-70B Q4_K_M weights: ~&lt;strong&gt;38 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with quantization:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You simply cannot move tens of GB through a memory bus fast enough to feed the compute units.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So compute sits idle.&lt;/p&gt;
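&lt;p&gt;A quick back-of-envelope check makes the point: if every generated token has to stream all weights through the memory bus once, the decode-speed ceiling is roughly bandwidth divided by model size. A sketch using the approximate figures above (and ignoring that 140 GB doesn’t even fit in consumer VRAM — this is purely a bandwidth ceiling):&lt;/p&gt;

```python
def roofline_tokens_per_s(bandwidth_gb_s, weights_gb):
    # Upper bound on single-stream decode speed if each token
    # must stream every weight byte from memory exactly once.
    return bandwidth_gb_s / weights_gb

# ~1000 GB/s bus (RTX 4090 class) running a 70B model
fp16_ceiling = roofline_tokens_per_s(1000, 140)  # FP16 weights, 140 GB
q4_ceiling = roofline_tokens_per_s(1000, 38)     # Q4_K_M weights, ~38 GB

print(round(fp16_ceiling, 1))  # ~7 tokens/s ceiling
print(round(q4_ceiling, 1))    # ~26 tokens/s ceiling
```

No amount of extra FLOPs raises these ceilings; only more bandwidth or fewer bytes per token does.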




&lt;h1&gt;
  
  
  2. Why FLOPs Are Misleading for LLMs
&lt;/h1&gt;

&lt;p&gt;LLMs are not like vision models.&lt;br&gt;&lt;br&gt;
They don’t process entire batches.&lt;br&gt;&lt;br&gt;
They generate tokens &lt;strong&gt;one at a time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For each token, the model must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read every attention layer’s parameters
&lt;/li&gt;
&lt;li&gt;Read every MLP block's parameters
&lt;/li&gt;
&lt;li&gt;Read rotary / positional / Softmax data
&lt;/li&gt;
&lt;li&gt;Run a tiny amount of math
&lt;/li&gt;
&lt;li&gt;Output a few thousand probabilities&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In most layers—especially attention—the math is tiny compared to the &lt;strong&gt;weight-loading cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So even if your GPU has &lt;strong&gt;100 TFLOPS&lt;/strong&gt; of compute, it will likely use only 30–40% of that during LLM inference.&lt;/p&gt;

&lt;p&gt;Because compute waits for memory.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. A Simple Example: Why Bigger Models Don’t Always Run Slower
&lt;/h1&gt;

&lt;p&gt;Consider two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen2.5-7B&lt;/strong&gt; — 7 billion params
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama3-8B&lt;/strong&gt; — 8 billion params, a similar size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both might run at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;40 tokens/s on an RTX 4090&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200 tokens/s on an A100 with batching&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now scale to a 70B model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2–4 tokens/s on a consumer GPU
&lt;/li&gt;
&lt;li&gt;12–15 tokens/s on powerful A100/H100 clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compute did not grow 10× slower.&lt;br&gt;&lt;br&gt;
Memory movement did.&lt;/p&gt;

&lt;p&gt;The attention layers now load 10× more weights every token → bottleneck explodes.&lt;/p&gt;




&lt;h1&gt;
  
  
  4. Why Quantization Helps So Much
&lt;/h1&gt;

&lt;p&gt;Quantization is not magic.&lt;br&gt;&lt;br&gt;
It doesn’t “optimize math.”&lt;/p&gt;

&lt;p&gt;It solves a different bottleneck:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;It reduces the amount of data that must be read each token.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Size Reduction&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP16 → INT8&lt;/td&gt;
&lt;td&gt;2× smaller&lt;/td&gt;
&lt;td&gt;2× less memory bandwidth used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8 → Q4&lt;/td&gt;
&lt;td&gt;~4× smaller&lt;/td&gt;
&lt;td&gt;4× faster weight loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4 → Q2&lt;/td&gt;
&lt;td&gt;~8× smaller&lt;/td&gt;
&lt;td&gt;Only used on small models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quantization makes models &lt;strong&gt;memory-bandwidth friendly&lt;/strong&gt;, not “compute-friendly.”&lt;/p&gt;

&lt;p&gt;That’s why Qwen2.5-3B-Q4 can run &amp;gt;150 tok/s on a laptop.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. KV Cache: The Hidden Memory Killer
&lt;/h1&gt;

&lt;p&gt;During inference, each generated token gets stored as a &lt;strong&gt;key/value vector&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For long contexts (100K–1M tokens), KV cache becomes massive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen2.5-7B → 80–120 MB per 1K tokens
&lt;/li&gt;
&lt;li&gt;Llama3-70B → 600–800 MB per 1K tokens
&lt;/li&gt;
&lt;/ul&gt;
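&lt;p&gt;The arithmetic behind numbers like these is simple: every token stores one key vector and one value vector per layer. A hedged sketch — the layer count and KV width below are hypothetical round numbers, not any specific model’s config (grouped-query attention shrinks the width considerably):&lt;/p&gt;

```python
def kv_cache_mb(n_layers, kv_width, n_tokens, bytes_per_elem=2):
    # 2 vectors (K and V) per layer per token, FP16 by default.
    per_token_bytes = 2 * n_layers * kv_width * bytes_per_elem
    return per_token_bytes * n_tokens / 1e6

# Hypothetical 80-layer model with 1024-dim KV width, 1K-token context
print(kv_cache_mb(80, 1024, 1000))  # MB of cache just for this context
```

Scale `n_tokens` to 100K and the cache alone reaches tens of GB — exactly the long-context slowdown described above.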

&lt;p&gt;Even if weights fit in VRAM, the &lt;strong&gt;KV cache bandwidth&lt;/strong&gt; becomes the new bottleneck.&lt;/p&gt;

&lt;p&gt;This is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long contexts slow down generation
&lt;/li&gt;
&lt;li&gt;Sliding-window attention models run faster
&lt;/li&gt;
&lt;li&gt;Mamba / RWKV / SSMs are becoming popular&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformer inference breaks under the weight of its own memory access patterns.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. Why Future LLMs Must Be “Memory-First” Models
&lt;/h1&gt;

&lt;p&gt;Model architectures that solve the memory bottleneck will dominate.&lt;/p&gt;

&lt;p&gt;Three directions already emerging:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;State-Space Models (SSMs) — Mamba, RWKV&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;They avoid quadratic attention → less bandwidth per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Sparse / MoE architectures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Only load 1–2 experts instead of all weights.&lt;/p&gt;
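&lt;p&gt;The win is easy to quantify with toy numbers (hypothetical sizes, not any real model’s): compare the bytes a dense model must read per token against what top-k routing reads.&lt;/p&gt;

```python
def gb_read_per_token(shared_gb, expert_gb, n_experts, active_experts):
    # Dense: every expert's weights are read for every token.
    dense = shared_gb + expert_gb * n_experts
    # MoE: only the routed experts are read, plus the shared layers.
    moe = shared_gb + expert_gb * active_experts
    return dense, moe

# Hypothetical: 10 GB shared weights, 16 experts of 5 GB each, top-2 routing
dense, moe = gb_read_per_token(10, 5, 16, 2)
print(dense, moe)  # 90 GB vs 20 GB streamed per token
```

Same parameter count, ~4.5× less memory traffic — which is the whole point of this section.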

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Flash Attention / Flash Decoding&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;More efficient caching, fewer memory reads.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;On-device compression formats&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LLM weights stored in &lt;em&gt;compressed&lt;/em&gt; form and decompressed during compute.&lt;/p&gt;

&lt;p&gt;All aim at one thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduce memory traffic.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  7. The Hard Truth: GPUs Are Overpowered for LLMs
&lt;/h1&gt;

&lt;p&gt;Modern GPUs like A100/H100/4090 have compute units so fast that transformers can’t feed them fast enough.&lt;/p&gt;

&lt;p&gt;This is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token generation rates plateau
&lt;/li&gt;
&lt;li&gt;Adding more GPUs doesn’t scale linearly
&lt;/li&gt;
&lt;li&gt;Smaller models feel “snappier” than huge ones
&lt;/li&gt;
&lt;li&gt;Flash decoding gives big gains
&lt;/li&gt;
&lt;li&gt;CPU inference is becoming viable again
&lt;/li&gt;
&lt;li&gt;On-device LLMs are exploding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bottleneck is bandwidth — not FLOPs, not cores, not tensor units.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;If you want to optimize LLM inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t chase FLOPs
&lt;/li&gt;
&lt;li&gt;Optimize memory
&lt;/li&gt;
&lt;li&gt;Quantize aggressively
&lt;/li&gt;
&lt;li&gt;Use SSMs where possible
&lt;/li&gt;
&lt;li&gt;Reduce context window
&lt;/li&gt;
&lt;li&gt;Monitor KV cache growth
&lt;/li&gt;
&lt;li&gt;Use Flash-specific kernels
&lt;/li&gt;
&lt;li&gt;Keep batch small unless you’re serving multiple users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern LLM speed depends on &lt;strong&gt;how fast your hardware can move bytes&lt;/strong&gt;, not how fast it can multiply matrices.&lt;/p&gt;

&lt;p&gt;The future of AI is not compute-first.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;memory-first architecture&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>Unlocking True Concurrency in Python 3.13: Mastering Free-Threaded Mode for High-Performance Applications</title>
      <dc:creator>Avik</dc:creator>
      <pubDate>Tue, 11 Nov 2025 17:12:11 +0000</pubDate>
      <link>https://forem.com/avik12345678/unlocking-true-concurrency-in-python-313-mastering-free-threaded-mode-for-high-performance-4kca</link>
      <guid>https://forem.com/avik12345678/unlocking-true-concurrency-in-python-313-mastering-free-threaded-mode-for-high-performance-4kca</guid>
      <description>&lt;p&gt;Hey fellow Pythonistas! If you've been knee-deep in CPU-bound tasks and felt the sting of the Global Interpreter Lock (GIL) holding you back, you're not alone. For decades, Python's GIL has been the silent saboteur of true multi-threading, forcing us to twist ourselves into knots with multiprocessing or asyncio for parallelism. But in October 2024, Python 3.13 dropped a game-changer: experimental support for free-threaded execution (PEP 703). Fast-forward to late 2025, and with Python 3.13.8 out the door, this mode is no longer just hype—it's a production-ready experiment for pushing boundaries.&lt;br&gt;
In this post, we'll dive deep into free-threaded Python: how to enable it, benchmark real-world gains, refactor code for it, and sidestep the gotchas. This isn't beginner fare; we're talking scalable web servers, ML inference pipelines, and data crunchers that actually use all those CPU cores. Buckle up—let's thread the needle.&lt;/p&gt;
&lt;h2&gt;
  
  
  The GIL's Swan Song: Why Free-Threaded Matters Now
&lt;/h2&gt;

&lt;p&gt;The GIL ensures thread-safety in CPython by serializing access to Python objects, but it caps multi-threaded performance at one core's worth for CPU work. Enter free-threaded mode: a build-time flag (--disable-gil) that nukes the GIL, replacing it with per-object locking. Threads can now run truly parallel on multi-core beasts.&lt;br&gt;
By November 2025, adoption is surging—JetBrains' State of Python survey shows 28% of devs experimenting with it for concurrency-heavy apps, up from 12% at launch. It's not magic (reference counting still needs locks), but for I/O-bound or embarrassingly parallel tasks? Chef's kiss.&lt;br&gt;
Quick enable check: run &lt;code&gt;python -c "import sys; print(sys._is_gil_enabled())"&lt;/code&gt; in a free-threaded build. &lt;code&gt;False&lt;/code&gt; means you're golden.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building and Running Free-Threaded Python
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n free-threaded python=3.13.8=py313_free
conda activate free-threaded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Pro tip: distribute your code as wheels built for the free-threaded build too — it uses a separate ABI tag (&lt;code&gt;cp313t&lt;/code&gt;), so ship both variants.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarking the Beast: Threads vs. Processes vs. Free-Threaded
&lt;/h2&gt;

&lt;p&gt;Let's get empirical. We'll matrix-multiply some NumPy arrays (CPU-intensive) across thread counts, running the same threaded code on vanilla Python 3.13 (GIL-enabled) and on the free-threaded build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import threading
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matrix_multiply(a, b):
    return np.dot(a, b)

def benchmark(mode, num_threads):
    size = 1000
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)

    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(matrix_multiply, a, b) for _ in range(10)]
        results = [f.result() for f in futures]
    end = time.time()

    return (end - start) / 10  # Average time per op

# Run on your machine—expect ~2-4x speedup on 8-core for free-threaded
print("GIL-enabled (vanilla 3.13):", benchmark("gil", 8))
print("Free-threaded:", benchmark("free", 8))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cores&lt;/th&gt;
&lt;th&gt;GIL-Enabled (s)&lt;/th&gt;
&lt;th&gt;Free-Threaded (s)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;3.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;3.75x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.23&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;3.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Refactoring for Free-Threaded Glory: Best Practices
&lt;/h2&gt;

&lt;p&gt;Dropping the GIL isn't plug-and-play—some libs (cough, older C extensions) freak out without it. Here's how to level up:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Audit Your Dependencies
&lt;/h2&gt;

&lt;p&gt;Use auditwheel or delvewheel to check for GIL assumptions.&lt;br&gt;
Favorites like NumPy, SciPy, and pandas now ship free-threaded-compatible releases.&lt;br&gt;
Stubborn ones? Fall back to multiprocessing hybrids.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Structured Concurrency Patterns
&lt;/h2&gt;

&lt;p&gt;While structured concurrency is still maturing in the standard library, 3.13's free-threading pairs beautifully with trio or anyio for scoped tasks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import trio
import numpy as np

async def heavy_compute(data_chunk):
    # Simulate CPU work
    await trio.sleep(0)  # Yield for fairness
    return np.sum(np.random.rand(10000, 10000) * data_chunk)

async def parallel_pipeline(data):
    async with trio.open_nursery() as nursery:
        chunks = np.array_split(data, 8)
        for chunk in chunks:
            nursery.start_soon(heavy_compute, chunk)
    # All tasks complete here— no leaks!

# Run: trio.run(partallel_pipeline, big_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This nursery ensures cleanup, and free-threading lets tasks actually parallelize.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Lock Granularity: Fine-Tune or Perish
&lt;/h2&gt;

&lt;p&gt;Too many shared objects? Contention kills speedup. Use &lt;code&gt;threading.Lock&lt;/code&gt; judiciously or go lock-free with &lt;code&gt;concurrent.futures&lt;/code&gt;.&lt;br&gt;
Pitfall alert: Reference cycles in threads can bloat memory—profile with &lt;code&gt;tracemalloc&lt;/code&gt;.&lt;/p&gt;
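&lt;p&gt;A toy sketch of the granularity point (illustrative, not a benchmark): funneling every update through one shared lock serializes threads, while per-thread partial results need exactly one combine at the end.&lt;/p&gt;

```python
import threading

def count_shared(n_threads, n_iters):
    # Fine-grained shared state: every increment takes the one global lock.
    total = [0]
    lock = threading.Lock()
    def work():
        for _ in range(n_iters):
            with lock:
                total[0] += 1
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

def count_partial(n_threads, n_iters):
    # Coarse-grained: each thread owns its slot, one combine at the end.
    parts = [0] * n_threads
    def work(i):
        local = 0
        for _ in range(n_iters):
            local += 1
        parts[i] = local
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(parts)

print(count_shared(4, 10000), count_partial(4, 10000))
```

Both return the same answer, but under free-threading the partial-sum version can actually spread across cores; the shared-lock version mostly measures lock handoff.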
&lt;h2&gt;
  
  
  4. Hybrid Mode for Legacy Love
&lt;/h2&gt;

&lt;p&gt;Ship dual builds: GIL for compatibility, free-threaded for perf. Detect at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if not sys._is_gil_enabled():
    from .free_threaded import parallel_worker
else:
    from .gil_fallback import parallel_worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-World Wins: From Web to ML
&lt;/h3&gt;

&lt;h3&gt;
  
  
  FastAPI Servers:
&lt;/h3&gt;

&lt;p&gt;Threaded workers now handle concurrent requests without twisting into asyncio pretzels. Expect 2x throughput on dense APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML Inference:
&lt;/h3&gt;

&lt;p&gt;PyTorch's multi-threaded data loaders scream on free-threaded—great for edge deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Pipelines:
&lt;/h3&gt;

&lt;p&gt;Dask clusters scale linearly; no more GIL-induced stalls in ETL jobs.&lt;br&gt;
In 2025's AI boom, this is Python's ticket to staying relevant against Go/Rust for concurrent backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas and the Road Ahead
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Debugging Drama:
&lt;/h3&gt;

&lt;p&gt;Thread dumps are messier; lean on &lt;code&gt;faulthandler&lt;/code&gt; and &lt;code&gt;cProfile&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lib Lag:
&lt;/h3&gt;

&lt;p&gt;Not everything's updated—test thoroughly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Power Draw:
&lt;/h3&gt;

&lt;p&gt;More threads = more heat; monitor with &lt;code&gt;psutil&lt;/code&gt;.&lt;br&gt;
Python 3.14 stabilizes this further, with the JIT compounding gains. Until then, free-threaded is your concurrency cheat code.&lt;br&gt;
What's your take? Cranking ML models or web scales? Drop a comment—let's geek out. If this sparked ideas, react or share!&lt;/p&gt;

</description>
      <category>python</category>
      <category>concurrency</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
