<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: compilersutra</title>
    <description>The latest articles on Forem by compilersutra (@aabhinavg).</description>
    <link>https://forem.com/aabhinavg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1030092%2Fea9dd738-34c3-4265-9f09-1782daf49d3e.jpeg</url>
      <title>Forem: compilersutra</title>
      <link>https://forem.com/aabhinavg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aabhinavg"/>
    <language>en</language>
    <item>
      <title>From 0 to 1M Impressions: Building a Niche Compiler Blog</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:30:05 +0000</pubDate>
      <link>https://forem.com/aabhinavg/from-0-to-1m-impressions-building-a-niche-compiler-blog-2hk1</link>
      <guid>https://forem.com/aabhinavg/from-0-to-1m-impressions-building-a-niche-compiler-blog-2hk1</guid>
      <description>&lt;p&gt;I’ve been working on a niche site focused on compilers and systems:&lt;br&gt;
👉 &lt;a href="https://www.compilersutra.com" rel="noopener noreferrer"&gt;https://www.compilersutra.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I checked my performance over the last 12 months — and the results surprised me:&lt;/p&gt;

&lt;p&gt;1.09M impressions&lt;br&gt;
11K clicks&lt;br&gt;
Ranking for topics like LLVM, OpenCL, TVM&lt;/p&gt;

&lt;p&gt;All of this came purely from organic search.&lt;/p&gt;

&lt;p&gt;💡 What worked for me&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Picking a niche most people ignore&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Compilers, low-level systems, and ML compilers aren’t “mainstream” topics.&lt;/p&gt;

&lt;p&gt;But that’s exactly why they work.&lt;/p&gt;

&lt;p&gt;Less noise → more authority over time.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Writing for depth, not just keywords&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of chasing trends, I focused on:&lt;/p&gt;

&lt;p&gt;Explaining concepts deeply&lt;br&gt;
Covering real-world use cases&lt;br&gt;
Connecting theory → practical systems&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Consistency over intensity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No crazy posting schedule.&lt;/p&gt;

&lt;p&gt;Just consistent effort over time.&lt;/p&gt;

&lt;p&gt;That’s what compounds.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Structured content &amp;gt; random blogs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started thinking in terms of:&lt;/p&gt;

&lt;p&gt;Learning paths&lt;br&gt;
Roadmaps&lt;br&gt;
Connected topics&lt;/p&gt;

&lt;p&gt;Instead of isolated articles.&lt;/p&gt;

&lt;p&gt;📈 What I’m focusing on next&lt;br&gt;
Expanding compiler-related topics (LLVM, MLIR, TVM)&lt;br&gt;
Building structured learning tracks&lt;br&gt;
Improving user experience and engagement&lt;br&gt;
🤝 Let’s connect&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmddphelr8xmbhew9iye1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmddphelr8xmbhew9iye1.png" alt="compilersutra seo" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re interested in:&lt;/p&gt;

&lt;p&gt;Compilers&lt;br&gt;
Systems programming&lt;br&gt;
Low-level optimization&lt;/p&gt;

&lt;p&gt;Check it out 👉 &lt;a href="https://www.compilersutra.com" rel="noopener noreferrer"&gt;https://www.compilersutra.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love your feedback and suggestions!&lt;/p&gt;

</description>
      <category>buildinpublic</category>
      <category>computerscience</category>
      <category>marketing</category>
      <category>programming</category>
    </item>
    <item>
      <title>Memory Hierarchy</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:49:45 +0000</pubDate>
      <link>https://forem.com/aabhinavg/memory-hierarch-ej1</link>
      <guid>https://forem.com/aabhinavg/memory-hierarch-ej1</guid>
      <description>&lt;p&gt;If you're working on compilers, runtimes, or low-level systems…&lt;br&gt;
Stop asking “what is cache?”&lt;/p&gt;

&lt;p&gt;Start asking 👉 “what kind of miss did my code create?”&lt;/p&gt;

&lt;p&gt;💡 One bad memory access = hundreds of cycles lost&lt;br&gt;
💡 L1 → L3 → DRAM = massive slowdown&lt;br&gt;
💡 Performance = access pattern, not just instructions&lt;br&gt;
I broke it all down with real benchmarks (Ryzen 9700X) 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.compilersutra.com/docs/coa/memory-hierarchy/" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/coa/memory-hierarchy/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚡ Learn:&lt;br&gt;
• Cache misses &amp;amp; set conflicts&lt;br&gt;
• False sharing &amp;amp; multithreading pitfalls&lt;br&gt;
• TLB &amp;amp; page-walk cost&lt;br&gt;
• Why loop tiling gives 30x speedups&lt;/p&gt;
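&lt;p&gt;The access-pattern point can be sketched even in plain Python (a hedged illustration, not the article's Ryzen 9700X benchmark; the function names and matrix size are invented for demonstration):&lt;/p&gt;

```python
# Illustrative sketch: identical work, two access patterns.
# Row-major traversal walks memory contiguously and is cache-friendly;
# column-major strides across a whole row per access and is not.
import time

N = 1024
matrix = [[1] * N for _ in range(N)]

def sum_row_major(m):
    total = 0
    for i in range(N):
        for j in range(N):
            total += m[i][j]   # consecutive elements of one row
    return total

def sum_col_major(m):
    total = 0
    for j in range(N):
        for i in range(N):
            total += m[i][j]   # jumps one full row between accesses
    return total

t0 = time.perf_counter()
a = sum_row_major(matrix)
t1 = time.perf_counter()
b = sum_col_major(matrix)
t2 = time.perf_counter()

print("row-major:", t1 - t0, "col-major:", t2 - t1)
assert a == b == N * N   # same result, different access pattern
```

&lt;p&gt;In C with contiguous arrays the gap is far larger, since after one miss the next several row-major accesses are served from the same cache line.&lt;/p&gt;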

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>AMD ML Complete Stack</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sun, 12 Apr 2026 07:02:28 +0000</pubDate>
      <link>https://forem.com/aabhinavg/amd-ml-complete-stack-3hnm</link>
      <guid>https://forem.com/aabhinavg/amd-ml-complete-stack-3hnm</guid>
      <description>&lt;p&gt;I wrote 6 lines of Triton…&lt;/p&gt;

&lt;p&gt;and it turned into thousands of GPU instructions.&lt;/p&gt;

&lt;p&gt;Python → TTIR → TTGIR → LLVM → AMDGCN → HSACO&lt;/p&gt;

&lt;p&gt;👉 a + b → buffer_load_b128&lt;/p&gt;

&lt;p&gt;👉 mask → v_cmp + conditional execution&lt;/p&gt;

&lt;p&gt;Here’s the truth:&lt;/p&gt;

&lt;p&gt;Your code is NOT what runs on the GPU.&lt;/p&gt;

&lt;p&gt;The compiler builds an entire execution pipeline in between.&lt;/p&gt;

&lt;p&gt;I dumped every stage and traced one kernel end-to-end 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.compilersutra.com/docs/ml-compilers/mlcompilerstack/" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/ml-compilers/mlcompilerstack/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, ML compilers don’t feel like “magic” anymore.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>cpu</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Introduction to ML Compilers + Roadmap (MLIR, TVM, GPU Kernels)</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sat, 11 Apr 2026 11:33:31 +0000</pubDate>
      <link>https://forem.com/aabhinavg/introduction-to-ml-compilers-roadmap-mlir-tvm-gpu-kernels-24hb</link>
      <guid>https://forem.com/aabhinavg/introduction-to-ml-compilers-roadmap-mlir-tvm-gpu-kernels-24hb</guid>
      <description>&lt;p&gt;Most people think they are running Python when they train ML models.&lt;/p&gt;

&lt;p&gt;They are not.&lt;/p&gt;

&lt;p&gt;Python is only the interface.&lt;/p&gt;

&lt;p&gt;The real execution happens somewhere completely different — inside an ML compiler stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What actually happens?
&lt;/h2&gt;

&lt;p&gt;When you write something like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;matmul → add → relu&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It looks simple.&lt;/p&gt;

&lt;p&gt;But internally, the system transforms it into multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python (model definition)&lt;/li&gt;
&lt;li&gt;Graph (tensor operations)&lt;/li&gt;
&lt;li&gt;Execution plan (optimized structure)&lt;/li&gt;
&lt;li&gt;Kernels (GPU/CPU instructions)&lt;/li&gt;
&lt;li&gt;Hardware execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At no point does the GPU “run Python”.&lt;/p&gt;

&lt;p&gt;It runs &lt;strong&gt;compiled kernels&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Why ML Compilers exist
&lt;/h2&gt;

&lt;p&gt;Because raw model code is inefficient for hardware.&lt;/p&gt;

&lt;p&gt;Without a compiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too many kernel launches&lt;/li&gt;
&lt;li&gt;Unnecessary memory transfers&lt;/li&gt;
&lt;li&gt;No operator fusion&lt;/li&gt;
&lt;li&gt;Poor GPU utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a compiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations are fused&lt;/li&gt;
&lt;li&gt;Memory movement is reduced&lt;/li&gt;
&lt;li&gt;Execution is optimized for hardware&lt;/li&gt;
&lt;/ul&gt;
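&lt;p&gt;The fusion bullet can be sketched with NumPy (an illustrative toy, not the output of any real ML compiler; the function names here are invented):&lt;/p&gt;

```python
# Toy illustration of operator fusion: unfused, each op materializes a
# full intermediate tensor; "fused", one expression produces the result
# directly, which a real compiler would lower to a single kernel.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
w = rng.standard_normal((64, 64))
b = rng.standard_normal(64)

def unfused(x, w, b):
    t1 = x @ w                    # matmul  -> intermediate buffer
    t2 = t1 + b                   # add     -> another intermediate buffer
    return np.maximum(t2, 0.0)    # relu    -> final buffer

def fused(x, w, b):
    # no named intermediates kept alive; less memory traffic
    return np.maximum(x @ w + b, 0.0)

assert np.allclose(unfused(x, w, b), fused(x, w, b))
```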

&lt;h2&gt;
  
  
  🔥 Key concepts covered
&lt;/h2&gt;

&lt;p&gt;This article builds the foundation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MLIR (multi-level IR systems)&lt;/li&gt;
&lt;li&gt;TVM (end-to-end ML compiler stack)&lt;/li&gt;
&lt;li&gt;GPU kernel execution model&lt;/li&gt;
&lt;li&gt;Operator fusion &amp;amp; memory planning&lt;/li&gt;
&lt;li&gt;Compilation pipeline design&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧭 Roadmap (what you’ll learn)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Tensors, shapes, memory layout&lt;/li&gt;
&lt;li&gt;CPU vs GPU execution model&lt;/li&gt;
&lt;li&gt;Compiler basics (IR, lowering, passes)&lt;/li&gt;
&lt;li&gt;ML compiler optimizations&lt;/li&gt;
&lt;li&gt;Real systems (TVM, MLIR, XLA)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📘 Full Article
&lt;/h2&gt;

&lt;p&gt;👉 &lt;a href="https://www.compilersutra.com/docs/ml-compile" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/ml-compile&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Building CompilerSutra</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Thu, 02 Apr 2026 04:09:59 +0000</pubDate>
      <link>https://forem.com/aabhinavg/building-compilersutra-26a9</link>
      <guid>https://forem.com/aabhinavg/building-compilersutra-26a9</guid>
      <description>&lt;p&gt;🚀 Building practical content on compilers, LLVM, MLIR, and performance.&lt;/p&gt;

&lt;p&gt;If this sounds interesting, you can join here:&lt;br&gt;
&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSebP1JfLFDp0ckTxOhODKPNVeI1e21rUqMJ0fbBwJoaa-i4Yw/viewform" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love to know what topics you’d like covered.&lt;/p&gt;

</description>
      <category>llvm</category>
      <category>computerscience</category>
      <category>college</category>
    </item>
    <item>
      <title>How a CPU Actually Executes Your Code (Most Developers Get This Wrong)</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sun, 29 Mar 2026 08:27:37 +0000</pubDate>
      <link>https://forem.com/aabhinavg/how-a-cpu-actually-executes-your-code-most-developers-get-this-wrong-4dci</link>
      <guid>https://forem.com/aabhinavg/how-a-cpu-actually-executes-your-code-most-developers-get-this-wrong-4dci</guid>
      <description>&lt;p&gt;Read full blog at &lt;a href="https://www.compilersutra.com/docs/coa/" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/coa/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most developers think the CPU “runs code”.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It executes &lt;strong&gt;raw bytes&lt;/strong&gt; — billions of times per second — using a tightly optimized loop called the &lt;strong&gt;instruction cycle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding this is the difference between writing code… and writing &lt;strong&gt;fast code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 The Reality
&lt;/h2&gt;

&lt;p&gt;When your program runs, the CPU does NOT see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;variables&lt;/li&gt;
&lt;li&gt;loops&lt;/li&gt;
&lt;li&gt;functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It only sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;instruction bytes&lt;/li&gt;
&lt;li&gt;memory&lt;/li&gt;
&lt;li&gt;registers&lt;/li&gt;
&lt;li&gt;a pointer to the next instruction (PC)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else is already gone.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ The Instruction Cycle (Simplified)
&lt;/h2&gt;

&lt;p&gt;Every instruction goes through this loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetch&lt;/strong&gt; → Get instruction from memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode&lt;/strong&gt; → Understand what it means&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; → Perform the operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writeback&lt;/strong&gt; → Store the result&lt;/li&gt;
&lt;/ol&gt;
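&lt;p&gt;The four stages can be mimicked with a toy interpreter loop (purely illustrative; real hardware decodes instruction bytes, not Python tuples):&lt;/p&gt;

```python
# Toy fetch-decode-execute-writeback loop: the "CPU" sees only
# instructions, registers, and a program counter (PC), never the
# variables or loops you wrote.
memory = [
    ("LOAD", 0, 5),     # r0 = 5
    ("LOAD", 1, 7),     # r1 = 7
    ("ADD",  2, 0, 1),  # r2 = r0 + r1
    ("HALT",),
]
regs = [0, 0, 0, 0]
pc = 0

while True:
    inst = memory[pc]   # Fetch: read the instruction at PC
    op = inst[0]        # Decode: figure out what it means
    pc += 1
    if op == "HALT":
        break
    if op == "LOAD":    # Execute, then Writeback to a register
        regs[inst[1]] = inst[2]
    elif op == "ADD":
        regs[inst[1]] = regs[inst[2]] + regs[inst[3]]

print(regs)  # regs[2] holds 12
```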

&lt;p&gt;This happens &lt;strong&gt;billions of times per second&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚡ Why This Matters
&lt;/h2&gt;

&lt;p&gt;Two pieces of code can look similar…&lt;/p&gt;

&lt;p&gt;…but run VERY differently.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because performance depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory access (cache vs RAM)&lt;/li&gt;
&lt;li&gt;instruction dependencies&lt;/li&gt;
&lt;li&gt;pipeline behavior inside the CPU&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚨 Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mov eax, [rbx]
add ecx, eax
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;[rbx]&lt;/code&gt; hits in cache → fast&lt;br&gt;
If it goes to RAM → 200+ cycles stall&lt;/p&gt;

&lt;p&gt;👉 The CPU isn’t slow.&lt;br&gt;
👉 Memory is.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔥 The Real Trick: Pipelining
&lt;/h2&gt;

&lt;p&gt;Modern CPUs don’t wait for one instruction to finish.&lt;/p&gt;

&lt;p&gt;They overlap them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one instruction in Fetch&lt;/li&gt;
&lt;li&gt;one in Decode&lt;/li&gt;
&lt;li&gt;one in Execute&lt;/li&gt;
&lt;li&gt;one in Writeback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is called a &lt;strong&gt;pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s how CPUs stay fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Performance is NOT just about instructions.&lt;br&gt;
It’s about how the CPU &lt;strong&gt;feeds and executes them&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Full Interactive Breakdown
&lt;/h2&gt;

&lt;p&gt;I built a full version with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pipeline animations&lt;/li&gt;
&lt;li&gt;cache stall visualizations&lt;/li&gt;
&lt;li&gt;real execution flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://www.compilersutra.com/docs/coa/" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/coa/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is part of my deep-dive series on compilers, LLVM, and CPU performance.&lt;/p&gt;

</description>
      <category>development</category>
      <category>developer</category>
      <category>systemdesign</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Adding MCQs for LLVM &amp; Systems Learning on CompilerSutra</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sat, 28 Mar 2026 02:18:37 +0000</pubDate>
      <link>https://forem.com/aabhinavg/adding-mcqs-for-llvm-systems-learning-on-compilersutra-1f2g</link>
      <guid>https://forem.com/aabhinavg/adding-mcqs-for-llvm-systems-learning-on-compilersutra-1f2g</guid>
      <description>&lt;p&gt;*&lt;em&gt;I just added an MCQ section for Compiler &amp;amp; LLVM learners!&lt;br&gt;
*&lt;/em&gt;&lt;br&gt;
If you're preparing for compiler interviews or want to strengthen your fundamentals, this might help 👇&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.compilersutra.com/docs/mcq/" rel="noopener noreferrer"&gt;https://www.compilersutra.com/docs/mcq/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 What you'll find:&lt;br&gt;
• Compiler design MCQs&lt;br&gt;
• LLVM-focused questions&lt;br&gt;
• Concept-based learning (not just memorization)&lt;br&gt;
• Helpful for interviews + self-assessment&lt;/p&gt;

&lt;p&gt;This is an early-stage feature, so feedback is super welcome 🙌&lt;/p&gt;

&lt;p&gt;Next planned improvements:&lt;br&gt;
• Difficulty levels&lt;br&gt;
• Topic-wise segregation&lt;br&gt;
• Explanations for every question&lt;/p&gt;

&lt;p&gt;If you're into compilers, low-level systems, or LLVM, I'd love your thoughts!&lt;/p&gt;

</description>
      <category>mcq</category>
      <category>cpp</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>GCC vs Clang: Same Instructions, Different Performance (AGU Insight)</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:01:58 +0000</pubDate>
      <link>https://forem.com/aabhinavg/gcc-vs-clang-same-instructions-different-performance-agu-insight-1pae</link>
      <guid>https://forem.com/aabhinavg/gcc-vs-clang-same-instructions-different-performance-agu-insight-1pae</guid>
      <description>&lt;p&gt;*&lt;em&gt;I noticed something interesting while running a GCC vs Clang benchmark.&lt;br&gt;
*&lt;/em&gt;&lt;br&gt;
Same code. Same machine.&lt;br&gt;
Both loops are scalar (no vectorization).&lt;/p&gt;

&lt;p&gt;Yet… &lt;em&gt;GCC consistently used fewer CPU cycles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At first, this doesn’t make sense.&lt;/p&gt;

&lt;p&gt;If both:&lt;/p&gt;

&lt;p&gt;execute roughly the same instructions&lt;br&gt;
are not vectorized&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is there a performance gap?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔍 The Missing Piece: It’s Not Just Instructions&lt;br&gt;
Most people focus on:&lt;br&gt;
instruction count&lt;br&gt;
vectorization&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But in this case, that’s not the full story.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What actually matters more is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how address computations are structured&lt;/li&gt;
&lt;li&gt;how instructions are scheduled&lt;/li&gt;
&lt;li&gt;how well latency is hidden&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is the data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmoirlsj1ngynh3ey7eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmoirlsj1ngynh3ey7eo.png" alt="GCC VS CLANG" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚙️ AGU Pressure (Address Generation Units)&lt;/p&gt;

&lt;p&gt;On x86 CPUs, memory instructions rely on AGUs (Address Generation Units).&lt;/p&gt;

&lt;p&gt;Complex addressing patterns like:&lt;/p&gt;

&lt;p&gt;base + index * scale + offset&lt;/p&gt;

&lt;p&gt;&lt;em&gt;👉 increase AGU pressure&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Whereas simpler patterns like:&lt;br&gt;
pointer++&lt;br&gt;
👉 are cheaper and easier for the CPU to execute efficiently&lt;/p&gt;

&lt;p&gt;🧪 What I Observed&lt;br&gt;
GCC:&lt;br&gt;
Generates simpler addressing patterns&lt;br&gt;
Reduces AGU contention&lt;br&gt;
Keeps execution more consistent&lt;br&gt;
Clang:&lt;br&gt;
Shows higher AGU pressure&lt;br&gt;
More stalls&lt;br&gt;
Less efficient scheduling (in this case)&lt;/p&gt;

&lt;p&gt;⚡ Key Takeaway&lt;br&gt;
It’s not just about what instructions exist.&lt;/p&gt;

&lt;p&gt;It’s about:&lt;br&gt;
How efficiently the compiler feeds the CPU pipeline&lt;/p&gt;

&lt;p&gt;Same instruction count ≠ same performance.&lt;/p&gt;

&lt;p&gt;📊 Why This Matters&lt;/p&gt;

&lt;p&gt;In tight loops:&lt;/p&gt;

&lt;p&gt;AGU pressure&lt;br&gt;
addressing patterns&lt;br&gt;
instruction scheduling&lt;/p&gt;

&lt;p&gt;👉 can matter as much as (or more than) vectorization&lt;/p&gt;

&lt;p&gt;🔗 Want to Dive Deeper?&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://compilersutra.com/docs/articles/gcc_vs_clang_assembly_part2a/" rel="noopener noreferrer"&gt;Full benchmark + assembly breakdown:&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://compilersutra.com/docs/articles/gcc_vs_clang_real_benchmarks_2026_reporter/&amp;lt;br&amp;gt;%0A![CLI%20COMMAND%20USED](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f4rnzz1p4mcutwhreb8j.png)" rel="noopener noreferrer"&gt;Complete analysis article:&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💬 Discussion&lt;/p&gt;

&lt;p&gt;Have you seen cases where:&lt;/p&gt;

&lt;p&gt;similar assembly&lt;br&gt;
same instruction count&lt;/p&gt;

&lt;p&gt;👉 still results in very different performance?&lt;/p&gt;

&lt;p&gt;Would love to hear your observations.&lt;/p&gt;

</description>
      <category>gcc</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>ai</category>
    </item>
    <item>
      <title>C++ Tips for Performance</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sun, 13 Apr 2025 07:37:14 +0000</pubDate>
      <link>https://forem.com/aabhinavg/cpp-tip-for-the-performance-kfm</link>
      <guid>https://forem.com/aabhinavg/cpp-tip-for-the-performance-kfm</guid>
      <description>&lt;p&gt;C++ Tip # 1: &lt;br&gt;
&lt;a href="https://lnkd.in/gZ6mqHyW" rel="noopener noreferrer"&gt;https://lnkd.in/gZ6mqHyW&lt;/a&gt;&lt;br&gt;
C++Tip #2:&lt;br&gt;
 &lt;a href="https://lnkd.in/gPyaC7B6" rel="noopener noreferrer"&gt;https://lnkd.in/gPyaC7B6&lt;/a&gt;&lt;br&gt;
C++Tip #3: &lt;a href="https://lnkd.in/gjDQE9Je" rel="noopener noreferrer"&gt;https://lnkd.in/gjDQE9Je&lt;/a&gt;&lt;br&gt;
C++ Tip #4: &lt;a href="https://lnkd.in/gR4iYWSx" rel="noopener noreferrer"&gt;https://lnkd.in/gR4iYWSx&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;C++ Tip #5:&lt;/strong&gt;&lt;br&gt;
Prefer nullptr over NULL or 0 — Type-Safe and Modern&lt;/p&gt;

&lt;p&gt;🔒 &lt;em&gt;nullptr was introduced in C++11 and is the clear, type-safe way to represent a null pointer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💥 &lt;strong&gt;Why avoid NULL or 0?&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;NULL is just a macro for 0 (or ((void*)0) in C), which can accidentally match overloads or templates not meant for pointers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0 is ambiguous — is it an integer or a null pointer?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Better approach:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Use nullptr — it has type std::nullptr_t, so it only converts to pointer types.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🔐 &lt;strong&gt;Modern C++ is about expressiveness + safety.&lt;/strong&gt;&lt;br&gt;
Don’t write like it’s 1998. Upgrade to the features C++11 and beyond offer!&lt;/p&gt;

&lt;p&gt;Follow CompilerSutra for more such tips and subscribe 👉 &lt;a href="https://compilersutra.com" rel="noopener noreferrer"&gt;https://compilersutra.com&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1e3w3fvdse45nhx3mc6m.png)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




</description>
      <category>cpp</category>
      <category>tutorial</category>
      <category>performance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Learning Compiler and Parallel Programming in 2025</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sun, 06 Apr 2025 01:10:31 +0000</pubDate>
      <link>https://forem.com/aabhinavg/learning-compiler-and-parallel-programming-in-2025-1mfg</link>
      <guid>https://forem.com/aabhinavg/learning-compiler-and-parallel-programming-in-2025-1mfg</guid>
      <description></description>
      <category>cpp</category>
      <category>learning</category>
      <category>programming</category>
      <category>gnu</category>
    </item>
    <item>
      <title>Introduction to Parallel Programming: Unlocking the Power of GPUs(Part 1)</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Sun, 06 Apr 2025 01:04:07 +0000</pubDate>
      <link>https://forem.com/aabhinavg/introduction-to-parallel-programming-unlocking-the-power-of-gpuspart-1-h2h</link>
      <guid>https://forem.com/aabhinavg/introduction-to-parallel-programming-unlocking-the-power-of-gpuspart-1-h2h</guid>
      <description>&lt;p&gt;Parallel programming is a powerful technique that allows us to take full advantage of the capabilities of modern computing systems, particularly GPUs. By breaking down a task into smaller sub-tasks and running them concurrently, we can achieve higher performance and solve complex problems more efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t8r9agig1yo43rhxigi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t8r9agig1yo43rhxigi.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
For More &lt;a href="https://www.compilersutra.com/docs/gpu/parallel_programming/intro_to_parallel_programming/" rel="noopener noreferrer"&gt;visit&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In this post, we’ll explore the basics of parallel programming, its importance in modern computing, and how you can get started with GPU programming to accelerate your applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Parallel Programming?&lt;/strong&gt;&lt;br&gt;
In the world of computing, many tasks can be parallelized, meaning that they can be broken into smaller pieces that can be processed simultaneously. This is especially true for applications requiring massive computational power, like machine learning, simulations, image processing, and scientific computing.&lt;/p&gt;

&lt;p&gt;Before GPUs, most computations were done on a single CPU core, which had limitations in processing speed. With parallel computing, multiple processors (cores) can work together to solve different parts of a problem simultaneously, greatly improving performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a GPU?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Graphics Processing Unit (GPU) is a highly parallel processor designed to handle tasks related to graphics rendering. However, it’s not limited to just graphical applications. Over the years, GPUs have become essential for accelerating non-graphical tasks, particularly in fields like machine learning, data science, and scientific computing.&lt;/p&gt;

&lt;p&gt;Unlike traditional CPUs, which are optimized for single-threaded performance, GPUs are designed to handle thousands of threads simultaneously, making them ideal for parallel tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Concepts in Parallel Programming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Concurrency vs. Parallelism:&lt;/em&gt;&lt;br&gt;
Concurrency refers to the concept of multiple tasks being executed in overlapping periods but not necessarily simultaneously.&lt;/p&gt;

&lt;p&gt;Parallelism, on the other hand, is about performing tasks simultaneously using multiple processors or cores.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Threads:&lt;/em&gt;&lt;br&gt;
A thread is the smallest unit of execution in a process. In parallel programming, you typically create multiple threads to handle different parts of the computation simultaneously.&lt;/p&gt;

&lt;p&gt;GPUs can execute thousands of threads in parallel, making them much faster for certain types of problems.&lt;/p&gt;

&lt;p&gt;Synchronization:&lt;/p&gt;

&lt;p&gt;When multiple threads are running simultaneously, it's crucial to synchronize them to avoid conflicts, such as multiple threads trying to access the same data at the same time.&lt;/p&gt;
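&lt;p&gt;A minimal sketch of that synchronization idea using Python's threading module (illustrative only; GPU code relies on barriers and atomic operations instead, and the names below are invented):&lt;/p&gt;

```python
# Without the lock, the concurrent read-modify-write on `counter`
# can lose updates; holding the lock makes each increment atomic.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: no increments lost
```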

&lt;p&gt;Memory Management:&lt;/p&gt;

&lt;p&gt;Efficient use of memory is key to parallel programming. GPUs have a different memory architecture compared to CPUs, and understanding how to optimize memory access can drastically improve performance.&lt;/p&gt;

&lt;p&gt;Getting Started with GPU Parallel Programming&lt;br&gt;
Now that we have a basic understanding of parallel programming, let's see how to get started with GPU programming. Here are a few tools and frameworks that make it easier:&lt;/p&gt;

&lt;p&gt;CUDA (Compute Unified Device Architecture):&lt;/p&gt;

&lt;p&gt;CUDA is a programming model and API created by NVIDIA that allows you to use GPUs for general-purpose computing. It supports C, C++, and Python and provides a rich set of libraries and tools to accelerate your programs.&lt;/p&gt;

&lt;p&gt;OpenCL:&lt;/p&gt;

&lt;p&gt;OpenCL (Open Computing Language) is an open standard for parallel programming across heterogeneous systems, including CPUs, GPUs, and other processors. It supports multiple programming languages, including C and C++.&lt;/p&gt;

&lt;p&gt;TensorFlow &amp;amp; PyTorch:&lt;/p&gt;

&lt;p&gt;Both TensorFlow and PyTorch support GPU acceleration out of the box. These frameworks are especially popular in the machine learning and data science communities for training deep learning models.&lt;/p&gt;

&lt;p&gt;NVIDIA cuDNN:&lt;/p&gt;

&lt;p&gt;cuDNN is a GPU-accelerated library for deep neural networks. It is optimized for deep learning operations and is commonly used with frameworks like TensorFlow, Keras, and PyTorch.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Parallel programming is essential for taking full advantage of modern computing power, and GPUs are an incredible tool for speeding up computation. By learning parallel programming concepts and tools like CUDA and OpenCL, you can harness the power of GPUs to accelerate your applications in fields like machine learning, simulation, and more.&lt;/p&gt;

&lt;p&gt;Want to learn more about GPU programming? Check out the full guide on CompilerSutra for more in-depth explanations, code examples, and best practices.&lt;br&gt;
 let's unlock the true power of parallel computing!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>gpu</category>
      <category>opensource</category>
      <category>mojo</category>
    </item>
    <item>
      <title>LLVM vs. GCC: A Comprehensive Comparison</title>
      <dc:creator>compilersutra</dc:creator>
      <pubDate>Fri, 14 Mar 2025 16:39:01 +0000</pubDate>
      <link>https://forem.com/aabhinavg/llvm-vs-gcc-a-comprehensive-comparison-33hj</link>
      <guid>https://forem.com/aabhinavg/llvm-vs-gcc-a-comprehensive-comparison-33hj</guid>
      <description>&lt;p&gt;For a deeper dive, check out the full article on &lt;strong&gt;CompilerSutra&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://www.compilersutra.com/docs/compilers/gcc_vs_llvm/" rel="noopener noreferrer"&gt;LLVM vs. GCC: A Detailed Comparison&lt;/a&gt;  &lt;/p&gt;

&lt;h1&gt;
  
  
  LLVM vs. GCC: A Comprehensive Comparison
&lt;/h1&gt;

&lt;p&gt;When it comes to compiler toolchains, &lt;strong&gt;LLVM&lt;/strong&gt; and &lt;strong&gt;GCC&lt;/strong&gt; are the two most widely used and debated options. Each has its strengths, trade-offs, and use cases, making it essential to understand their differences before choosing one for your project.  &lt;/p&gt;

&lt;h2&gt;
  
  
  🔹 &lt;strong&gt;What is LLVM?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLVM is a modern, modular, and reusable compiler infrastructure. It is designed to support multiple languages and architectures while providing powerful optimization capabilities.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features of LLVM:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Modular and reusable design
&lt;/li&gt;
&lt;li&gt;Better optimization for modern architectures
&lt;/li&gt;
&lt;li&gt;Intermediate representation (LLVM IR) allows advanced transformations
&lt;/li&gt;
&lt;li&gt;Clang frontend provides faster compilation and better diagnostics
&lt;/li&gt;
&lt;li&gt;Supports Just-In-Time (JIT) compilation
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔹 &lt;strong&gt;What is GCC?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GCC (GNU Compiler Collection) is a mature and widely used open-source compiler that supports multiple programming languages, including C, C++, Fortran, and more.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features of GCC:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mature and well-tested over decades
&lt;/li&gt;
&lt;li&gt;Supports a wide range of architectures
&lt;/li&gt;
&lt;li&gt;Strong optimization capabilities
&lt;/li&gt;
&lt;li&gt;Rich debugging and profiling tools
&lt;/li&gt;
&lt;li&gt;Strict adherence to language standards
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🆚 &lt;strong&gt;LLVM vs. GCC: Key Differences&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LLVM&lt;/th&gt;
&lt;th&gt;GCC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compilation Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Faster (due to Clang frontend)&lt;/td&gt;
&lt;td&gt;Slower compared to LLVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More aggressive optimizations via LLVM IR&lt;/td&gt;
&lt;td&gt;Strong optimizations but less modular&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging &amp;amp; Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Better error messages &amp;amp; diagnostics&lt;/td&gt;
&lt;td&gt;Standard error reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Modularity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highly modular (can be used as a library)&lt;/td&gt;
&lt;td&gt;Monolithic design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JIT Compilation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports JIT compilation&lt;/td&gt;
&lt;td&gt;No built-in JIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports many languages via Clang&lt;/td&gt;
&lt;td&gt;Broad language support, including legacy ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adoption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Used in modern projects like Swift, Rust, and Android&lt;/td&gt;
&lt;td&gt;Used in Linux kernel, embedded systems, and legacy projects&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  📌 &lt;strong&gt;Which One Should You Choose?&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;LLVM/Clang&lt;/strong&gt; if you need &lt;strong&gt;faster compilation, better diagnostics, JIT capabilities, and modularity&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;GCC&lt;/strong&gt; if you need &lt;strong&gt;strong compatibility, strict standards adherence, and support for legacy architectures&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a deeper dive, check out the full article on &lt;strong&gt;CompilerSutra&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://www.compilersutra.com/docs/compilers/gcc_vs_llvm/" rel="noopener noreferrer"&gt;LLVM vs. GCC: A Detailed Comparison&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>programming</category>
      <category>gnu</category>
    </item>
  </channel>
</rss>
