<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: LumGenLab</title>
    <description>The latest articles on Forem by LumGenLab (@lumgenlab).</description>
    <link>https://forem.com/lumgenlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3309744%2F9a9d015b-720f-4dcd-aeb0-08390998ea55.jpg</url>
      <title>Forem: LumGenLab</title>
      <link>https://forem.com/lumgenlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lumgenlab"/>
    <language>en</language>
    <item>
      <title>I Built a GPT Model from Scratch in C++ (Runs on 2GB RAM!)</title>
      <dc:creator>LumGenLab</dc:creator>
      <pubDate>Mon, 01 Sep 2025 17:05:47 +0000</pubDate>
      <link>https://forem.com/lumgenlab/i-built-a-gpt-model-from-scratch-in-c-runs-on-2gb-ram-16nj</link>
      <guid>https://forem.com/lumgenlab/i-built-a-gpt-model-from-scratch-in-c-runs-on-2gb-ram-16nj</guid>
      <description>&lt;p&gt;Ever wondered what it takes to build a transformer from absolute scratch? No PyTorch, no TensorFlow, just raw C++ and mathematical determination.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Most GPT implementations today rely on heavyweight frameworks that abstract away the core mechanics. I wanted to understand every matrix multiplication, every gradient calculation, and every optimization step. So I built LumGPT - a complete GPT implementation in pure C++.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Constraints
&lt;/h2&gt;

&lt;p&gt;My hardware isn't exactly cutting edge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AMD Phenom Triple-Core @ 2.4GHz (2008 era)&lt;/li&gt;
&lt;li&gt;2GB DDR2 RAM with only 700MB free&lt;/li&gt;
&lt;li&gt;No GPU (GTX 210 doesn't count)&lt;/li&gt;
&lt;li&gt;Regular HDD storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question was: can you train a transformer on hardware that predates the transformer paper?&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;LumGPT includes everything you'd expect from a production transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-head attention with causal masking&lt;/li&gt;
&lt;li&gt;Layer normalization (pre-LN like GPT-2)&lt;/li&gt;
&lt;li&gt;Feed-forward networks with GELU activation&lt;/li&gt;
&lt;li&gt;AdamW optimizer with weight decay&lt;/li&gt;
&lt;li&gt;Advanced sampling (temperature + top-k)&lt;/li&gt;
&lt;li&gt;Custom tensor operations optimized for cache efficiency&lt;/li&gt;
&lt;/ul&gt;
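&lt;p&gt;To give a flavor of that last sampling bullet, here's a minimal sketch of temperature scaling plus top-k filtering. This is illustrative only; names like &lt;code&gt;top_k_probs&lt;/code&gt; are mine, not LumGPT's actual API:&lt;/p&gt;

```cpp
// Illustrative sketch (not the actual LumGPT source): temperature scaling
// followed by top-k filtering over a vector of logits.
#include <algorithm>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Convert logits to probabilities, keeping only the k largest entries.
// Assumes 1 <= k <= logits.size(); ties at the cutoff may keep a few extra.
std::vector<double> top_k_probs(std::vector<double> logits,
                                std::size_t k, double temperature) {
    for (double& l : logits) l /= temperature;      // sharpen or flatten
    std::vector<double> sorted = logits;
    std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(),
                     std::greater<double>());
    double kth = sorted[k - 1];                     // k-th largest logit
    double max_l = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    std::vector<double> probs(logits.size(), 0.0);
    for (std::size_t i = 0; i < logits.size(); ++i) {
        if (logits[i] >= kth) {                     // keep the top-k only
            probs[i] = std::exp(logits[i] - max_l); // shifted for stability
            sum += probs[i];
        }
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Draw one token index from the filtered distribution.
int sample(const std::vector<double>& probs, std::mt19937& rng) {
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

&lt;p&gt;Dividing the logits by the temperature sharpens the distribution when it is below 1 and flattens it when it is above 1, before the top-k cut discards the tail.&lt;/p&gt;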

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory footprint: 32MB&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;CPU usage: 45%&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Training time: 8 minutes per 200 iterations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loss progression:
Step 0: 4.5875
Step 200: 3.1597
Step 2000: 3.2377
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model converged reasonably well on TinyShakespeare, but here's where it gets interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dataset Experiment
&lt;/h2&gt;

&lt;p&gt;TinyShakespeare has 40,000 lines but only 65 unique characters. I tried something different: a custom dataset of 202 modern jokes (Nasiruddin collection) with only 3,000 lines but 82 unique characters.&lt;/p&gt;

&lt;p&gt;The smaller dataset with richer vocabulary actually showed better learning characteristics. Sometimes data quality beats quantity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Management
&lt;/h3&gt;

&lt;p&gt;Every tensor operation is optimized for cache locality. No dynamic allocations during training loops. Thread-local RNG for reproducibility without global state.&lt;/p&gt;
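&lt;p&gt;The thread-local RNG idea can be sketched like this (illustrative, with assumed names such as &lt;code&gt;local_rng&lt;/code&gt;, not the engine's actual code): each thread owns one generator, seeded deterministically, so runs are reproducible without any locks or globals.&lt;/p&gt;

```cpp
// Sketch of "thread-local RNG, no global state" (illustrative, not the
// actual LumGPT source). Each thread gets its own generator, seeded from
// a base seed plus its thread id on first use, then reused thereafter.
#include <cstdint>
#include <random>

std::mt19937& local_rng(std::uint32_t base_seed, std::uint32_t thread_id) {
    // Constructed once per thread on first call; later calls return the
    // same generator (the arguments only matter the first time).
    thread_local std::mt19937 rng(base_seed + thread_id);
    return rng;
}

double uniform01(std::mt19937& rng) {
    return std::uniform_real_distribution<double>(0.0, 1.0)(rng);
}
```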

&lt;h3&gt;
  
  
  Mathematical Precision
&lt;/h3&gt;

&lt;p&gt;All gradients computed from first principles. Layer norm backward pass implements the full mathematical derivation, not approximations. Combined softmax-cross entropy gradients for numerical stability.&lt;/p&gt;
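&lt;p&gt;The fused softmax-cross-entropy trick deserves a quick sketch (again illustrative, not the repo's code): the gradient with respect to the logits collapses to &lt;code&gt;softmax(logits) - one_hot(target)&lt;/code&gt;, so you never divide by tiny probabilities.&lt;/p&gt;

```cpp
// Illustrative sketch of the combined softmax + cross-entropy backward
// pass: d(loss)/d(logits) = softmax(logits) - one_hot(target).
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> softmax_ce_grad(const std::vector<double>& logits,
                                    std::size_t target) {
    double max_l = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    std::vector<double> grad(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        grad[i] = std::exp(logits[i] - max_l);   // shifted for stability
        sum += grad[i];
    }
    for (double& g : grad) g /= sum;             // grad now holds softmax
    grad[target] -= 1.0;                         // subtract one-hot label
    return grad;
}
```

&lt;p&gt;Note that the components of the gradient always sum to zero: raising one logit must lower the relative probability of the others.&lt;/p&gt;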

&lt;h3&gt;
  
  
  The Attention Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Scaled dot-product with causal masking&lt;/span&gt;
&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_head_T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_k&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Apply causal mask&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;seq_len&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NEG_INF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No shortcuts. Every operation follows the mathematical definitions exactly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is just version 1. The next iteration will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-bit quantization with QAT&lt;/li&gt;
&lt;li&gt;RoPE positional embeddings&lt;/li&gt;
&lt;li&gt;ALiBi attention bias&lt;/li&gt;
&lt;li&gt;Eigen 3.4.0 integration&lt;/li&gt;
&lt;li&gt;Custom inference optimizations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Framework abstractions are useful, but they can hide fundamental understanding. Building from scratch taught me why certain architectural choices matter, how gradients actually flow through transformers, and where the computational bottlenecks really are.&lt;/p&gt;

&lt;p&gt;Plus, proving that meaningful AI can run on decade-old hardware opens possibilities for edge deployment and democratized access.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;The complete implementation is open source on &lt;a href="https://github.com/LumGenLab/LumGPT"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Building your own transformer is challenging but incredibly rewarding. You gain intuition that no amount of framework usage can provide.&lt;/p&gt;




&lt;p&gt;What's your experience with implementing models from scratch? Have you tried building transformers without frameworks?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>opensource</category>
      <category>cpp</category>
    </item>
    <item>
      <title>I Built a Ray Tracer That Runs on a Dinosaur PC and It's Cooking RTX 4090s</title>
      <dc:creator>LumGenLab</dc:creator>
      <pubDate>Wed, 20 Aug 2025 15:25:40 +0000</pubDate>
      <link>https://forem.com/lumgenlab/i-built-a-ray-tracer-that-runs-on-a-dinosaur-pc-and-its-cooking-rtx-4090s-18ma</link>
      <guid>https://forem.com/lumgenlab/i-built-a-ray-tracer-that-runs-on-a-dinosaur-pc-and-its-cooking-rtx-4090s-18ma</guid>
      <description>&lt;p&gt;So... I just spent 30 minutes building a ray tracer from scratch in pure C++17, and the results are honestly blowing my mind. Not because it's some groundbreaking new algorithm, but because it's running on hardware that's older than some of the developers reading this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup (Prepare to Laugh)
&lt;/h2&gt;

&lt;p&gt;Here's what I'm working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: AMD Phenom™ Triple-Core 2.40 GHz (yes, triple-core was a thing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM&lt;/strong&gt;: 2GB DDR2 (with 564MB actually free)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: GTX 210 (which doesn't even show up as available, just outputs to monitor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: 149GB HDD (the clicking kind)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is literally a PC from the Pentium dual-core era that someone slapped an AMD sticker on. I'm pretty sure my phone has more computing power.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Did This
&lt;/h2&gt;

&lt;p&gt;I was getting frustrated with modern rendering engines. Want to do some ray tracing in Blender? Better have 16GB+ RAM. Unreal Engine 5 with Lumen? Hope you've got that RTX 4090 ready. &lt;/p&gt;

&lt;p&gt;But here's the thing - ray tracing is just math. Really beautiful, elegant math. So I thought: what if I stripped away all the bloat and just focused on the core algorithms?&lt;/p&gt;

&lt;p&gt;Mainstream tools like Blender, Maya, Unreal Engine, Unity, and even the comparatively lightweight Godot Engine won't run on my PC, so I decided to build an advanced ray-tracing engine that would, and that could still hold its own against their software renderers. No OpenGL, no external libraries, no fancy frameworks. Just pure C++17 and the Windows API for display. One file. One simple, beautiful file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results Speak for Themselves
&lt;/h2&gt;

&lt;p&gt;Rendered Scene 1&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bi6xk764tw37fbyz754.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bi6xk764tw37fbyz754.jpg" alt="Rendered Scene 1" width="785" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rendered Scene 2&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rt6aedhfy1ft1wzaqe5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rt6aedhfy1ft1wzaqe5.jpg" alt="Rendered Scene 2" width="784" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These renders took about &lt;strong&gt;40 seconds each&lt;/strong&gt; at 800x600 with 200 samples per pixel and max_depth of 50. On my dinosaur rig. While using only &lt;strong&gt;11.4MB of RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let that sink in for a moment.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Under the Hood
&lt;/h2&gt;

&lt;p&gt;The engine includes everything you'd expect from a modern ray tracer:&lt;/p&gt;
&lt;h3&gt;
  
  
  🎨 Physically Based Materials
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lambertian diffuse surfaces (that red sphere)&lt;/li&gt;
&lt;li&gt;Metals with configurable roughness &lt;/li&gt;
&lt;li&gt;Glass with proper Fresnel reflectance (look at those refractions!)&lt;/li&gt;
&lt;li&gt;Emissive materials for light sources&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🔬 Advanced Math
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full 3D vector operations with reflection/refraction&lt;/li&gt;
&lt;li&gt;4x4 transformation matrices&lt;/li&gt;
&lt;li&gt;Monte Carlo integration for path tracing&lt;/li&gt;
&lt;li&gt;Importance sampling to reduce noise&lt;/li&gt;
&lt;/ul&gt;
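&lt;p&gt;The reflection part of that vector math is short enough to show in full. This is a sketch with assumed names (&lt;code&gt;Vec3&lt;/code&gt;, &lt;code&gt;reflect&lt;/code&gt;), not the engine's actual class: mirror an incident direction about a unit surface normal with r = v - 2·dot(v,n)·n.&lt;/p&gt;

```cpp
// Minimal sketch (assumed names, not the engine's actual Vec3 class) of
// the mirror-reflection formula: r = v - 2 * dot(v, n) * n.
struct Vec3 {
    double x, y, z;
};

double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 operator-(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

Vec3 operator*(double s, const Vec3& v) {
    return {s * v.x, s * v.y, s * v.z};
}

// Reflect incident direction v about the unit normal n.
Vec3 reflect(const Vec3& v, const Vec3& n) {
    return v - 2.0 * dot(v, n) * n;
}
```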
&lt;h3&gt;
  
  
  ⚡ Performance Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-threaded rendering (maxes out all 3 cores)&lt;/li&gt;
&lt;li&gt;AABB bounding volume acceleration&lt;/li&gt;
&lt;li&gt;Efficient memory management&lt;/li&gt;
&lt;li&gt;Lock-free progress tracking&lt;/li&gt;
&lt;/ul&gt;
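&lt;p&gt;For the curious, the AABB acceleration bullet boils down to the classic slab method. Here's a sketch under assumed names (&lt;code&gt;hit_aabb&lt;/code&gt; is mine, not the engine's API): intersect the ray against each axis-aligned slab and keep the running overlap interval.&lt;/p&gt;

```cpp
// Sketch of the slab-method ray/AABB test (illustrative; names are
// assumptions, not the engine's actual API).
#include <algorithm>
#include <utility>

// Intersect the ray with each axis-aligned slab and shrink the overlap
// interval [t_min, t_max]; an empty interval means the ray misses the box.
bool hit_aabb(const double orig[3], const double dir[3],
              const double box_min[3], const double box_max[3],
              double t_min, double t_max) {
    for (int a = 0; a < 3; ++a) {
        double inv = 1.0 / dir[a];   // +/- infinity for axis-parallel rays
        double t0 = (box_min[a] - orig[a]) * inv;
        double t1 = (box_max[a] - orig[a]) * inv;
        if (inv < 0.0) std::swap(t0, t1);   // ray points the other way
        t_min = std::max(t_min, t0);
        t_max = std::min(t_max, t1);
        if (t_max <= t_min) return false;   // slabs no longer overlap
    }
    return true;
}
```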
&lt;h3&gt;
  
  
  🎯 Geometric Primitives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Analytical sphere intersections&lt;/li&gt;
&lt;li&gt;Triangle meshes (the blue cube is made of 12 triangles)&lt;/li&gt;
&lt;li&gt;Automatic normal calculation&lt;/li&gt;
&lt;li&gt;UV coordinate generation&lt;/li&gt;
&lt;/ul&gt;
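&lt;p&gt;The analytical sphere intersection above is just the quadratic formula in disguise. A sketch, with illustrative names rather than the engine's actual signatures: substitute the ray p(t) = origin + t·dir into |p - center|² = r² and solve for t.&lt;/p&gt;

```cpp
// Sketch of an analytical ray-sphere intersection (illustrative names).
// Substituting the ray into |p - center|^2 = r^2 gives a quadratic in t.
#include <cmath>

// Returns the nearest positive hit distance t, or -1.0 on a miss.
double hit_sphere(const double orig[3], const double dir[3],
                  const double center[3], double radius) {
    double oc[3] = {orig[0] - center[0], orig[1] - center[1],
                    orig[2] - center[2]};
    double a = dir[0]*dir[0] + dir[1]*dir[1] + dir[2]*dir[2];
    double half_b = oc[0]*dir[0] + oc[1]*dir[1] + oc[2]*dir[2];
    double c = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - radius * radius;
    double disc = half_b * half_b - a * c;       // quarter discriminant
    if (disc < 0.0) return -1.0;                 // ray misses the sphere
    double t = (-half_b - std::sqrt(disc)) / a;  // nearer root first
    return (t > 0.0) ? t : -1.0;
}
```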
&lt;h2&gt;
  
  
  The Code Philosophy
&lt;/h2&gt;

&lt;p&gt;I kept everything in &lt;strong&gt;one file&lt;/strong&gt;. Yes, one single &lt;code&gt;rayengine.cpp&lt;/code&gt; file. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No dependency hell&lt;/strong&gt; - Download, compile, run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to understand&lt;/strong&gt; - Everything's right there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast compilation&lt;/strong&gt; - Builds in seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum portability&lt;/strong&gt; - Works anywhere C++17 does
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The entire camera ray generation in ~20 lines&lt;/span&gt;
&lt;span class="n"&gt;Ray&lt;/span&gt; &lt;span class="nf"&gt;get_ray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;mt19937&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Vec3&lt;/span&gt; &lt;span class="n"&gt;rd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lens_radius&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;random_in_unit_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Vec3&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;origin&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;lower_left_corner&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;horizontal&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;vertical&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;origin&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Clean, readable, and it just works.&lt;/p&gt;
&lt;h2&gt;
  
  
  Performance That Makes You Think
&lt;/h2&gt;

&lt;p&gt;Here's what really gets me excited about this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 11.4MB vs Blender's 500MB+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies&lt;/strong&gt;: 0 vs thousands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build time&lt;/strong&gt;: 3 seconds vs hours of CMake hell&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: Physically based results that hold up against reference path tracers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you see glass that refracts properly, metals that reflect accurately, and shadows that feel real, all coming from a PC that predates YouTube HD, it makes you question why modern engines are so bloated.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Technical Wins
&lt;/h2&gt;

&lt;p&gt;The threading implementation is something I'm particularly proud of. Instead of fighting with complex synchronization, I just split the image into rows and let each thread work independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simple but effective parallel rendering&lt;/span&gt;
&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emplace_back&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start_row&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;end_row&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;render_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_height&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result? 100% CPU utilization with zero system lag. The progress bar updates smoothly, the mouse still works, and all three cores are sweating to give me those pixels.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Proves
&lt;/h2&gt;

&lt;p&gt;Modern graphics programming has lost its way. We've become so dependent on massive engines and external libraries that we've forgotten the fundamentals still work amazingly well.&lt;/p&gt;

&lt;p&gt;You don't need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ A $1,600 graphics card&lt;/li&gt;
&lt;li&gt;❌ 32GB of RAM
&lt;/li&gt;
&lt;li&gt;❌ Gigabytes of engine downloads&lt;/li&gt;
&lt;li&gt;❌ Complex build systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You just need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Good algorithms&lt;/li&gt;
&lt;li&gt;✅ Clean math&lt;/li&gt;
&lt;li&gt;✅ Efficient code&lt;/li&gt;
&lt;li&gt;✅ Understanding of the physics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The entire engine is open source on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 &lt;a href="https://github.com/LumGenLab/Ray-Engine" rel="noopener noreferrer"&gt;Ray Engine Repository&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building is stupid simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;g++ &lt;span class="nt"&gt;-std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;c++17 &lt;span class="nt"&gt;-O3&lt;/span&gt; &lt;span class="nt"&gt;-m64&lt;/span&gt; &lt;span class="nt"&gt;-flto&lt;/span&gt; &lt;span class="nt"&gt;-pthread&lt;/span&gt; &lt;span class="nt"&gt;-mwindows&lt;/span&gt; &lt;span class="nt"&gt;-static-libgcc&lt;/span&gt; &lt;span class="nt"&gt;-static-libstdc&lt;/span&gt;++ &lt;span class="nt"&gt;-o&lt;/span&gt; ray_engine rayengine.cpp &lt;span class="nt"&gt;-lgdi32&lt;/span&gt; &lt;span class="nt"&gt;-luser32&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No CMake, no vcpkg, no package managers. Just compile and watch the magic happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm thinking about adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BVH acceleration structures for complex scenes&lt;/li&gt;
&lt;li&gt;Volumetric rendering (fog, smoke, god rays)&lt;/li&gt;
&lt;li&gt;Mesh loading (OBJ files)&lt;/li&gt;
&lt;li&gt;Maybe even some GPU compute shaders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But honestly? Part of me wants to see how far I can push this single-file approach. There's something beautiful about keeping it simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges For You
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Write an advanced ray tracer from scratch using only the C++17 standard library, and make it run on my PC, not just on a Core i9.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OR&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Build 3D modelling software from scratch in C++17 using only the standard library: super efficient, ultra powerful, and able to run on that dinosaur PC.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Sometimes the best solution isn't the most complex one. Sometimes it's just good math, clean code, and a refusal to accept that "this is how things are done."&lt;/p&gt;

&lt;p&gt;If a ray tracer can look this good on hardware from 2008, maybe we should stop assuming we need cutting-edge everything to create something beautiful.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Comment: what should I code next?&lt;/strong&gt; I'm thinking either a software rasterizer, a dive into fluid simulation, or even new 3D modelling software. What would you like to see built from scratch? 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with ❤️ and way too much coffee on the world's most patient dinosaur PC&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>pathtracing</category>
      <category>cpp</category>
      <category>rendering</category>
    </item>
  </channel>
</rss>
