Forem: Palash Kanti Kundu

Building a Character-Level Language Model in Rust: From Zero to "Aha!"

Palash Kanti Kundu — Sat, 14 Feb 2026 22:13:06 +0000

We’ve all seen the magic of Large Language Models. You type a prompt, and it finishes your sentence. But beneath the billions of parameters and massive GPU clusters, there is a fundamental mathematical heartbeat: The N-Gram.

Today, we’re going to look under the hood of a neural N-Gram generator built from scratch in Rust. No PyTorch. No hidden abstractions. Just pure logic, traits, and tensors.

1. The Core Idea: "What comes next?"

At its simplest, a language model is just a professional guesser. If I give you the letters r-u-s-, your brain immediately screams t.

Our model does exactly this using a sliding window. We take a word like r-u-s-t, and break it into training pairs:

Context: ... -> Target: r
Context: ..r -> Target: u
Context: .ru -> Target: s

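In our Rust implementation, the size of this context window (the "N" in N-Gram) is stored in a variable called multiplier. For every name, we slide a window of multiplier + 1 characters across the padded text: the first multiplier characters become the input and the last one becomes the target.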

for name in name_list {
   let full_name = format!("{}{}.", pad_str, name);
   let chars_vec: Vec<char> = full_name.chars().collect();

   for window in chars_vec.windows(multiplier as usize + 1) {
      for i in 0..multiplier {
        inputs.push(stoi[&window[i as usize]] as u32);
      }
      targets.push(stoi[&window[multiplier as usize]] as u32);
   }
}
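To see the window in action on a single word, here is a minimal, self-contained sketch. The helper names build_vocab and build_pairs, and the vocabulary layout ('.' at index 0, then a-z), are assumptions made for illustration; they are not taken from the repository.

```rust
use std::collections::HashMap;

/// Assumed vocabulary: '.' at index 0, then 'a'..='z' at 1..=26.
fn build_vocab() -> HashMap<char, u32> {
    let mut stoi = HashMap::new();
    stoi.insert('.', 0);
    for (i, c) in ('a'..='z').enumerate() {
        stoi.insert(c, i as u32 + 1);
    }
    stoi
}

/// Builds (context, target) index pairs for one name using a sliding window.
/// `context_len` plays the role of `multiplier` in the snippet above.
fn build_pairs(
    name: &str,
    context_len: usize,
    stoi: &HashMap<char, u32>,
) -> (Vec<Vec<u32>>, Vec<u32>) {
    // Pad the front with '.' so the first letter has a full context,
    // and append '.' as the end-of-name marker.
    let padded = format!("{}{}.", ".".repeat(context_len), name);
    let chars: Vec<char> = padded.chars().collect();

    let mut contexts = Vec::new();
    let mut targets = Vec::new();
    for window in chars.windows(context_len + 1) {
        // First `context_len` characters are the input, the last is the target.
        let context: Vec<u32> = window[..context_len].iter().map(|c| stoi[c]).collect();
        contexts.push(context);
        targets.push(stoi[&window[context_len]]);
    }
    (contexts, targets)
}

fn main() {
    let stoi = build_vocab();
    let (contexts, targets) = build_pairs("rust", 3, &stoi);
    for (ctx, tgt) in contexts.iter().zip(&targets) {
        println!("{:?} -> {}", ctx, tgt);
    }
}
```

With a context of 3, "rust" yields five pairs: the three listed earlier, plus "rus" -> t and "ust" -> the end marker.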

2. The First "Aha!" Moment: Characters aren't Numbers

Computers can't read the letter 'a'. We have to translate it into math. We use One-Hot Encoding. If our alphabet has 27 characters (a-z and a special "start/end" token), the letter 'a' becomes a vector of length 27 with a 1.0 at index 1 and 0.0 everywhere else.

// A snippet from our one-hot utility
pub fn one_hot_encode<T: Tensor<D>, D: Numeric>(
    labels: &[u32],
    num_classes: u32,
    labels_per_sample: u32,
) -> Result<T, String> {
    let num_samples = (labels.len() as u32) / labels_per_sample;
    let row_width = (num_classes * labels_per_sample) as usize;
    let mut data = vec![D::zero(); (num_samples as usize) * row_width];

    for (i, sample_labels) in labels.chunks(labels_per_sample as usize).enumerate() {
        for (j, &label_idx) in sample_labels.iter().enumerate() {
            if label_idx >= num_classes {
                return Err(format!(
                    "Label index {} exceeds num_classes {}",
                    label_idx, num_classes
                ));
            }
            let index = (i * row_width) + (j * num_classes as usize) + label_idx as usize;
            data[index] = D::one();
        }
    }

    T::new(vec![num_samples, row_width as u32], data)
}

The Insight: By doing this, we turn a linguistics problem into a geometry problem. Every character is now a coordinate in high-dimensional space.
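The utility above depends on the project's Tensor and Numeric traits. As a rough standalone illustration of the same indexing, the sketch below uses a plain Vec<f32>; the function name one_hot_flat is hypothetical and the 27-character vocabulary is the one assumed earlier in this section.

```rust
/// One-hot encodes `labels` into a flat row-major buffer.
/// Each sample holds `labels_per_sample` labels, so each row is
/// `labels_per_sample * num_classes` wide.
fn one_hot_flat(
    labels: &[u32],
    num_classes: usize,
    labels_per_sample: usize,
) -> Result<Vec<f32>, String> {
    if labels_per_sample == 0 || labels.len() % labels_per_sample != 0 {
        return Err("labels must split evenly into samples".to_string());
    }
    let row_width = num_classes * labels_per_sample;
    let num_samples = labels.len() / labels_per_sample;
    let mut data = vec![0.0f32; num_samples * row_width];

    for (i, sample) in labels.chunks(labels_per_sample).enumerate() {
        for (j, &label) in sample.iter().enumerate() {
            let label = label as usize;
            if label >= num_classes {
                return Err(format!(
                    "label {} is out of range for {} classes",
                    label, num_classes
                ));
            }
            // Row i, slot j within the row, position `label` within the slot.
            data[i * row_width + j * num_classes + label] = 1.0;
        }
    }
    Ok(data)
}

fn main() {
    // Context "..a" with '.' = 0 and 'a' = 1 in a 27-character vocabulary.
    let encoded = one_hot_flat(&[0, 0, 1], 27, 3).unwrap();
    let ones: Vec<usize> = encoded
        .iter()
        .enumerate()
        .filter(|(_, v)| **v == 1.0)
        .map(|(i, _)| i)
        .collect();
    println!("row width = {}, ones at {:?}", encoded.len(), ones);
}
```

A three-character context therefore becomes one row of 81 values with exactly three ones in it.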

3. The Rush to find The Approximation

Once we figure out how we can convert language texts (list of names in our case) for a suitable input/target combination for Supervised Learning, we just let the neural network take care of the rest:

nn.fit(&x_train, &y_train, &x_val, &y_val, config, hook_config)?;

4. The "Creative" Moment: Temperature Scaling

After training, our network doesn't output letters; it outputs logits—raw, unnormalized scores for every character in our vocabulary. To turn these scores into a "choice," we need a probability distribution. This is where we introduce Temperature.

Think of Temperature as a "confidence dial." Mathematically, we modify the standard Softmax function by dividing our logits by a temperature T before exponentiating them:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i is the logit for character i.

Low Temperature: The "Safe Bet." Dividing by a small number (T < 1) widens the gaps between scores, so the top candidates pull further ahead. The distribution becomes "peaky," and the model becomes highly confident and conservative. It will likely only generate the most common names from your dataset.
High Temperature: The "Risk Taker." Dividing by a larger number (T > 1) flattens the differences between scores. The distribution moves toward "uniform," making rare character transitions almost as likely as common ones. This is where the model gets "creative," inventing names that feel phonetically plausible but entirely new.

In our Rust implementation, we apply this directly during the generation loop to influence the weights used for random sampling:

// From n_gram.rs: Applying temperature to the raw tensor output
let mut weights: Vec<f64> = data
    .iter()
    .map(|val| (val.f64() / temparature).exp())
    .collect();

The Insight: By simply adjusting a single denominator (T), we shift the model's behavior from a rigid database lookup to a creative linguistic engine. We use this in the generator to find the "sweet spot" where names are fresh and innovative without devolving into unpronounceable gibberish.
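To make the effect of the dial concrete, here is a small standalone sketch that turns made-up logits into probabilities at a few temperatures. The function name softmax_with_temperature and the logit values are illustrative, and subtracting the maximum logit is a common stabilization step that is not part of the original snippet.

```rust
/// Softmax with temperature: p_i = exp(z_i / T) / sum_j exp(z_j / T).
/// Subtracting the maximum logit first is a common numerical-stability step
/// and does not change the resulting probabilities.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    assert!(temperature > 0.0, "temperature must be positive");
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let weights: Vec<f64> = logits
        .iter()
        .map(|z| ((z - max) / temperature).exp())
        .collect();
    let total: f64 = weights.iter().sum();
    weights.iter().map(|w| w / total).collect()
}

fn main() {
    // Made-up scores for three characters.
    let logits = [2.0, 1.0, 0.1];
    for t in [0.5, 1.0, 2.0] {
        println!("T = {}: {:?}", t, softmax_with_temperature(&logits, t));
    }
}
```

At T = 0.5 most of the probability mass sits on the first character, while at T = 2.0 the three options move much closer together.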

5. Dealing with the Noise: Label Smoothing

Neural networks are prone to overconfidence. They want to be 100% sure that 'q' is followed by 'u'. But in a small dataset, this leads to overfitting.

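We implement Label Smoothing. Instead of targeting a probability of 1.0, we scale the target down to 0.9 and spread the remaining 0.1 evenly across the whole vocabulary, so the correct letter ends up just above 0.9 and every other letter gets a small non-zero share. This forces the model to stay "curious" and keeps it from pushing its logits to extreme values in pursuit of absolute certainty.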

// Add Label Smoothing
let epsilon = D::from_f64(0.1); // The "smoothing" factor
let num_classes = D::from_u32(vocab_size);

let y_train_data = y_train.get_data();
let mut smooth_data = vec![];

for val in y_train_data {
  // Standard one-hot is [0, 1, 0]
  // With epsilon = 0.1 and, e.g., 27 classes, each 0 becomes ~0.0037 and the 1 becomes ~0.9037
  smooth_data.push(val * (D::one() - epsilon) + (epsilon / num_classes));
}

let y_train = T::new(y_train.get_shape().to_vec(), smooth_data)?;
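The same formula can be checked in isolation. The sketch below applies it to a single one-hot row using plain f64 values; smooth_labels is a hypothetical helper, and the 27-class row is only an example.

```rust
/// Smooths a one-hot row: y * (1 - epsilon) + epsilon / num_classes.
fn smooth_labels(one_hot: &[f64], epsilon: f64) -> Vec<f64> {
    let num_classes = one_hot.len() as f64;
    one_hot
        .iter()
        .map(|y| y * (1.0 - epsilon) + epsilon / num_classes)
        .collect()
}

fn main() {
    // One-hot row for 'a' (index 1) in an assumed 27-character vocabulary.
    let mut row = vec![0.0; 27];
    row[1] = 1.0;

    let smoothed = smooth_labels(&row, 0.1);
    let total: f64 = smoothed.iter().sum();
    println!(
        "target = {:.4}, others = {:.4}, sum = {:.4}",
        smoothed[1], smoothed[0], total
    );
}
```

Because the 0.1 is shared across every class, the smoothed row still sums to 1, which keeps it a valid target distribution for cross entropy.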

6. The Result: Artificial Life

When you run the generator, you see the "Innovation Rate." Our code checks the generated name against the training set. If the model outputs "Alara" and that name wasn't in the original list, we've successfully taught a machine the concept of a name without it just memorizing a list.

Here are a few interesting ones my machine invented from a list of 1084 Bengali names:

✓ 'manvi ' NEW | Innovation Rate: 100.00%
✓ 'jasha ' NEW | Innovation Rate: 85.11%
✓ 'naru ' NEW | Innovation Rate: 46.08%
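One plausible way to compute such a rate is sketched below. It assumes the rate is simply the share of generated names that do not appear in the training set; the function name innovation_rate and the sample names are hypothetical, and the repository may track the number differently.

```rust
use std::collections::HashSet;

/// Percentage of generated names that are not found in the training set.
/// Assumption: this is what "Innovation Rate" measures.
fn innovation_rate(generated: &[&str], training: &HashSet<&str>) -> f64 {
    if generated.is_empty() {
        return 0.0;
    }
    let new_count = generated
        .iter()
        .filter(|name| !training.contains(*name))
        .count();
    100.0 * new_count as f64 / generated.len() as f64
}

fn main() {
    // Made-up training set and generated names.
    let training: HashSet<&str> = ["asha", "ravi"].iter().copied().collect();
    let generated = ["asha", "manvi", "naru", "ravi"];
    println!(
        "Innovation Rate: {:.2}%",
        innovation_rate(&generated, &training)
    );
}
```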

Where to Find the Whole

Download or clone this repo - https://github.com/Palash90/iron_learn
Build it following the instructions mentioned in the README and run the following command:

target/release/iron_learn -n 5-gram -x n-gram --n-gram-size 5 -d data/names.txt -m 5 -e 20 -l 0.1

Backprop Finally Made Sense When I Reimplemented It in Rust

Palash Kanti Kundu — Mon, 02 Feb 2026 06:15:36 +0000

I never used PyTorch or TensorFlow.

My ML background was NumPy and scikit-learn. I could train models, tune parameters, and get reasonable results. But when it came to explaining why things worked, my understanding was shaky.

Backpropagation especially felt like a black box.

I knew the steps at a high level.
I knew gradients were involved.
I knew the library handled it.

But I didn’t feel it.

So I stopped using ML libraries entirely and rebuilt the core of a neural network from scratch in Rust.

That’s when backprop finally made sense.

Removing the Magic

The problem wasn’t NumPy or scikit-learn. They do exactly what they promise. The problem was that they abstract away everything that actually matters for understanding.

So I removed the abstractions.

No autograd.
No tensor libraries.
No hidden memory layouts.

Just flat buffers, explicit indexing, and matrix operations written by hand.

data = [1, 2, 3, 4, 5, 6]
shape = (2 rows, 3 cols)

Logical view:
[ 1  2  3 ]
[ 4  5  6 ]

Memory view:
[1][2][3][4][5][6]
 0  1  2  3  4  5

Rust forced me to be precise. You can’t “kind of” do a transpose in Rust. You have to explain exactly how indices move in memory. You can’t wave at gradients. You have to compute and store them explicitly.

index = row * cols + col

That constraint changed everything.
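As a small illustration of what "explaining how indices move" means, here is a sketch of a transpose on a flat row-major buffer. It is written for this post and is not the code from the guide.

```rust
/// Transposes a row-major `rows x cols` matrix stored in a flat buffer.
fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols, "buffer does not match shape");
    let mut out = vec![0.0; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // Element (r, c) of the input lands at (c, r) of the output,
            // whose rows are `rows` elements long.
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 rows x 3 cols
    let t = transpose(&data, 2, 3); // 3 rows x 2 cols
    println!("{:?}", t); // prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```

The only thing that changes between the two layouts is which dimension the row index is multiplied by.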

Where Backprop Clicked

Backprop stopped being mysterious when I had to implement it myself.

Not symbolically.
Not as equations on paper.
But as code that moves numbers through memory.

Once you build it manually, you see that backprop is not magic. It’s structured bookkeeping.

You’re doing three things over and over:

applying the chain rule
reusing intermediate values from the forward pass
pushing gradients backward through matrix operations

Forward pass:
X → [ Linear ] → [ Activation ] → ŷ → Loss

Backward pass:
∂Loss → [ dActivation ] → [ dLinear ] → ∂W, ∂X

When you write this by hand, a few things become painfully clear:

gradients don’t “flow” — they are accumulated
shape alignment is the real constraint, not calculus
most bugs come from incorrect assumptions about dimensions, not math
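The shape-alignment point can be seen in a single linear layer. The sketch below assumes a layer Y = X · W without bias and uses the standard results dW = Xᵀ · dY and dX = dY · Wᵀ; the helper names are illustrative and not taken from the guide.

```rust
/// Row-major matrix multiply: (m x k) * (k x n) -> (m x n).
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                out[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    out
}

/// Transposes a row-major `rows x cols` matrix.
fn transpose(a: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = a[r * cols + c];
        }
    }
    out
}

/// Backward pass of Y = X * W, given dL/dY.
/// X is (n x n_in), W is (n_in x n_out), dY is (n x n_out).
/// Returns (dL/dW, dL/dX) with the same shapes as W and X.
fn linear_backward(
    x: &[f32],
    w: &[f32],
    d_y: &[f32],
    n: usize,
    n_in: usize,
    n_out: usize,
) -> (Vec<f32>, Vec<f32>) {
    let x_t = transpose(x, n, n_in); // n_in x n
    let w_t = transpose(w, n_in, n_out); // n_out x n_in
    let d_w = matmul(&x_t, d_y, n_in, n, n_out); // n_in x n_out
    let d_x = matmul(d_y, &w_t, n, n_out, n_in); // n x n_in
    (d_w, d_x)
}

fn main() {
    let x = [1.0, 2.0]; // 1 x 2
    let w = [3.0, 4.0, 5.0, 6.0]; // 2 x 2
    let y = matmul(&x, &w, 1, 2, 2); // forward pass: [13.0, 16.0]
    let d_y = [1.0, 1.0]; // pretend upstream gradient
    let (d_w, d_x) = linear_backward(&x, &w, &d_y, 1, 2, 2);
    println!("y = {:?}, dW = {:?}, dX = {:?}", y, d_w, d_x);
}
```

Each gradient has exactly the shape of the thing it updates, and the transposes are the only way to make the inner dimensions line up.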

        ┌─── w1 ───┐
X ──► (+)         (+) ──► Loss
        └─── w2 ───┘

Backward:
∂Loss/∂X = ∂Loss/∂path1 + ∂Loss/∂path2

Backprop felt hard before because I never saw where the numbers actually lived.

Why Rust Helped

Rust isn’t important because it’s fast here. It’s important because it’s unforgiving.

It forces you to confront:

how tensors are laid out in memory
when data is copied vs reused
which operations allocate new buffers
which gradients depend on which forward values

I avoided third-party crates on purpose and used only the standard library. The goal wasn’t elegance or performance. It was transparency.

If something worked, I wanted to be able to explain why it worked at the level of indices and buffers.

What I Built

Step by step, I built:

a tensor type backed by a flat buffer
element-wise operations
transpose, reduction, and matrix multiplication
linear regression
backpropagation and gradient updates
a small neural network trained end-to-end

Nothing is optimized. Everything is explicit.

This is not a framework.
It’s not production-ready.
It’s a learning tool.

Who This Is For

It is especially suited for:

Software developers who want to understand neural networks beyond high-level APIs
Readers learning Rust who want a demanding, systems-oriented project

If backprop still feels like something you “accept” rather than understand, rebuilding it once is worth the time.

The Full Walkthrough

I documented the entire process as a chapter-style guide, starting from tensors in memory and ending with a working neural network.

You can read the full walkthrough here:
https://ai.palashkantikundu.in

Backprop didn’t become simpler.
It became visible.

And that made all the difference.

Writing a very Simple Terminal Plotter in Rust

Palash Kanti Kundu — Thu, 22 Jan 2026 00:18:44 +0000

In the journey of writing this guide - Machine Learning from First Principles, I set a challenging constraint: zero third-party libraries. This project is about a minimalistic, systems-level understanding—building tensors, matrix operations, and backpropagation in Rust so you can inspect every memory access and gradient step.

But even when building from scratch, you can't fly blind. Visualization is a necessity in ML. To stay true to the "zero dependency" rule, I had to build my own plotting tool using nothing but the Rust standard library and terminal ANSI codes.

The Philosophy: Radical Transparency

Most developers reach for a plotting library immediately. However, when the goal is mastery over production, adding a massive dependency tree feels like a cheat. By building our own plotter, we ensure that the tools we use to verify our math are just as transparent as the math itself.

The End Result

Before jumping into the implementation, here is a glimpse of what it does:

And here is a more complex one:

Defining the Data: The `Trace` Struct

Before we can render a single pixel, we need a way to describe our data. The Trace struct acts as our container for data series, allowing us to toggle between scatter plots and line graphs.

pub struct Trace {
    pub name: String,
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub color: PlotColor,
    pub is_line: bool,
}

This allows us to overlay multiple metrics like "Training Loss" vs. "Validation Loss" using a variety of ANSI-powered colors.

The Rendering Engine: `render_plot`

The heart of the tool is the render_plot function. It constructs an entire coordinate system within a 2D grid of strings.

1. Mapping and Normalization

Since terminal dimensions are fixed (e.g., 80x40), but data values can be anything, we use a map_val helper to scale our floats into grid coordinates.

2. Drawing Lines with "Lerp"

To visualize a continuous function, we can't just plot dots. We implement a linear interpolation algorithm in draw_line to fill the gaps between data points with middle-dot characters (·).

fn draw_line(grid: &mut Vec<Vec<String>>, x0: usize, y0: usize, x1: usize, y1: usize, color: &str) {
    // Number of steps = the longer of the two axis distances.
    let steps = (x1 as i32 - x0 as i32).abs().max((y1 as i32 - y0 as i32).abs());
    for i in 0..=steps {
        // When both endpoints share a cell, steps is 0; avoid computing 0/0 (NaN).
        let t = if steps == 0 { 0.0 } else { i as f32 / steps as f32 };
        let x = (x0 as f32 + (x1 as i32 - x0 as i32) as f32 * t) as usize;
        let y = (y0 as f32 + (y1 as i32 - y0 as i32) as f32 * t) as usize;
        // ... grid boundary check and coloring ...
    }
}

3. UI Polish: Title and Spacing

To make the output readable during fast training loops, the plotter includes:

Buffer Gaps: Two empty lines at the top to separate the plot from previous terminal output.
Centered Titles: A bold, Cyan-colored title centered horizontally based on the plot width.

// Atomic Buffer Print with UI polish
buffer.push_str("\n\n"); // The Gap
let padding = width.saturating_sub(title.len()) / 2; // no underflow if the title is wider than the plot
buffer.push_str(&format!("{}\x1b[1;36m{}\x1b[0m\n\n", " ".repeat(padding), title.to_uppercase()));

Why This Matters

This isn't about making the terminal look pretty; it's about ownership. When you build the plotter yourself:

You understand the coordinate system. You aren't guessing how your data is scaled.
Just pure dopamine

If you want to use it

Here is what I came up with:

use std::f32;

pub struct Trace {
    pub name: String,
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub color: PlotColor,
    pub is_line: bool,
}

#[derive(Debug, Clone, Copy)]
pub enum PlotColor {
    Red,
    Blue,
    Green,
    Cyan,
    Magenta,
    Yellow,
    White,
    Reset,
}

impl PlotColor {
    pub fn to_ansi(&self) -> &'static str {
        match self {
            PlotColor::Red => "\x1b[31m",
            PlotColor::Blue => "\x1b[34m",
            PlotColor::Green => "\x1b[32m",
            PlotColor::Cyan => "\x1b[36m",
            PlotColor::Magenta => "\x1b[35m",
            PlotColor::Yellow => "\x1b[33m",
            PlotColor::White => "\x1b[37m",
            PlotColor::Reset => "\x1b[0m",
        }
    }
}

pub fn render_plot(
    traces: &[Trace],
    width: usize,
    height: usize,
    fixed_bounds: Option<(f32, f32, f32, f32)>,
    title: String,
) {
    let (min_x, max_x, min_y, max_y) = match fixed_bounds {
        Some(bounds) => bounds,
        None => get_bounds(traces),
    };

    let margin_l = 10;
    let margin_b = 2;
    let plot_w = width - margin_l - 2;
    let plot_h = height - margin_b - 2;

    let y_tick_count = 5;
    let x_tick_count = 4;

    let mut grid = vec![vec![" ".to_string(); width]; height];

    for i in 0..=y_tick_count {
        let t = i as f32 / y_tick_count as f32;
        let py = map_val(t, 0.0, 1.0, plot_h as f32, 0.0) as usize;
        let val = map_val(t, 0.0, 1.0, min_y, max_y);

        grid[py][margin_l] = "┼".to_string();

        let label = format!("{:>9.1}", val);
        for (idx, c) in label.chars().enumerate() {
            if idx < margin_l {
                grid[py][idx] = c.to_string();
            }
        }
    }

    for i in 0..=x_tick_count {
        let t = i as f32 / x_tick_count as f32;
        let px = map_val(t, 0.0, 1.0, 0.0, plot_w as f32) as usize + margin_l + 1;
        let val = map_val(t, 0.0, 1.0, min_x, max_x);

        if px < width {
            grid[plot_h][px] = "┴".to_string();

            let label = format!("{:.1}", val);
            for (idx, c) in label.chars().enumerate() {
                if px + idx < width {
                    grid[plot_h + 1][px + idx] = c.to_string();
                }
            }
        }
    }

    for y in 0..plot_h {
        if grid[y][margin_l] == " " {
            grid[y][margin_l] = "│".to_string();
        }
    }
    for x in margin_l + 1..width {
        if grid[plot_h][x] == " " {
            grid[plot_h][x] = "─".to_string();
        }
    }
    grid[plot_h][margin_l] = "└".to_string();

    for trace in traces {
        let color_code = trace.color.to_ansi();
        for i in 0..trace.x.len() {
            let px = map_val(trace.x[i], min_x, max_x, 0.0, plot_w as f32) as usize + margin_l + 1;
            let py = map_val(trace.y[i], min_y, max_y, plot_h as f32 - 1.0, 0.0) as usize;

            if py < plot_h && px > margin_l && px < width {
                if trace.is_line && i > 0 {
                    let prev_px = map_val(trace.x[i - 1], min_x, max_x, 0.0, plot_w as f32)
                        as usize
                        + margin_l
                        + 1;
                    let prev_py =
                        map_val(trace.y[i - 1], min_y, max_y, plot_h as f32 - 1.0, 0.0) as usize;
                    draw_line(&mut grid, prev_px, prev_py, px, py, color_code);
                }
                grid[py][px] = format!("{}●\x1b[0m", color_code);
            }
        }
    }

    let mut buffer = String::new();
    buffer.push_str("\x1b[2J\x1b[H\x1b[?25l");

    buffer.push_str("\n\n");
    let title_len = title.len();
    if title_len < width {
        let padding = (width - title_len) / 2;
        buffer.push_str(&" ".repeat(padding));
    }
    buffer.push_str(&format!("\x1b[1;36m{}\x1b[0m\n\n", title.to_uppercase()));

    for row in grid {
        buffer.push_str(&row.concat());
        buffer.push('\n');
    }

    buffer.push('\n');
    for t in traces {
        buffer.push_str(&format!(
            "{} {} {} \x1b[0m  ",
            t.color.to_ansi(),
            if t.is_line { "──" } else { "●" },
            t.name
        ));
    }
    print!("{}", buffer);
    println!("\x1b[?25h");
}

fn draw_line(grid: &mut Vec<Vec<String>>, x0: usize, y0: usize, x1: usize, y1: usize, color: &str) {
    let steps = (x1 as i32 - x0 as i32)
        .abs()
        .max((y1 as i32 - y0 as i32).abs());
    for i in 0..=steps {
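        // When both endpoints share a cell, steps is 0; avoid computing 0/0 (NaN).
        let t = if steps == 0 { 0.0 } else { i as f32 / steps as f32 };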
        let x = (x0 as f32 + (x1 as i32 - x0 as i32) as f32 * t) as usize;
        let y = (y0 as f32 + (y1 as i32 - y0 as i32) as f32 * t) as usize;
        if y < grid.len() && x < grid[0].len() {
            grid[y][x] = format!("{}·\x1b[0m", color);
        }
    }
}

fn get_bounds(traces: &[Trace]) -> (f32, f32, f32, f32) {
    let all_x: Vec<f32> = traces.iter().flat_map(|t| t.x.iter()).cloned().collect();
    let all_y: Vec<f32> = traces.iter().flat_map(|t| t.y.iter()).cloned().collect();
    (
        *all_x
            .iter()
            .min_by(|a, b| a.partial_cmp(b).unwrap())
            .unwrap(),
        *all_x
            .iter()
            .max_by(|a, b| a.partial_cmp(b).unwrap())
            .unwrap(),
        *all_y
            .iter()
            .min_by(|a, b| a.partial_cmp(b).unwrap())
            .unwrap(),
        *all_y
            .iter()
            .max_by(|a, b| a.partial_cmp(b).unwrap())
            .unwrap(),
    )
}

fn map_val(val: f32, in_min: f32, in_max: f32, out_min: f32, out_max: f32) -> f32 {
    if (in_max - in_min).abs() < 1e-6 {
        return out_min;
    }
    (val - in_min) * (out_max - out_min) / (in_max - in_min) + out_min
}

pub fn main() {
    let mut traces = vec![];

    let t = Trace {
        name: "Predict 1".to_string(),
        x: vec![-5.0, -4.0, -3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        y: vec![25.0, 16.0, 9.0, 1.0, 0.0, 1.0, 4.0, 9.0, 16.0, 25.0],
        color: PlotColor::Cyan,
        is_line: true,
    };

    traces.push(t);

    render_plot(&traces, 70, 25, None, String::from("Parabola"));
}

It's not perfect and it's not highly optimized, but it works out of the box: you don't need to spend a whole day setting things up just to get through one chapter. You own it. If something bothers you, a few println!() statements are enough to track it down.

That's it for now. We'll meet again soon, when I build something else.

Teaching my computer to invent names

Palash Kanti Kundu — Wed, 07 Jan 2026 08:20:52 +0000

After experimenting with teaching my computer to draw black‑and‑white line art, I wondered: can it also learn words?

So I grabbed a list of 500 Indian names, built a 5‑gram vector, and fed it into a vanilla neural network (written in Rust). Here is the network configuration:

║ Name: five-gram
║ Parameter Count: 758056
║ Hidden Layers: 4
║ Vocabulary Size: 25
║ Loss Function: Categorical Cross Entropy
║ Optimizer: Gradient Descent

Fifteen minutes later, it started generating brand‑new names that felt surprisingly authentic.

Some of my favorites: Yaman, Samanya, Samika, Praman, Sakhi, Debika, Mazhar, Maera, Narayani, Manyashree, Adhya, Manpreet, Jameera, Kash, Kaya, Nidhi…

It’s fascinating to watch creativity emerge from code.

The Grand Finale: Image Reconstruction Network from Scratch in Rust

Palash Kanti Kundu — Sun, 04 Jan 2026 03:57:00 +0000

The next day, I checked and found that things had started evolving properly, albeit slowly. There was no improvement in execution time; it stayed stuck in the 34 to 37 second range. My gut feeling told me the destination was not far. However, when I opened my IDE for debugging, it did not look that way. The code was a mess again after all the new amendments and experiments, and it was getting tougher to debug without a clean code base.

So I decided on an end-to-end refactoring and code cleanup.

Oh boy!!!

The Breakthrough in the Boredom

I found my answer amidst the boredom of refactoring. Suddenly it crossed my mind that 64-bit floating point numbers (f64) might be the bottleneck in my Tensor library. To test my theory, I went back to the Python script, switched every array from float32 to float64, and ran it. This time it took more than 50 seconds, far more than my Rust Tensor program.

Yes, I cracked it. It was not Rust, Python, cupy or the GPU that was making the difference in execution time; it was the data type. The holy grail of low-level programming.

I switched back to my Rust program to fix the remaining clean up, refactoring and build issues.

The Rust Type System

Well, I had found the issue, and now I had to implement the fix in Rust. I was in the middle of the refactoring. I had already separated the network layer, the builder and the loss functions. While doing so, I struggled with generics issues, but I managed to work around them.

The final blow came when I realized that the neural network I had built relies heavily on floating point mathematics. Gradient descent, the learning rate, scaling and literally every other math operation work on floating point numbers. I had a vague idea of how to solve the problem: do all the math in floating point and then round off the final result to an integer.

Right then, out of curiosity, I started reading about how integer neural networks work. My idea was to convert to float, do the calculation and round off. However, I came across another idea: Quantization. I did not bother to look into it. I will get to know it in due course if I ever cross its path.

At that point, my main goal was to fix the build issues and generics issues and make my program run again. And probably to use f32 instead of f64.

The Illegal Access

I made all the fixes and built the code. The build was successful and I started running the program. Execution passed the 30-second mark and my hopes started building up, and just then the tensor program failed with IllegalMemoryAddress and the forward pass resulted in NaN once again.

The Great Demotivation

That runtime error pushed me past my threshold. It was enough. I had spent enough time and resources trying to fix everything. I had already achieved what I wanted this project to give me. I learnt Rust, fought with the compiler, worked with FFI, invoked an external device, wrote CUDA programs, learnt Machine Learning and Deep Learning, successfully wrote a neural network, and learnt about SIREN. This whole experience molded me into a better Rustacean and a more knowledgeable ML hobbyist than I was 2 years ago.

I accepted the fault in my plan: Iron Learn was too ambitious a goal to pursue, and single-handedly at that. I thought it would be better to shut it down at that point.

I was ready to accept: 'No more Iron Learn'.

I made a "Sunset" plan and visited Google Graveyard to console myself "Look, Google failed too, that big giant made a mistake too and most of them are older than your 2 year project. Don't mourn, don't take it to your heart and just do it".

I came back with a heavy heart, consulted Gemini and ChatGPT to prepare a mourning speech that I would post in the README.md, and called it a day.

The Drama King

The next morning, I opened the IDE to write the eulogy. I started playing "Yaariyaan Reprise from Cocktail", one of my favourite emotional songs, and shed a few tears. I know it's nothing to the outside world, but Iron Learn was one of the finest projects I have ever worked on in my 18 years of acquaintance with computers.

Again, don't judge me. At times I do things that make no sense at all, or that make the most sense in ways I don't understand at all.

In Bengali, we have a popular proverb - রাখে হরি মারে কে| (If god protects, no one can destroy). That is exactly what happened to Iron Learn.

I consulted my AI friends and came up with a eulogy for Iron Learn. I started writing the first line.

The same inner voice told me, "Why not give it a last spin?"

I ignored it and started writing the second line. "You are being unreasonable here."

Again I ignored the inner voice and started the third line. "I promise you, if it is not solved within 1 hour, I will let you complete the eulogy."

Ok fine. Let's take the final spin.

Defying the Flatline

I cleared everything I had written, discarded the changes in the README.md and started hunting for the root cause of the memory error. It was right there in the kernel programs. I had switched everything to f32 in the Rust program, but my kernels were still referring to double, a change I had made back in the days of Logistic Regression when everything I wrote was in f64.

Basically, for every allocated block of bytes, the kernels were trying to read twice as much (8 bytes per double against 4 bytes per f32) and ran into protected memory space. The GPU version of a segfault.
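The arithmetic behind that overrun is easy to check on the host side. The sketch below only compares byte counts, assuming the host buffer holds f32 values while the kernel indexes the same number of elements as double; the function names are made up for illustration.

```rust
use std::mem::size_of;

/// Bytes the host allocates for `n` f32 values.
fn allocated_bytes(n: usize) -> usize {
    n * size_of::<f32>()
}

/// Bytes a kernel touches if it treats the same `n` elements as f64 (`double`).
fn bytes_read_as_double(n: usize) -> usize {
    n * size_of::<f64>()
}

fn main() {
    let n = 1_000;
    println!(
        "allocated: {} bytes, read as double: {} bytes",
        allocated_bytes(n),
        bytes_read_as_double(n)
    );
}
```

Everything past the first half of those reads falls outside the allocation, which is the kind of access the illegal-address error reports.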

After the fix, things started running magically again, and I had fixed it in under 45 minutes. My Iron friend came back from its coma. It was alive again.

The NaN Fix

After fixing the memory access, I still had the second problem to tackle.

This one was actually easy and somewhat familiar to me. I did not have a clip function in either my CPU or GPU tensors, and at that time I was using the Binary Cross Entropy function, which can produce NaN because of the log it uses: the log of zero is negative infinity and the log of a negative number is NaN, and either one poisons the loss.

I implemented the clip:

let result = data
            .iter()
            .map(|a| {
                if *a < min {
                    min
                } else if *a > max {
                    max
                } else {
                    *a
                }
            })
            .collect();

extern "C" __global__ void clip(const float *s, float *r, int n, float min, float max)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        if (s[idx] < min)
        {
            r[idx] = min;
        }
        else if (s[idx] > max)
        {
            r[idx] = max;
        }
        else
        {
            r[idx] = s[idx];
        }
    }
}

The NaN also vanished just like that. Two critical fixes within 90 minutes.
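To show why clipping removes the NaN, here is a small standalone sketch of Binary Cross Entropy with clipped predictions. The function name, the eps value and the sample numbers are assumptions for illustration; it uses the standard library's f32::clamp instead of the hand-written clip above.

```rust
/// Binary cross-entropy averaged over samples, with predictions clipped
/// into [eps, 1 - eps] so that ln() never receives exactly 0.
fn binary_cross_entropy(targets: &[f32], predictions: &[f32], eps: f32) -> f32 {
    assert_eq!(targets.len(), predictions.len());
    let total: f32 = targets
        .iter()
        .zip(predictions)
        .map(|(&y, &p)| {
            let p = p.clamp(eps, 1.0 - eps);
            -(y * p.ln() + (1.0 - y) * (1.0 - p).ln())
        })
        .sum();
    total / targets.len() as f32
}

fn main() {
    // A fully saturated prediction: target 1, prediction exactly 1.
    // Without clipping, the second term is 0 * ln(0) = 0 * -inf = NaN.
    let unclipped = -(1.0f32 * 1.0f32.ln() + 0.0f32 * 0.0f32.ln());
    println!("unclipped term is NaN: {}", unclipped.is_nan());

    let loss = binary_cross_entropy(&[1.0, 0.0], &[1.0, 0.0], 1e-6);
    println!("clipped loss: {}", loss);
}
```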

The Void

On the next run of the program, I could complete the network training and generate a few images. At first everything looked just the same: a fully white rectangle, indicating that something was off.

I added some logs, and they pointed to my old friend, the 'Normalization/Denormalization' pair. This time, a lack of normalization was the issue. Once it was reintroduced, things went smoothly.

The 5 Hours of Suspense

Finally, Iron Learn was ready. I took it for a spin. I spent another 5 hours training the network, this time with more layers (7 layers). I built a 99k parameter image model, and here are the results:

The Static the Network Started With

The Final 200×200 Reconstruction

The Timeline

It is satisfying to watch a neural net learn and here is how Simba Network learnt:

The Reconstruction on 512×512

I did not stop at reconstruction; I tried different sizes too. Here is a 512×512 reconstruction.

Updated Inventory Check (One Last Time):

Rust: Tamed.
GPU Kernels: Synced.
Simba: Reconstructed.
Black Box: Demystified.

Part 9: Generating Simba Network with Rust

Palash Kanti Kundu — Thu, 01 Jan 2026 23:56:30 +0000

Generating Simba Network with Rust

After successfully approximating the lottery math function (check the previous post for context), I decided to challenge the network more. There are so many functions to choose from. Mathematics has evolved a lot since its inception and, honestly, I know very little about most of it. Proving them would be another chore for me. I needed something interesting, something exciting. Then it struck me: an image is nothing but a 2D plot of an arbitrary and arguably very complex function.

What if I could ask the network to approximate it?

I found a Lion Cub drawing in Black and White and used the following encoder/decoder script to encode/decode pixels into numbers and vice versa: Image to CSV Encoder Decoder

Once done, I again launched my Python script to feed the generated CSV data to the neural network.

Helping Machine to Draw

The script struggled at a lot of points, and I had to fix those to help the machine learn how to draw Simba.

The Large Image Issue

The original downloaded image was fairly large at 474×474 pixels, and it was taking very long to train. To avoid this issue, I had to resize it to 200×200 pixels.

The Machine Crash and Restart

I had occasional machine crashes due to overheating and every time that happened training started from scratch. This was a huge waste of time and resources. So, I added a save/load mechanism to resume the learning from the last saved checkpoint. The machine can now take a pre-saved model and can start from there.

The Error Oscillation

No matter how small a learning rate I chose, training kept getting stuck in an error oscillation loop. At one point it occurred to me that a very small learning rate like 0.000001 could rescue the training, but it would definitely take much longer. Then I thought: what if I gradually decreased the learning rate programmatically? I did some research and found out about Cosine Annealing. I applied the cosine learning rate decay function and training started to progress smoothly.

decay_factor = 0.5 * (1 + math.cos(math.pi * i / (epochs+epoch_offset)))
current_lr = lr_min + (lr_max - lr_min) * decay_factor
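For readers following along in Rust, the same schedule can be sketched as below. The function name cosine_annealed_lr and the sample values are illustrative; total_steps stands in for epochs + epoch_offset from the Python lines above.

```rust
use std::f64::consts::PI;

/// Cosine-annealed learning rate at step `i` out of `total_steps`.
/// Starts at `lr_max` (i = 0) and decays smoothly to `lr_min` (i = total_steps).
fn cosine_annealed_lr(i: usize, total_steps: usize, lr_min: f64, lr_max: f64) -> f64 {
    assert!(total_steps > 0, "total_steps must be positive");
    let progress = i as f64 / total_steps as f64;
    let decay_factor = 0.5 * (1.0 + (PI * progress).cos());
    lr_min + (lr_max - lr_min) * decay_factor
}

fn main() {
    // Made-up bounds for the learning rate.
    let (lr_min, lr_max) = (0.001, 0.1);
    for i in [0, 25, 50, 75, 100] {
        println!("step {:>3}: lr = {:.5}", i, cosine_annealed_lr(i, 100, lr_min, lr_max));
    }
}
```

The rate changes slowly at the start and end and fastest in the middle, which is what lets the large early steps give way to the small late ones without a sudden drop.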

The Result: Art through Math

After fixing all the issues and running the network for almost 1.2 million iterations (around 4 hours on my machine), I could see generated output very close to the input image. The input was a very complex high-dimensional function, far beyond the simple XOR gate test set or even the logistic regression dataset. This was a practical demonstration of the Universal Approximation Theorem: the idea that a neural network can represent almost any continuous function. I can now use this network to do bigger things.

For comparison, here are the results:

Original Image

The Initial Static

Final Image on 200x200

Reconstruction on Higher Resolutions

At that point, I was pretty sure the network had learnt the underlying function. With that confidence and the saved weights, I tested it against blank canvases of different sizes, such as 512, 50 and 1024. It drew the image on every one of them.

The following is the same image reconstructed at 1024x1024 resolution:

Since the network had learned the mathematical concept of the lines rather than just the pixels, the 1024x1024 version didn't look 'pixelated' like a standard zoom—it looked like the network was redrawing its own masterpiece on a bigger canvas.
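Drawing on a different canvas size comes down to asking the trained network about a different set of points. The sketch below assumes the network takes a normalized (x, y) coordinate in [0, 1] and returns a pixel intensity; that input format and the function name are assumptions for illustration.

```rust
/// Builds one (x, y) input pair per pixel, normalized to [0, 1],
/// in row-major order (left to right, top to bottom).
pub fn coordinate_grid(width: usize, height: usize) -> Vec<[f32; 2]> {
    // Avoid dividing by zero for a canvas that is one pixel wide or tall.
    let x_div = width.saturating_sub(1).max(1) as f32;
    let y_div = height.saturating_sub(1).max(1) as f32;

    let mut coords = Vec::with_capacity(width * height);
    for row in 0..height {
        for col in 0..width {
            coords.push([col as f32 / x_div, row as f32 / y_div]);
        }
    }
    coords
}
```

Because the coordinates always span the same [0, 1] range, a 50x50 grid and a 1024x1024 grid sample the same learned function, only at a different density.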

Validation and Next Steps

I basically made a highly complex, inefficient, uber expensive image scaler. The result was satisfying but not perfect. It proved the point but I needed perfection.

I posted the resultant image on Reddit and another redditor commented about SIREN or SInusoidal REpresentation Networks.

SIREN is a neural network that uses the sine function as its activation instead of ReLU or Tanh. It is mostly used for Implicit Neural Representation, a technique very similar to what I was trying to achieve, and it is reported to be more effective than other neural networks at representing images, audio and similar signals.

I implemented a SIREN in Python, hoping to reconstruct a more perfect image. But my efforts were in vain; it did not work.

While my network used Sigmoid, SIREN uses sine waves, which are naturally better at capturing the 'sharp edges' of a drawing. I eventually moved on from SIREN after a few failed attempts to pursue something else, but the experience changed how I looked at the 'frequency' of my data.

In my next attempt, I actually achieved a sharper, more detailed reconstruction.

The Rust Comeback

The idea I implemented in Python showed some fruitful results. The success of the Python script rejuvenated me. I was ready to take the next challenge. I braced myself to pour some energy into the Rust program. I would have missed the adventure and the learning if I did not come back to Rust.

The journey, which had been paused for a few days, resumed. I put on the Rustacean hat, and it took me another week to put everything in place:

I wrote a separate Tensor trait and put all the defined methods in it
I wrote a CpuTensor struct and the Tensor trait implementation for it
I wrote a GpuTensor struct and the Tensor trait implementation for it

The Initial Shock

I was expecting my Rust program to work seamlessly out of the box. Then came the next shock: the GPU tensor took 90+ seconds to run the same network that my Python program ran in only 8.

Another challenge to solve. Another debugging session.

I tried to find the reason. The Rust code showed nothing, except that every Tensor operation was taking a long time to compute. I was very surprised. I doubted my CUDA Kernel programs and used nsys profiler (the NVIDIA Nsight Systems profiler).

The result surprised me: the most time-consuming part of my application was not the CUDA kernels but memory allocation and deallocation.

Then I compared it with cupy. It also needs memory to perform its operations, so how is it so fast?

The answer lay in the memory pool. cupy does not depend on the default memory allocator; it uses its own memory pool.

The logical next step was to look for a memory pool, but I found no direct memory pool implementation in the cust library. I first thought of implementing my own, but that would be very painful and error-prone. I kept looking and finally found a solution: the cust library might not have a memory pool wrapper, but the CUDA library does have memory pools, and cust re-exports those bindings under sys. It was a huge relief. Still, it was far from easy to implement.

I tried incorporating the memory pool. I had to do a lot of consultation with the documentation to finally make it work.

Here is a snippet of the code:

pub fn get_mem_pool() -> CudaMemoryPool {
        let device = Device::get_device(0).unwrap();

        // Create a memory pool for the device
        let mut pool = std::ptr::null_mut();
        let pool_props = CUmemPoolProps {
            allocType: cust::sys::CUmemAllocationType::CU_MEM_ALLOCATION_TYPE_PINNED,
            handleTypes: cust::sys::CUmemAllocationHandleType::CU_MEM_HANDLE_TYPE_NONE,
            location: cust::sys::CUmemLocation {
                type_: cust::sys::CUmemLocationType_enum::CU_MEM_LOCATION_TYPE_DEVICE,
                id: 0,
            },
            win32SecurityAttributes: std::ptr::null_mut(),
            reserved: [0u8; 64],
        };

        unsafe { cuMemPoolCreate(&mut pool, &pool_props) };

        let reserve_size: usize = 2048 * 1024 * 1024;
        let mut reserve_ptr: CUdeviceptr = 0;
        unsafe {
            // This is often a synchronous call initially, but it gets the memory from the driver
            // and makes it available to the pool.
            cuMemAllocFromPoolAsync(
                &mut reserve_ptr,
                reserve_size,
                pool,
                std::ptr::null_mut(), // Null stream is okay for one-time setup
            );
            // You MUST synchronize the null stream here to ensure memory is available
            cuStreamSynchronize(std::ptr::null_mut());

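            // Now free it back to the pool immediately for reuse.
            // Note: the pool's release threshold defaults to 0, so the driver may hand
            // this memory back to the OS at the next synchronize unless
            // CU_MEMPOOL_ATTR_RELEASE_THRESHOLD is raised with cuMemPoolSetAttribute.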
            cuMemFreeAsync(reserve_ptr, std::ptr::null_mut());
            cuStreamSynchronize(std::ptr::null_mut());
        }

        println!("Memory pool created for device {}", device.name().unwrap());

        CudaMemoryPool {
            pool: Arc::new(Mutex::new(UnsafeCudaMemPoolHandle(pool))),
        }
}

Once it was done in a main program outside of my Tensor, I could see thousands of memory blocks allocated and deallocated in milliseconds. Of course, it had an initial price to pay for the Memory Pool creation but in most cases, it would be a one time setup cost.
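Why a pool makes repeated allocation so cheap can be illustrated on the host side. The sketch below is a toy free-list pool for Vec<f32> buffers, not how the CUDA pool is implemented; the type and its counters are hypothetical.

```rust
use std::collections::HashMap;

/// A toy host-side pool: released buffers are cached by length and reused
/// instead of going back to the system allocator.
pub struct BufferPool {
    free: HashMap<usize, Vec<Vec<f32>>>,
    pub fresh_allocations: usize,
    pub reuses: usize,
}

impl BufferPool {
    pub fn new() -> Self {
        Self {
            free: HashMap::new(),
            fresh_allocations: 0,
            reuses: 0,
        }
    }

    /// Returns a zeroed buffer of `len` elements, reusing a cached one if possible.
    pub fn allocate(&mut self, len: usize) -> Vec<f32> {
        if let Some(mut buf) = self.free.get_mut(&len).and_then(|bucket| bucket.pop()) {
            buf.iter_mut().for_each(|v| *v = 0.0);
            self.reuses += 1;
            buf
        } else {
            self.fresh_allocations += 1;
            vec![0.0; len]
        }
    }

    /// Hands the buffer back to the pool instead of dropping it.
    pub fn release(&mut self, buf: Vec<f32>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}
```

In a training loop that allocates and releases same-sized buffers every epoch, only the first request reaches the underlying allocator; every later one is served from the cache.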

Integration in GpuTensor

I took the code and put it inside my library.

At that point, I was wise enough to not believe it would work on the first try. Unsurprisingly, when I incorporated the code in my Tensor library, it did not work. It was still taking 90+ seconds. I suspected the lifetime of the pool. So I decided to keep the memory pool alive just like I did for context.

Copy, Paste...

And...

Compiler Error.

To solve the issue, I had to use Arc<Mutex<>>. And the memory pool was kept alive throughout the application uptime. However, this did not solve the problem. The actual issue was somewhat deeply nested in the CUDA wrappers. The wrappers themselves were using the default allocator and not the pool.

And I opened another Pandora's Box.

I had to deal with raw pointers. It was the first time I had actually worked with raw pointers in Rust. It was a very scary experience, but I survived it and wrote a custom device buffer to link Host Memory and Device Memory:

impl<T: Numeric + DeviceCopy> Drop for CustomDeviceBuffer<T> {
    fn drop(&mut self) {
        let pool = match &GPU_CONTEXT.get().expect("No GPU Context Set").pool {
            Some(p) => p,
            None => panic!("Cuda not initialized or Gpu Pool is not set up"),
        };

        let _ = pool.free(self.as_device_ptr().as_raw());
    }
}

pub fn get_device_buffer<T: Numeric + DeviceCopy>(size: usize) -> CustomDeviceBuffer<T> {
    let pool = match &GPU_CONTEXT.get().expect("No GPU Context Set").pool {
        Some(p) => p,
        None => panic!("Cuda not initialized or Gpu Pool is not set up"),
    };

    let ptr_size = size.checked_mul(size_of::<T>()).unwrap();

    if ptr_size == 0 {
        panic!("Attempted a zero size pointer or null pointer creation.");
    }

    let pool_allocated_pointer = pool.allocate(ptr_size).unwrap();
    let device_pointer = DevicePointer::from_raw(pool_allocated_pointer);

    let device_buffer = unsafe { DeviceBuffer::from_raw_parts(device_pointer, size) };

    CustomDeviceBuffer { device_buffer }
}

The Raw Pointer Size Issue

Working with raw pointers gave me low-level memory control and a fast implementation, but I was thrashed by a pointer size mismatch. I dug into the code of the cust wrappers and found the issue: I was allocating for the length of the array but had conveniently forgotten to account for the byte width of the data type.

Essentially, for an f32 array of length 1, I was allocating 1 byte of GPU memory rather than 4. I fixed the mismatch with high hopes that this version would work. It worked, but the execution time did not go down.

I was truly frustrated, and the thought of abandoning the project crept up. I called it a day.

The Hidden Bug

The next morning, with a fresh mind, I started debugging with nsys. The problem was very hard to pin down. After a whole day's worth of effort, I finally found the hidden issue: Tensor::ones. To calculate Sigmoid, the network needed a tensor of 1s. To achieve this, my implementation was creating a Vec<f32> of 1s and transferring it to the device, making an H2D copy at every epoch.

I wrote a new kernel to initialize GPU memory with a provided value:

extern "C" __global__ void fill_value(float *out, int n, float value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        out[idx] = value;
    }
}

This solved the H2D copy issue and brought the execution time down to 54 seconds, still far from 8 seconds.

Another Costly Operation

I kept looking for the next issue and again took help from the nsys profiler. This time it showed D2H copies. I found that the sum reduction function (similar to np.sum()) was behind them. As sum is an aggregate function and the GPU works on a thread-based execution principle, I had initially decided to do this calculation on the CPU, but that backfired heavily. This function gets called at each epoch for the loss calculation, so saving even a few milliseconds per call would save seconds of training over 1000 epochs.

So I wrote a column based reducer instead:

extern "C" __global__ void column_reduce(const float *inputMatrix, float *outputSums, int numRows, int numCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (col < numCols)
    {
        float sum = 0.0f;

        for (int row = 0; row < numRows; ++row)
        {
            sum += inputMatrix[row * numCols + col];
        }

        outputSums[col] = sum;
    }
}

This helped somewhat: the change brought the time down from 54 seconds to 45 seconds.
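A CPU reference for the same reduction is handy for checking the kernel's output on small inputs. This is a minimal sketch that mirrors the kernel's indexing; the function name is illustrative.

```rust
/// Sums each column of a row-major `num_rows x num_cols` matrix,
/// using the same `row * num_cols + col` indexing as the kernel.
pub fn column_reduce_cpu(input: &[f32], num_rows: usize, num_cols: usize) -> Vec<f32> {
    assert_eq!(
        input.len(),
        num_rows * num_cols,
        "shape does not match data length"
    );
    let mut sums = vec![0.0f32; num_cols];
    for row in 0..num_rows {
        for col in 0..num_cols {
            sums[col] += input[row * num_cols + col];
        }
    }
    sums
}
```

The kernel produces one partial sum per column, so getting a single scalar as np.sum() does still needs one more pass over those numCols values.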

The Matrix Multiplication Refactoring

I never stopped taking help from nsys. This time it indicated a cuLaunch performance issue. The hand-written kernels were making a mess, and the usual suspect was the matrix multiplication kernel. I replaced the tiled matrix multiplication routine with the older thread-based, non-tiled routine. That dropped the time from ~45 seconds to ~34 seconds, still not hitting the target. I changed some other kernels too, without much effect, so I reverted all of these and settled for whatever I had prior to those changes.

However, the lag was real and I could not sit quietly on it. I was getting closer. So I tried to use cuBLAS. The cust library does not support cuBLAS. I had to resort to cublas-sys to get cuBLAS working. The road was not smooth. However, after 2 hours of hiccups, I finally managed to integrate cuBLAS.

The following snippet got the job done:

fn _gpu_mul_cublas(&self, rhs: &Self) -> Result<Self, String> {
        let m = self.shape[0] as i32;
        let k = self.shape[1] as i32;
        let n = rhs.shape[1] as i32;

        let total_elements = (m * n) as usize;
        let result = get_device_buffer(total_elements);

        let alpha = T::one();
        let beta = T::zero();

        unsafe {
            cublasSgemm_v2(
                Self::_get_cublas_handle(),
                cublasOperation_t::CUBLAS_OP_N,
                cublasOperation_t::CUBLAS_OP_N,
                n,
                m,
                k,
                &alpha.f32(),
                rhs.device_buffer.as_device_ptr().as_raw() as *const f32,
                n,
                self.device_buffer.as_device_ptr().as_raw() as *const f32,
                k,
                &beta.f32(),
                result.as_device_ptr().as_raw() as *mut f32,
                n,
            );
        }

        let result_shape = vec![self.shape[0], rhs.shape[1]];
        Ok(Self::_with_device_buffer(result_shape, result))
}
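The argument order in that call (n, m, k, with rhs passed before self) is the usual way to feed row-major data to a column-major GEMM: a row-major matrix read as column-major is its transpose, so asking for Cᵀ = Bᵀ·Aᵀ leaves C = A·B in row-major memory. The sketch below checks this on the CPU with a plain reference GEMM; both functions are illustrative stand-ins and do not call cuBLAS.

```rust
/// Reference column-major GEMM without transposes:
/// C (m x n) = A (m x k) * B (k x n), with BLAS-style leading dimensions.
pub fn gemm_col_major(
    m: usize,
    n: usize,
    k: usize,
    a: &[f32],
    lda: usize,
    b: &[f32],
    ldb: usize,
    c: &mut [f32],
    ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major product C (m x n) = A (m x k) * B (k x n), computed by asking the
/// column-major routine for C^T = B^T * A^T, i.e. with the operands swapped.
pub fn matmul_row_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    // Same order as the cuBLAS call above: (n, m, k), rhs first with ld = n,
    // lhs second with ld = k, output with ld = n.
    gemm_col_major(n, m, k, b, n, a, k, &mut c, n);
    c
}
```

No data is actually transposed or copied; only the roles of the two operands and the leading dimensions change.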

However, I was heartbroken initially, as the cuBLAS version took almost the same time as my naive implementation, and sometimes more. I profiled it again. Nothing looked suspicious, but there was not much change either.

Then I started reading about it. cuBLAS works best with really big matrices, which is not the case with the XOR test dataset. Fine for me. It works and I will definitely benefit when I tackle the much larger image reconstruction task.

The Conclusion

The story did not end there, but my energy and motivation were definitely taking a hit at that point. Another two weeks had passed since I resumed working on the Rust Tensor Library, and the Simba image reconstruction still seemed a far-fetched dream. Still, I had basically built a high-performance, custom-tailored engine from the ground up. I had moved past the "Initial Shock" of poor performance and tamed the memory allocation beast. While the perfect reconstruction of Simba was still a few training runs away, the infrastructure was capable of handling the load.

Part 8: Proving the Universal Approximation Theorem with Rust

Palash Kanti Kundu — Thu, 01 Jan 2026 14:33:11 +0000

Once I was done with the CUDA Integration in Python, it was time to return to the Rust program. I had to replicate the Neural Network into my Rust program.

I took a look at the code base. All the code was dumped into one single file, violating my personal coding hygiene of 'no more than 500 lines in a file unless absolutely necessary'. Apart from that, the logistic regression program was written as raw CUDA launches without any structure, which made me worry that I would have to write duplicate Neural Network modules to support both CPU and GPU.

I needed a plan to make things unified at the Tensor level, beyond that, things can differ but any high level implementation should call Tensor modules and things should work without worrying about where the math is being performed.

The Plan

Make a TensorFamily enum to determine where the mathematics would execute
Write a TensorBackend trait which holds all the methods of Tensor
Implement the TensorBackend trait for CpuTensor and GpuTensor structs
Finally, rewrite the Tensor struct which works as the Factory

While doing so, I stumbled upon some new learning in Rust - the dynamic trait and another rabbit hole...

The Pivot

The plan looked simple on paper. However, it did not work in real code. The compiler came back with multiple errors. I tried to use a dyn trait for TensorBackend and resolved a few of them. Along the way I understood a few new concepts, including why Rust prevented me from using a recursive memory allocation pattern, and then I got stuck. The compiler was being very reasonable; I was being completely unreasonable in my plan.

After around two hours of fighting with the compiler, I again had another thought of shutting down the project. I questioned my choices and left the desk for a walk around the block.

That is when the solution came: I did not need to make the actual Tensor unified. No matter what I did, I would still need two different execution methods, one for CPU and another for GPU. The user of the library (ironically, that is just me) would choose GPU or CPU for their workload. They may have installed high-end GPUs, but for a simple XOR Neural Network test, the GPU will actually be slower. I should not make assumptions and must leave the choice to the user.

With this newfound reconciliation, I returned to my desk. I devised a new design altogether, where a consistent set of methods will be exposed for both hardware types. Only difference would be: the CPU-bound tensor can query the memory immediately and return result while the GPU-based tensor needs an explicit D2H data copy mechanism. Until the D2H call happens, all data resides on GPU memory.
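The shape of that design can be sketched with a small trait. Everything here is hypothetical: the trait, its method names and the HostTensor type are illustrations of the idea, not the library's actual API.

```rust
/// Hypothetical minimal interface that both a CPU and a GPU tensor could expose.
pub trait TensorOps: Sized {
    fn from_host(data: Vec<f32>) -> Self;
    fn scale(&self, factor: f32) -> Self;
    /// Returns the data in host memory. A GPU-backed type would perform its
    /// explicit device-to-host copy here; until then data would stay on the device.
    fn to_host(&self) -> Vec<f32>;
}

/// Illustrative CPU-backed implementation: the data already lives in host memory.
pub struct HostTensor {
    data: Vec<f32>,
}

impl TensorOps for HostTensor {
    fn from_host(data: Vec<f32>) -> Self {
        Self { data }
    }

    fn scale(&self, factor: f32) -> Self {
        Self {
            data: self.data.iter().map(|v| v * factor).collect(),
        }
    }

    fn to_host(&self) -> Vec<f32> {
        self.data.clone()
    }
}

/// High-level code is written once against the trait; the caller picks the backend.
pub fn halve<T: TensorOps>(input: &T) -> T {
    input.scale(0.5)
}
```

With generics, the backend is chosen at compile time by the type the caller passes in, which avoids the dyn-trait problems described above.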

With this new idea, I abandoned all my plans and started fresh with writing GPU-based program separately.

However, this also did not go well, after a few more rounds of errors I stopped GPU programming completely.

It was a devastating and deeply demoralizing moment for me. My dream was shattering in front of my eyes. I knew there was no way a heavy workload would be completed by my CPU; I needed to work on the GPU side. But something in my mind told me quietly, 'don't worry, you will do it, but just not right now'. Somehow, I followed my inner voice and set my thinking brain aside for a few hours. I wrote the Rust CPU-bound neural network, following the Python script.

/// Element-wise sigmoid activation.
pub fn sigmoid<T>(input: &T) -> Result<T, String>
where
    T: CpuTensor<f32>,
{
    input.sigmoid()
}

/// Derivative of sigmoid; expects the activation output as input.
pub fn sigmoid_prime<T>(output: &T) -> Result<T, String>
where
    T: CpuTensor<f32>,
{
    let one_minus_out = T::ones(&output.get_shape()).sub(output)?;
    let res = output.multiply(&one_minus_out);

    res
}

/// Element-wise hyperbolic tangent activation.
pub fn tanh<T>(input: &T) -> Result<T, String>
where
    T: CpuTensor<f32>,
{
    input.tanh()
}

/// Derivative of `tanh`, expects activation output as input.
pub fn tanh_prime<T>(output: &T) -> Result<T, String>
where
    T: CpuTensor<f32>,
{
    let out_squared = output.multiply(output)?;

    let ones = T::ones(&output.get_shape());

    ones.sub(&out_squared)
}

/// Trait describing a loss function used for training and backpropagation.
///
/// Implementors must provide methods to compute the scalar loss tensor and
/// the derivative (loss prime) used as the starting point for backprop.
pub trait LossFunction<T>
where
    T: CpuTensor<f32>,
{
    /// Calculates the loss value (used for reporting).
    fn loss(&self, actual: &T, predicted: &T) -> Result<T, String>;

    /// Calculates the derivative of the loss w.r.t the predicted output (used for backpropagation).
    fn loss_prime(&self, actual: &T, predicted: &T) -> Result<T, String>;
}

/// Mean squared error loss implementation.
pub struct MeanSquaredErrorLoss;

impl<T> LossFunction<T> for MeanSquaredErrorLoss
where
    T: CpuTensor<f32>,
{
    fn loss(&self, actual: &T, predicted: &T) -> Result<T, String> {
        let error_diff = predicted.sub(actual)?;
        let sq_err = error_diff.multiply(&error_diff)?;

        // Total element count; the type annotation is needed for the cast below
        // (assuming u32 shape entries, as passed to `T::new`).
        let length: u32 = sq_err.get_shape().iter().product();

        sq_err.sum()?.scale(1.0 / (length as f32))
    }

    fn loss_prime(&self, actual: &T, predicted: &T) -> Result<T, String> {
        let n: u32 = actual.get_shape().iter().product();
        let factor = 2.0 / (n as f32);

        // d/dp of mean((p - a)^2) is 2 * (p - a) / n
        predicted.sub(actual)?.scale(factor)
    }
}
}

/// Fully-connected linear layer holding weights and an optional input cache.
pub struct LinearLayer<T>
where
   T: CpuTensor<f32>,
{
    weights: T,
    input_cache: Option<T>,
    name: String,
    layer_type: LayerType,
}

impl<T> LinearLayer<T>
where
    T: CpuTensor<f32>,
{
    fn _initialize_weights(
        input_size: u32,
        output_size: u32,
    ) -> Vec<f32> {
        let mut rng = rand::rng();
        let normal = Normal::new(0.0, 1.0).unwrap();

        let w_data: Vec<f32> = (0..(input_size * output_size))
            .map(|_| normal.sample(&mut rng) as f32)
            .collect();

        w_data
    }
    pub fn new(
        input_size: u32,
        output_size: u32,
        name: &str,
    ) -> Result<Self, String> {
        let w_data = Self::_initialize_weights(input_size, output_size);

        let weights = T::new(vec![input_size, output_size], w_data).unwrap();

        Ok(Self {
            weights,
            input_cache: None,
            name: name.to_string(),
            layer_type: LayerType::Linear,
        })
    }
}

impl<T> Layer<T> for LinearLayer<T>
where
    T: CpuTensor<f32>,
{
    fn name(&self) -> &str {
        &self.name
    }

    fn forward(&mut self, input: &T) -> Result<T, String> {
        // Adding a zero tensor yields a fresh copy of the input for the cache
        self.input_cache = Some(input.add(&T::zeroes(input.get_shape()))?);
        let matmul = input.mul(&self.weights)?;
        Ok(matmul)
    }

    fn backward(&mut self, output_error: &T, lr: f32) -> Result<T, String> {
        let input = self.input_cache.as_ref().ok_or("No forward pass cache!")?;

        // Calculate Input Error: error * weights.T
        let w_t = self.weights.t()?;
        let input_error = output_error.mul(&w_t)?;

        // Calculate Weights Gradient: input.T * error
        let input_t = input.t()?;

        let weights_grad = input_t.mul(output_error)?;

        // Update Parameters
        let w_step = weights_grad.scale(lr)?;
        self.weights = self.weights.sub(&w_step)?;

        Ok(input_error)
    }

    fn layer_type(&self) -> &LayerType {
        &self.layer_type
    }
}

/// Activation wrapper layer that applies element-wise activation functions.
pub struct ActivationLayer<T>
where
    T: CpuTensor<f32>,
{
    layer_type: LayerType,
    output_cache: Option<T>,
    name: String,
}

impl<T> ActivationLayer<T>
where
    T: CpuTensor<f32>,
{
    /// Creates an activation layer; the activation and its derivative are
    /// looked up from `layer_type` when the layer runs.
    pub fn new(name: &str, layer_type: LayerType) -> Self {
        Self {
            output_cache: None,
            name: name.to_string(),
            layer_type,
        }
    }
}

impl<T> Layer<T> for ActivationLayer<T>
where
     T: CpuTensor<f32>,
{
    fn name(&self) -> &str {
        &self.name
    }

    fn forward(&mut self, input: &T) -> Result<T, String> {
        let (activation, _) = get_activations(&self.layer_type);
        // Call the passed-in activation function
        let output = (activation)(input)?;

        // Caching the output for the backward pass
        self.output_cache = Some(output.add(&T::zeroes(output.get_shape()))?);
        Ok(output)
    }

    fn backward(&mut self, output_error: &T, _lr: f32) -> Result<T, String> {
        let out = self
            .output_cache
            .as_ref()
            .ok_or_else(|| "No output cache found for backward pass".to_string())?;

        let (_, activation_prime) = get_activations(&self.layer_type);

        // Call the passed-in activation prime function
        // Note: Many derivatives (like sigmoid/tanh) use the output 'y' rather than input 'x'
        let prime = (activation_prime)(out)?;

        prime.multiply(output_error)
    }

    fn layer_type(&self) -> &LayerType {
        &self.layer_type
    }
}

pub struct NeuralNet<T>
where
    T: CpuTensor<f32>,
{
    pub layers: Vec<Box<dyn Layer<T>>>,
    pub loss_fn: Box<dyn LossFunction<T>>,
}

impl<T> NeuralNet<T>
where
    T: CpuTensor<f32>,
{
    pub fn new(
        layers: Vec<Box<dyn Layer<T>>>,
        loss_fn: Box<dyn LossFunction<T>>,
    ) -> Self {
        Self {
            layers,
            loss_fn,
        }
    }
    /// Append a layer to the network.
    ///
    /// The provided box must implement the `Layer` trait for the network's
    /// tensor type `T`.
    pub fn add(&mut self, layer: Box<dyn Layer<T>>) {
        self.layers.push(layer);
    }

    /// Run a forward pass and return the network output for `input`.
    pub fn predict(&mut self, input: &T) -> Result<T, String> {
        let mut output = input.add(&T::zeroes(input.get_shape()))?;

        for layer in &mut self.layers {
            output = layer.forward(&output)?;
        }
        Ok(output)
    }

    /// Train on the full batch for `epochs` iterations with a fixed learning rate.
    pub fn fit(
        &mut self,
        x_train: &T,
        y_train: &T,
        epochs: usize,
        base_lr: f32,
    ) -> Result<(), String> {
        for i in 0..epochs {
            print!("\rProcessing epoch: {}/{}", i + 1, epochs);
            io::stdout().flush().map_err(|e| e.to_string())?;

            // Forward pass; adding a zero tensor yields a fresh copy of the input
            let mut output = x_train.add(&T::zeroes(x_train.get_shape()))?;
            for layer in &mut self.layers {
                output = layer.forward(&output)?;
            }
            T::synchronize(); // no-op for CPU but needed here to maintain consistency with GPU

            // Loss value, computed only for reporting
            let _err = self.loss_fn.loss(y_train, &output)?;
            T::synchronize();

            // Backward pass, from the last layer to the first
            let mut error_prime = self.loss_fn.loss_prime(y_train, &output)?;
            for layer in self.layers.iter_mut().rev() {
                error_prime = layer.backward(&error_prime, base_lr)?;
            }
            T::synchronize();
        }

        T::synchronize();
        Ok(())
    }
}

After around three hours, I was finally able to run my first Rust Neural Network program. Things went pretty well. To be honest, I never expected it to go so smoothly. I ran the XOR test in both the Python script and the Rust program. The results were very similar but not identical, because the weights were initialized randomly without a shared seed.
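The derivative helpers in the code above take the activation output rather than the input. A scalar sketch makes it easy to check those formulas against a finite-difference estimate; these free functions are illustrative and are not part of the tensor library.

```rust
/// Scalar sigmoid.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Derivative of sigmoid expressed through its output y = sigmoid(x): y * (1 - y).
pub fn sigmoid_prime_from_output(y: f32) -> f32 {
    y * (1.0 - y)
}

/// Derivative of tanh expressed through its output y = tanh(x): 1 - y^2.
pub fn tanh_prime_from_output(y: f32) -> f32 {
    1.0 - y * y
}

/// Central finite difference, used only to sanity-check the formulas.
pub fn numeric_derivative(f: fn(f32) -> f32, x: f32) -> f32 {
    let h = 1e-2;
    (f(x + h) - f(x - h)) / (2.0 * h)
}
```

Working from the cached output saves recomputing the exponential in the backward pass, which is why the activation layer stores output_cache instead of the input.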

Another success, another idea, another play time...

Approved!!!

The Universal Approximation Theorem

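The Universal Approximation Theorem states, roughly, that a feed-forward neural network with at least one hidden layer, enough hidden neurons and a suitable non-linear activation can approximate any continuous function on a bounded input range as closely as desired. It only promises that such weights exist; it says nothing about how long training takes to find them.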

Well, now I have a neural network running and it passed the non-linearity test with XOR operation. I can leave the computer switched on overnight to approximate any function. So, why not try it?

I needed a function to be approximated. I became a little dramatic here.

I wrote the numbers 1 to 10 and the operators +, -, *, /, ^ on chits of paper, put them in a bowl and drew 15 times. Please don't judge me. I still have no answer why I did that.

Anyways, I came up with this equation:

a(x) = (x² + x³ + x⁴ + x⁶) / x⁵
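Dividing each term by x⁵ shows that this is the same as x⁻³ + x⁻² + x⁻¹ + x, so the function blows up near x = 0 and behaves almost like a straight line for large |x|. A small Rust sketch for evaluating it, with an illustrative function name:

```rust
/// a(x) = (x^2 + x^3 + x^4 + x^6) / x^5, which is undefined at x = 0.
pub fn a(x: f64) -> Option<f64> {
    if x == 0.0 {
        return None;
    }
    Some((x.powi(2) + x.powi(3) + x.powi(4) + x.powi(6)) / x.powi(5))
}
```

For example, a(1.0) gives 4.0 and a(2.0) gives 92 / 32 = 2.875.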

And this gave me this plot

I made an arbitrary rule, the blue points are true and red points are false. I sampled 25 points for training and 6 for testing and tested my neural network against it.

The first attempt was unsuccessful, and I could not find the reason. I tried it with the Python script. That also failed. At that point, I doubted whether UAT could actually be applied to this function.

I needed an answer. To resolve the tension between my program's output and UAT, I wrote another script using sklearn and ran it against the same data. This time it succeeded, which meant my programs were the ones at fault.

Here comes the debugging hat...

A few print() and println!() statements later, I found the issue. Surprisingly, the culprit was normalization. It had helped me in all earlier cases, but this time it failed me.

I was normalizing the input and denormalizing the output, which made the prediction result incorrect. I removed the normalization and denormalization layer and it started working.
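One plausible way this kind of mistake shows up is sketched below. It assumes min-max scaling, which is a guess since the exact normalization is not stated, and the function names and the [-5, 5] range are hypothetical.

```rust
/// Min-max scaling of `x` from [min, max] to [0, 1].
pub fn normalize(x: f32, min: f32, max: f32) -> f32 {
    (x - min) / (max - min)
}

/// Inverse of `normalize`: maps a value in [0, 1] back to [min, max].
pub fn denormalize(y: f32, min: f32, max: f32) -> f32 {
    y * (max - min) + min
}
```

The pair is harmless as a round trip on the same quantity, but a true/false prediction of 0.9 pushed through denormalize(0.9, -5.0, 5.0) comes out as 4.0, which is no longer a usable label.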

Lesson learnt: normalization and denormalization are not always needed.

Voila!!!


╔════════════════════════════════╗
║ Iron Learn v5
║ Mode: CPU
║ Learning Rate: 0.0001
║ Epochs: 100
║ Data Path: data.json
╚════════════════════════════════╝

Predicted: 0.0000, Actual: 0.0000, ✓
Predicted: 0.9999, Actual: 1.0000, ✓
Predicted: 1.0000, Actual: 1.0000, ✓
Predicted: 0.0000, Actual: 0.0000, ✓
Predicted: 0.0024, Actual: 0.0000, ✓
Predicted: 0.0000, Actual: 0.0000, ✓

The output supports two things:

The Universal Approximation Theorem holds up in practice for this function
My Neural Network implementation is correct

Inventory Check

Another day just passed by fixing things. Not only in code but also in my mind. The day played heavy with my emotions.

Anyways, at that moment I had all these in the inventory:

A CPU Linear Regression
A CPU Logistic Regression
A GPU Linear Regression
A GPU Logistic Regression
A Python script to train GPU powered neural network
A Rust program to train Neural Network using CPU

At that point, the math was ready, the code was ready, the network proved to be working. I was ready to do something more with it, and I did. Something unexpected happened and I stumbled upon some great knowledge chunks.

Part 7: CUDA Integration with Python

Palash Kanti Kundu — Wed, 31 Dec 2025 15:30:24 +0000

After successfully setting up the Neural Network, I tested it with the XOR operation. XOR is a non-linear operation, so it is a kind of "Hello World" for checking that the network works and can detect non-linearity. The test ran smoothly, indicating that the network was ready to take on bigger assignments.

I had to replicate the same in Rust but my mind wouldn't let me do that. Whenever I tried to write a single line of Rust code, my mind started questioning the following:

What would happen, if I fed it a complex equation?
What would happen if I ran the linear regression data through neural network?
What would it do with the logistic regression dataset?
Will it work with CUDA?
Will the linear regression program work with CUDA?

So many unanswered questions. It broke my flow.

The CUDA based Linear Regression Program

As I was already working with CUDA at that point, the lowest-hanging fruit for me was to integrate CUDA into both the Python and Rust versions of my library. I switched over to the linear regression program to execute it on the GPU.

(Un)surprisingly, this did not work. My CPU-bound program was giving a very low Root MSE of about 7.7, but the GPU version was returning 40.

Another debugging session...

After searching the code line by line, I found I was returning a zeroed matrix back from GPU in my linear prediction function. I fixed it to return the actual output instead and it started behaving normally.

Results:
Total samples: 93
Mean Squared Error: 59.6779
Root MSE: 7.7251
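The two error figures are tied together: Root MSE is just the square root of the Mean Squared Error. A minimal Rust sketch, with illustrative function names:

```rust
/// Mean squared error between two equally long slices.
pub fn mean_squared_error(actual: &[f64], predicted: &[f64]) -> f64 {
    assert_eq!(actual.len(), predicted.len(), "slices must have the same length");
    assert!(!actual.is_empty(), "need at least one sample");
    let sum_sq: f64 = actual
        .iter()
        .zip(predicted)
        .map(|(a, p)| (p - a).powi(2))
        .sum();
    sum_sq / actual.len() as f64
}

/// Root MSE: the square root of the MSE, in the same units as the target.
pub fn root_mse(mse: f64) -> f64 {
    mse.sqrt()
}
```

Taking the square root of the reported 59.6779 gives about 7.7251, matching the Root MSE line in the results.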

Similarly, I wrote another CUDA program for logistic regression and it also went fine. Unfortunately, I somehow missed capturing the results of the CUDA Logistic Program.

One question shot down: Rust CUDA program can work with the regression datasets. Let's move on.

CUDA Integration with Python

Once I was happy with the results, I turned to the Python neural network script used for the XOR test dataset. I had already worked with bigger datasets, so running only the simple XOR test dataset did not feel right to me. I planned to run the linear regression and logistic regression datasets on the neural network too. Ideally, they should run without issue.

I wired the JSON file to the script and started training the neural network.

Oh boy, I just opened another rabbit hole.

No seriously...

On a huge data load, the CPU is not powerful enough to work through everything sequentially, even with the minuscule parallelization numpy does to distribute array work across CPU cores. Apparently, the email spam/ham dataset was the tipping point: my CPU could not handle the load anymore. I looked for a solution in scikit-learn and found that the library has limited GPU support through cupy, with some setup challenges. As I already understood the maths and had a numpy version of the Neural Network program handy, it was easier to switch my numpy program to cupy. Obviously, I would miss a lot of optimizations, but that would push me to learn optimization techniques down the road too.

With the solution planned, I fired up WSL, installed cupy and swapped numpy for cupy in the script. And there lay another bit of quicksand, which drowned me again.

The same program that finished thousands of iterations within 10 minutes in the numpy version could not even get through 1000 iterations in 10 minutes in the cupy version. Thankfully, I had been there earlier with my Rust code. It clicked almost instantly: host-to-device (H2D) and device-to-host (D2H) data copy overhead.

I ran back to the library documentation, found a fix and applied it. Two adjustments were needed: first, removing the error calculation at every epoch to avoid the D2H copy, and second, calling synchronize after a set of operations instead of after every epoch.

The Takeaway

After fixing the H2D and D2H copy overhead, the program worked decently. The same script that had needed about 10 minutes now ran within 10-15 seconds, a mere 40-60x speedup.

Honestly, I got addicted to the speed...

After a long day of fighting through setup and debugging, a little play time was approved.

I just started playing with it. First a single layer. Not much load. I then added two more layers, just for fun.

With this, the neural network brought the MSE of the linear regression dataset down even further:

Test MSE: 52.3265 (From earlier 59.6779)
Test MAE: 5.3395

The effort was worth it once again. Seeing the network learn step by step in each output was satisfying. And with the speed, I could afford to experiment: a lower learning rate, a higher epoch count (number of training iterations), and different network configurations such as the number of layers and the number of nodes in each layer.

Epoch 1/200000 | Error: 25.926468
Epoch 1000/200000 | Error: 0.482236
Epoch 2000/200000 | Error: 0.414885
Epoch 3000/200000 | Error: 0.377820
Epoch 4000/200000 | Error: 0.354329
Epoch 5000/200000 | Error: 0.340112
Epoch 6000/200000 | Error: 0.331060
Epoch 7000/200000 | Error: 0.324392
Epoch 8000/200000 | Error: 0.319276
Epoch 9000/200000 | Error: 0.315130
Epoch 10000/200000 | Error: 0.311793
Epoch 11000/200000 | Error: 0.308888
Epoch 12000/200000 | Error: 0.306242
Epoch 13000/200000 | Error: 0.303405
Epoch 14000/200000 | Error: 0.300487
Epoch 15000/200000 | Error: 0.298240
Epoch 16000/200000 | Error: 0.296392

While looking into the errors, I noticed how Gradient Descent works. At first, the errors are high and the network corrects quickly towards convergence, but as time passes and the network learns, the errors go down and so do the corrections.

I tried different learning rates. When I chose a bigger learning rate, the neural network did not converge: it went back and forth between two points and finally ended up with a higher error. On the other hand, a smaller learning rate gave me a smoother convergence but took very long to converge.
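The effect of the learning rate is easy to reproduce on a toy problem. The sketch below is not my network: it minimises the one-parameter function $f(w) = w^2$ with plain gradient descent. The function name `gradient_descent`, the starting point and the three learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(lr, steps, w0=5.0):
    """Minimise f(w) = w**2 with plain gradient descent; return every w visited.

    Illustrative toy problem, not the network from this post.
    """
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # the gradient descent update
        path.append(w)
    return path


too_big = gradient_descent(lr=1.0, steps=6)     # jumps between +5 and -5
moderate = gradient_descent(lr=0.1, steps=50)   # shrinks towards the minimum at 0
tiny = gradient_descent(lr=0.001, steps=50)     # right direction, but very slow

print(too_big)
print(moderate[-1], tiny[-1])
```

With a learning rate of 1.0 each update overshoots the minimum by exactly the distance it started from, so the value flips between two points and never settles. With 0.001 the steps are so small that 50 iterations barely move the weight.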

I also noticed that even with a tiny learning rate of 0.005, the network would almost stabilize after 20,000 epochs, yet it could still show slight oscillations.

Another observation was that running more and more training loops does not change the Mean Absolute Error too much. At some point, saturation is inevitable.

The best result I got with 2 hidden layers is as follows:

Training completed in 285.5212 seconds.

Final Results after 100000 epochs and learning rate 0.001:

Test MSE: 46.2653
Test MAE: 5.2730

This result is not vastly different from the following run:

Training completed in 117.2746 seconds.

Final Results after 40000 epochs and learning rate 0.005:

Test MSE: 52.6716
Test MAE: 5.3199

I also tried removing layers. The best result I got using just one hidden layer is as follows:

Training completed in 77.6143 seconds.

Final Results after 40000 epochs and learning rate 0.005:

Test MSE: 45.5669
Test MAE: 5.1707

By the time I was done with all my experiments, 6 hours had already passed. All these insights helped me understand neural networks a little better than before.

I always doubted the performance of my earlier Logistic Regression program written in Rust. With CUDA successfully integrated into that program, it was time for a fair comparison. I built a logistic regression in the neural network script by using just one linear layer followed by one sigmoid layer.

The result surprised me: my inefficiently written CUDA program was actually running faster than my cupy program.

I did not bother to find the reason, I was done for the day.

Before I log off for the day, here is a quick inventory check:

A CPU Linear regression
A CPU Logistic Regression
A GPU Linear Regression
A GPU Logistic Regression
A Python script to run GPU powered neural network

And you know what, this built the perfect foundation for more experimentation. Stay hooked to know more about what's next.

Part 6: Building the First Neural Network

Palash Kanti Kundu — Thu, 04 Dec 2025 13:42:25 +0000

My program was running successfully, using the GPU to do the heavy lifting. The matrix multiplication is still $O(n^3)$ work, but the GPU spreads it across thousands of parallel threads, so it finishes far faster. Having successfully integrated the CUDA code into the logistic regression module, I was happy. I know my GPU kernel programs are not the best by optimization standards, but I really didn't care too much at that point. I had enough of a performance boost to run 20000 training loops in under 10 seconds, which was taking almost 4 hours a few days earlier.

I'd rather deep dive into learning the next step of machine learning than focus on making my code super efficient.

But there is no harm in doing some kind of baseline check while my machine does the hard work and I take a rest. It was so satisfying to see the GPU being used by my own program that I could not resist the urge: I bumped up the number of epochs to 10 million. My GPU frequently hit 100% usage. Wow!!! A cherishable moment.

10 million iterations took almost 45 minutes on my machine, still less than the time it took the CPU to run only 5000 iterations. Accuracy also hit 93.09% this time.

After experimenting with other data and fine-tuning a few hyperparameters, I took my next step in the journey - building a neural network.

At that point my inventory included quite a few modules:

An almost accurate CPU Powered Tensor library
A Gradient Descent implementation
A few reusable CUDA kernel programs
A complete Gradient Descent orchestrator
A logistic regression program which returns accuracy almost identical to sklearn library.

The Maths Behind

The main objective of machine learning is to minimize the loss, calculated by the difference of actual test values and predicted values. To do this we heavily rely on mathematics.
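The two loss figures quoted throughout this series, MSE and MAE, can be computed in a few lines. This is a minimal sketch with made-up numbers; the function names and the sample values are assumptions for illustration, not my actual data.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences; large misses are punished more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences; every miss counts linearly."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mse = mean_squared_error(actual, predicted)   # (1 + 0 + 4) / 3
mae = mean_absolute_error(actual, predicted)  # (1 + 0 + 2) / 3
print(mse, mae)
```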

Linear Regression

For a refresher, let's revisit how Linear Regression works.
The objective of Linear Regression is to find a fitting straight line through the data. The equation for a line is represented as follows:

$$
y = mx + c
$$

In linear regression, we have to find the $m$ and the $c$ from $y$ and $x$ where $x$ is input data and $y$ is the output for that $x$.

Linear Regression starts with the inputs ($x$), a random weight matrix ($m$) and a bias matrix ($c$), and follows a routine of matrix multiplication and addition to arrive at a prediction. Then it compares the predicted result with the actual target values from the training data and calculates the loss.

The loss is then used to adjust the weights, using calculus to move towards the minimum of the loss. The same process is repeated many times. Eventually the predicted output starts matching the actual output very closely.
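The loop described above can be sketched for a single input feature. This is a simplified illustration, not my Rust implementation: it starts from zero instead of random weights, uses the mean squared error as the loss, and the name `fit_line`, the learning rate and the epoch count are assumptions.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0              # assumption: start from zero, not random values
    n = len(xs)
    for _ in range(epochs):
        predictions = [m * x + c for x in xs]
        errors = [p - y for p, y in zip(predictions, ys)]
        # Partial derivatives of the mean squared error with respect to m and c.
        grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_c = 2.0 / n * sum(errors)
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c


# Synthetic data generated from the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

m, c = fit_line(xs, ys)
print(m, c)
```

Because the data lies exactly on a line, the fitted slope and intercept end up very close to 2 and 1.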

Logistic Regression

The logistic regression also does the same thing, only with an extra step. It does the same $y = mx + c$ calculation and then it wraps the result into a sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Then it measures the loss and follows the same process as earlier to get into an optimal solution.
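As a small sketch of that extra step, the snippet below wraps $mx + c$ in the sigmoid and turns the result into a yes/no answer. The weight, the bias and the 0.5 threshold are hypothetical values picked for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, m, c):
    """Linear step followed by the sigmoid, as in logistic regression."""
    return sigmoid(m * x + c)


def predict_label(x, m, c, threshold=0.5):
    """Turn the probability into a binary decision (threshold is an assumption)."""
    return 1 if predict_probability(x, m, c) >= threshold else 0


# Hypothetical weight and bias, not trained values.
m, c = 1.5, -3.0
print(predict_probability(4.0, m, c), predict_label(4.0, m, c))
print(predict_probability(1.0, m, c), predict_label(1.0, m, c))
```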

Neural Network

So far so good, we can predict a continuous variable which follows a straight line, we can predict a binary value.

However, in the real world, not everything is either a line or a simple $true$/$false$ category. We can have data that follows a non-linear path. So, we need to introduce non-linearity in our process too. We do that by wrapping our linear function output in a non-linear function. And it has been observed that if we stack multiple layers of this chain of linear and non-linear functions, we can approximate an arbitrary function just by looking at its inputs and outputs.

That's exactly what is done in a neural network.
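XOR is the classic case that no single straight line can separate, yet two stacked layers handle it. The sketch below uses hand-picked weights and a ReLU non-linearity; both are assumptions for illustration (my own network used sigmoid and learned its weights through training).

```python
def relu(z):
    """A simple non-linear function: negative values become zero."""
    return max(0.0, z)


def xor_network(x1, x2):
    # Layer 1: two linear units, each wrapped in the non-linearity.
    # The weights and biases are hand-picked, not learned.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Layer 2: one linear unit combining the hidden outputs.
    return 1.0 * h1 - 2.0 * h2


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_network(a, b) for a, b in inputs]
print(outputs)
```

Removing the `relu` calls collapses the two layers into a single linear function, which cannot produce this output pattern.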

Well, so we get the non-linear output following a series of different linear and non-linear function outputs. But, to minimize the loss, we have to update the starting weight and bias matrices. But the loss will only be calculated in the end, after the prediction is made.

What could be the solution?

It seems great mathematicians solved this problem centuries ago. The answer is the Chain Rule. For $y = f(g(x))$:
$$
\frac{dy}{dx} = f'(g(x)) \cdot g'(x)
$$

Example
$$
\begin{aligned}
\text{Let } y &= (3x^2 + 5)^4 \\
\text{So, } f'(g(x)) &= 4(3x^2 + 5)^3 \quad \text{and} \quad g'(x) = 6x \\
\text{then} \quad \frac{dy}{dx} &= 4(3x^2 + 5)^3 \cdot (6x) \\
\text{or} \quad \frac{dy}{dx} &= 24x(3x^2 + 5)^3
\end{aligned}
$$

This formula tells us how much each layer contributes to the final output. In the same way, we calculate the loss at the last step and send it backward, layer by layer, to the first layer, adjusting the weights and biases along the way so that the final loss shrinks.

That's why we call this step of the calculation - Back Propagation.
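The chain rule example above, $y = (3x^2 + 5)^4$, can be checked numerically by comparing the derived formula with a finite-difference estimate. The helper names and the step size `h` are assumptions for this sketch.

```python
def y(x):
    """The example function y = (3x^2 + 5)^4."""
    return (3 * x**2 + 5) ** 4


def dy_dx(x):
    """Derivative obtained with the chain rule: 24x(3x^2 + 5)^3."""
    return 24 * x * (3 * x**2 + 5) ** 3


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference; an approximation, not an exact derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
analytic = dy_dx(x)
numeric = numerical_derivative(y, x)
print(analytic, numeric)
```

The two numbers agree up to a small approximation error, which is the same kind of check often used to validate a hand-written backward pass.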

I fed it the XOR dataset as a test. I reran the program and it ran successfully.

It sparked my curiosity: what happens if I feed it the real email dataset? As I thought, so I did. It took me a while to get it working with the real pass/fail and email spam datasets, but it worked on both, and I got 92.85% accuracy with 1000 training iterations and a learning rate of 0.1.

I thought of running it against the linear data set and see what happens. It failed when my activation function was sigmoid. I had to make a few changes in the program to support linear output.

However, I noticed that without GPU support, even my simple numpy-based neural network script crashed my machine. I know numpy does many CPU optimizations under the hood, but they are peanuts in front of the massive $O(n^3)$ calculation load.

The initial success of the python program pushed me to start a new journey.

Part 5: The Comeback

Palash Kanti Kundu — Thu, 04 Dec 2025 06:32:47 +0000

After failing in my last attempt at integrating the CUDA code into my library, I resorted to the CPU. The logistic regression program was running perfectly fine for a small 100-row, 2-column, dataset. The next logical step was to build a two-layer "toy" neural network.

The neural network doubled the number of matrix multiplications. The logistic regression program was taking around an hour for just 1000 iterations, which already made me impatient. The small two-layer toy neural network took it even further: approximately two hours to run 1000 iterations.

And this was just the beginning, with just two layers. If I actually had to do some complex work, I would have to go beyond two layers and most probably I would need more than one perceptron in each layer, which would contribute to a polynomial growth in the computational complexity of the program.
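A rough way to see this growth is to count the multiply-add operations in one forward pass through fully connected layers. The sketch below is a back-of-the-envelope model only: the function name and the wider layer sizes are assumptions, and it ignores the backward pass, which adds more of the same work.

```python
def multiply_adds(samples, layer_sizes):
    """Count multiply-add operations for one forward pass through dense layers.

    layer_sizes lists the number of values entering and leaving each layer,
    e.g. [2, 1] is 2 input columns feeding a single perceptron.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += samples * n_in * n_out
    return total


# 100 rows and 2 columns, as in the small dataset mentioned above.
single = multiply_adds(100, [2, 1])         # logistic regression
two_layer = multiply_adds(100, [2, 1, 1])   # one extra single-perceptron layer
wider = multiply_adds(100, [2, 16, 16, 1])  # hypothetical wider network
print(single, two_layer, wider)
```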

It felt very frustrating to wait for so long, especially when I could have spent some time and made the GPU work.

The wait time for a decent output was beyond my patience, compelling me to work with smaller datasets, which made all my attempts feel like merely a Hello World program. It appeared crystal clear to me that if I were to complete this project, I would definitely have to push my CPU-based, sequential program to a GPU-based, parallel program.

I rolled up my sleeves again to find a solution.

If NVIDIA CUDA examples can run on the GPU, the individual Rust Program can run on the GPU, why can't my library program run on GPU?

Another Pivot

I changed my course of action. I started simply with element-wise addition this time. I wrote a simple program and ran it for 10 elements: it went fine.

This is the simple CUDA code I used to verify the element-wise addition functionality. It confirmed the basic steps of memory allocation and kernel launch.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__global__ void add(const float* A, const float* B, float* C, int N) {
    int rowID = threadIdx.y + blockIdx.y * blockDim.y;  // Row address
    int colID = threadIdx.x + blockIdx.x * blockDim.x;  // Column Address
    int elemID;                                         // Element address

    if(rowID < N && colID < N){
        elemID = colID + rowID * N;                 
        C[elemID] = A[elemID] + B[elemID];
    }
}

int main() {
    int n = 10000;
    float *a, *b, *c;
    float *d_a, *d_b, *d_c;
    float size = n;  // Bug discussed below: this is the element count, not the byte count

    a = (float *)malloc(size);
    b = (float *)malloc(size);
    c = (float *)malloc(size);

    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

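    int threadsPerBlock = 1024;
    // Bug discussed below: a single block of 1024 threads cannot cover all n elements
    add<<<1, threadsPerBlock>>>(d_a, d_b, d_c, n);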

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++)
        printf("%f + %f = %f\n", a[i], b[i], c[i]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

I bumped the array size to 100,000,000, and it started returning 0 for many elements. I brought it back to lower numbers. It went fine for 10, 100 and 1000 elements, but started returning 0 when I went for 10000.

This was unusual for me. I put on my debugging hat. The first mistake was in calculating memory. I was making a float array, but I was setting the allocation size to $n$ (the element count) instead of $n \times \text{sizeof(float)}$ bytes.

Typical type/memory error...
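The arithmetic behind that mistake, sketched in Python with `ctypes` standing in for C's `sizeof` (the variable names are assumptions for illustration):

```python
import ctypes

n = 10_000                                    # number of float elements wanted
float_size = ctypes.sizeof(ctypes.c_float)    # bytes per C float

wrong_size = n                 # what the buggy code passed as the byte count
right_size = n * float_size    # what malloc and cudaMalloc actually need

# How many floats really fit in the undersized allocation.
floats_that_fit = wrong_size // float_size
print(float_size, right_size, floats_that_fit)
```

The undersized buffer only has room for a quarter of the elements, so most reads and writes land outside the allocated memory.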

The Memory Management

I ran the program again, hoping to get a correct response this time. It did not go well this time either.

Then, there it was: I was launching only one block with 1024 threads, instead of launching enough threads to cover every element of my array. With this clue, I changed the launch configuration as follows -

int threadsPerBlock = 1024;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

I ran the program with giant arrays of size 100,000,000. It started giving the correct result instead of returning 0. The solution was right there.
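The block count calculation is a ceiling division. The Python sketch below mirrors the arithmetic from the fix and shows how many elements a given launch can reach; the helper names are assumptions for illustration.

```python
def blocks_needed(n, threads_per_block=1024):
    """Ceiling division: the smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def covered_elements(n, blocks, threads_per_block=1024):
    """How many of the n elements get a thread assigned to them."""
    return min(n, blocks * threads_per_block)


print(blocks_needed(1_000))           # one block is enough
print(blocks_needed(10_000))          # needs several blocks
print(covered_elements(10_000, 1))    # a single block reaches only 1024 elements
print(blocks_needed(100_000_000))
```

This matches what I saw: 10, 100 and 1000 elements fit inside a single block, while at 10000 elements everything past index 1023 was never computed.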

Fun time! I played mindlessly with this program until I lost interest in it.

The Integration

Once my last toy stopped amusing me, I went on to build a new one. I took another building block - Rust, and started integrating my Vector Addition program into my library.

The first step was to separate the CUDA kernel from the main logic. I deleted the main function from the cuda code and ran the following to generate the ptx output, which I would consume in the library.

iron_learn/kernels$ nvcc -c matrix_mul.cu -o matrix_mul.ptx --ptx

I also planned to add a new flag in the main runnable app in my Rust program to indicate if the program should be run on CPU or on GPU.

After fixing various issues for around 2 hours, I was finally on track, with the CPU/GPU flag in place.

To do this, I had to go through a few steps. First, I created an application context to keep the CUDA context alive throughout the lifetime of the program. CUDA context is slow to initialize, so it is a wise choice to keep the CUDA context alive.

I used Rust's OnceLock. It is thread-safe and prevents race conditions, ensuring the global CUDA context is initialized only once. It was definitely a great learning experience. The application context works as a singleton holding all the application parameters I wanted to keep alive throughout the entire run.

use std::sync::OnceLock;

#[derive(Debug)]
pub struct AppContext {
    pub app_name: &'static str,
    pub version: u32,
    pub gpu_enabled: bool,
    pub context: Option<cust::context::Context>,
}

// Declare a global static that can be set exactly once
pub static GLOBAL_CONTEXT: OnceLock<AppContext> = OnceLock::new();

// Initialize the context
pub fn init_context(
    app_name: &'static str,
    version: u32,
    gpu_enabled: bool,
    context: Option<cust::context::Context>,
) {
    let ctx = AppContext {
        app_name,
        version,
        gpu_enabled,
        context,
    };
    GLOBAL_CONTEXT
        .set(ctx)
        .expect("Context can only be initialized once");
}

That kept the CUDA context alive, but even after dozens of changes, the CUDA code wasn't giving me the proper result. I wrote a C program to validate my CUDA code, and it gave the correct result.

The Debugging Hat, One more time 😔

At some point, it struck me: I might be sending data in wrong format. My tensor code was a linear 1D array of data. But the CUDA kernel was treating it as a 2D matrix. With this newfound idea, I started comparing my C code and Rust code.
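The relationship between a flat 1D buffer and the 2D matrix a kernel sees comes down to one index formula. This sketch assumes row-major storage, which is what the kernels in this post use; the helper name is an assumption.

```python
def flat_index(row, col, num_cols):
    """Row-major position of element (row, col) inside a flat buffer."""
    return row * num_cols + col


data = [1, 2, 3, 4, 5, 6]   # the same six numbers in one flat buffer

# Read as 2 rows x 3 columns.
as_2x3 = [[data[flat_index(r, c, 3)] for c in range(3)] for r in range(2)]
# Read as 3 rows x 2 columns: same memory, different matrix.
as_3x2 = [[data[flat_index(r, c, 2)] for c in range(2)] for r in range(3)]

print(as_2x3)
print(as_3x2)
```

Passing the wrong row or column count does not raise an error; the kernel simply reads a different matrix out of the same memory.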

A few things were actually off:

I was sending the wrong matrix row and column counts for the calculations. 🤦
I was using f64 (double precision), but my CUDA code was expecting f32 (single precision).
The C program did not complain and worked fine, as it was already using float.
What I had to do was change my CUDA code to standardize the precision to f64 instead of f32.

Learning: Precision matters
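To illustrate why this kind of mismatch produces garbage rather than an error, the sketch below packs two double-precision values into bytes and reads the same bytes back as single-precision values. It uses Python's `struct` module with little-endian layout as a stand-in; it is a model of the general effect, not a trace of what my kernel did.

```python
import struct

# Two f64 values laid out in memory (little-endian), 16 bytes in total.
raw = struct.pack("<2d", 1.0, 2.0)

# The same 16 bytes read by code that expects f32: four unrelated numbers.
misread = struct.unpack("<4f", raw)

# Even a correct conversion to f32 loses precision.
narrowed = struct.unpack("<f", struct.pack("<f", 0.1))[0]

print(misread)
print(narrowed)
```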

After successfully integrating the CUDA code into my program, I found it was taking longer than my CPU code: around 44 seconds for 10 iterations. I had seen this before. It was caused by copying data to the device and back to the host for each GPU multiplication. So, my CUDA code may have been running in microseconds, but the data transfer took far longer, defeating the purpose of offloading the work to the GPU.

I verified my thought process with another log. For 10 iterations, 20 matrix multiplications are running. Around 200 milliseconds are spent on the actual matrix multiplications, while the whole process takes 44 seconds.

So, to get the full benefit, I had to move my whole logic inside the CUDA program.

I wrote a few small GPU kernels to run inside CUDA:

#include <cuda.h>
#include <cuda_runtime.h>

// CUDA kernel for matrix multiplication
extern "C" __global__ void matrixMulKernel(double *A, double *B, double *C, int numARows, int numAColumns, int numBColumns)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < numARows && Col < numBColumns)
    {
        double Cvalue = 0.0;

        for (int k = 0; k < numAColumns; ++k)
        {
            Cvalue += A[Row * numAColumns + k] * B[k * numBColumns + Col];
        }

        C[Row * numBColumns + Col] = Cvalue;
    }
}

extern "C" __global__
void sigmoidKernel(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = 1.0f / (1.0f + expf(-in[idx]));
    }
}

extern "C" __global__
void vectorSub(const float* a, const float* b, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = a[idx] - b[idx];
    }
}

extern "C" __global__
void scaleVector(float* v, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        v[idx] *= scale;
    }
}

extern "C" __global__
void updateWeights(float* w, const float* grad, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        w[idx] -= grad[idx];
    }
}

I then wrote the orchestration in Rust:

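// Imports this function relies on. `Data` is my own JSON-backed struct (not shown here).
use cust::prelude::*;
use std::fs;
use std::time::Instant;

pub fn run_ml_cuda() -> cust::error::CudaResult<()> {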
    let l: f32 = 0.001; // learning rate
    let e: usize = 5000; // epochs

    let contents = fs::read_to_string("data.json").expect("Failed to read data.json");
    let data: Data = serde_json::from_str(&contents).unwrap();
    let Data { logistic, .. } = data;

    let rows = logistic.m as usize;
    let cols = logistic.n as usize;

    // Load PTX module
    let ptx = include_str!("../kernels/gradient_descent.ptx");
    let module = Module::from_ptx(ptx, &[])?;

    // Retrieve kernels
    let matrix_mul = module.get_function("matrixMulKernel")?;
    let sigmoid = module.get_function("sigmoidKernel")?;
    let vector_sub = module.get_function("vectorSub")?;
    let scale_vec = module.get_function("scaleVector")?;
    let update_w = module.get_function("updateWeights")?;

    // Allocate device buffers
    let d_X = DeviceBuffer::from_slice(&logistic.x)?;
    let d_y = DeviceBuffer::from_slice(&logistic.y)?;
    let d_w = DeviceBuffer::from_slice(&vec![0.0f32; cols])?;
    let d_lines = DeviceBuffer::<f32>::zeroed(rows)?;
    let d_prediction = DeviceBuffer::<f32>::zeroed(rows)?;
    let d_loss = DeviceBuffer::<f32>::zeroed(rows)?;
    let d_grad = DeviceBuffer::<f32>::zeroed(cols)?;

    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Kernel launch params
    const TILE: u32 = 32;
    let block2d = (TILE, TILE, 1);
    let grid_x = ((cols as u32) + TILE - 1) / TILE;
    let grid_y = ((rows as u32) + TILE - 1) / TILE;
    let grid2d = (grid_x, grid_y, 1);

    let block1d = (256, 1, 1);
    let grid_rows = ((rows as u32) + 255) / 256;
    let grid_cols = ((cols as u32) + 255) / 256;

    let start = Instant::now();

    // Training loop
    for i in 0..e {
        // 1. lines = X * w
        unsafe {
            launch!(matrix_mul<<<grid2d, block2d, 0, stream>>>(
                d_X.as_device_ptr(),
                d_w.as_device_ptr(),
                d_lines.as_device_ptr(),
                rows as i32,
                cols as i32,
                1i32
            ))?;
        }

        // 2. prediction = sigmoid(lines)
        unsafe {
            launch!(sigmoid<<<(grid_rows,1,1), block1d, 0, stream>>>(
                d_lines.as_device_ptr(),
                d_prediction.as_device_ptr(),
                rows as i32
            ))?;
        }

        // 3. loss = prediction - y
        unsafe {
            launch!(vector_sub<<<(grid_rows,1,1), block1d, 0, stream>>>(
                d_prediction.as_device_ptr(),
                d_y.as_device_ptr(),
                d_loss.as_device_ptr(),
                rows as i32
            ))?;
        }

        // 4. gradient = X^T * loss
        unsafe {
            launch!(matrix_mul<<<(grid_x,1,1), block2d, 0, stream>>>(
                d_X.as_device_ptr(),
                d_loss.as_device_ptr(),
                d_grad.as_device_ptr(),
                cols as i32,
                rows as i32,
                1i32
            ))?;
        }

        // 5. scale gradient
        unsafe {
            launch!(scale_vec<<<(grid_cols,1,1), block1d, 0, stream>>>(
                d_grad.as_device_ptr(),
                l / rows as f32,
                cols as i32
            ))?;
        }

        // 6. update weights
        unsafe {
            launch!(update_w<<<(grid_cols,1,1), block1d, 0, stream>>>(
                d_w.as_device_ptr(),
                d_grad.as_device_ptr(),
                cols as i32
            ))?;
        }

        stream.synchronize()?;

        if i % 500 == 0 {
            println!("Iteration {} complete", i);
        }
    }

    let duration = start.elapsed();
    println!("GPU Logistic Regression Training Time: {:.2?}", duration);

    // Copy final weights back
    let mut w_host = vec![0.0f32; cols];
    d_w.copy_to(&mut w_host)?;
    // Slice defensively so this does not panic when there are fewer than 10 weights
    println!("Final weights (first 10): {:?}", &w_host[..cols.min(10)]);

    Ok(())
}

And gave it a spin. To my surprise, the whole process took only 11.34 seconds. I was astonished; the same process had taken 1 hour on the CPU.

That's the power of parallelism.

The Cherishable Moment

Although I integrated the CUDA code in my library, I was disheartened. Speed without accuracy meant nothing. The joy was short-lived when I again ran into NaN...

I did not have to debug much; it was a data type mismatch again. Everywhere in the CUDA code I had written $f32$, except for the matrix multiplication inputs. I changed the remaining $f32$ usages to $f64$ (double precision) for consistency.

After a few fixes and some more code, I finally ran the prediction. It did complete within 11 seconds, but with only 7.42% accuracy.

I wore the debugging hat again, for the last time. A few $f32$ values needed changing, and a few matrix dimensions were set wrongly. Once all these were fixed, I reached 54% accuracy, but my CPU process gave 92% accuracy.

Something was still missing. I found that the transpose function was not correctly returning the result. So, I changed the implementation. And the new implementation returned 92.85% accuracy over 20,000 iterations.
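To show why a faulty transpose hurts accuracy so badly, the sketch below compares a real transpose with simply relabelling the dimensions of the same flat buffer. This is one plausible way such a bug can arise, offered as an illustration; the helper names and sample numbers are assumptions, not my actual implementation.

```python
def matvec(a_flat, rows, cols, v):
    """Multiply a rows x cols row-major matrix by a vector of length cols."""
    return [sum(a_flat[r * cols + k] * v[k] for k in range(cols))
            for r in range(rows)]


def transpose(a_flat, rows, cols):
    """Return the cols x rows transpose of a rows x cols row-major matrix."""
    return [a_flat[r * cols + c] for c in range(cols) for r in range(rows)]


x = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]      # 2 samples x 3 features, stored row-major
loss = [1.0, 10.0]       # one loss value per sample

# Gradient as intended: X^T * loss, using a real transpose.
correct = matvec(transpose(x, 2, 3), 3, 2, loss)

# Same buffer with the dimensions swapped but the data left in place.
relabelled = matvec(x, 3, 2, loss)

print(correct)
print(relabelled)
```

Both calls run without complaint and return three numbers, but only the first is the gradient. The second pushes the weights in a wrong direction on every iteration.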

Metric	CPU Performance (Baseline)	GPU Performance (Final Fix)
Time (5000 iterations)	≈ 1 hour	11.34 seconds
Accuracy (20,000 iterations)	92%	92.85%

Finally, I could use my GPU through my Rust Library. The GPU actually aided my learning for the first time...

Part 4: The Setback and Heart Break

Palash Kanti Kundu — Thu, 04 Dec 2025 06:30:25 +0000

The GPU Matrix Multiplication program was a success; or so I thought. At that point, the inventory looked like the following:

A rudimentary Tensor Library written in Rust
Gradient Descent implementation without advanced optimizers (like Adam or RMSprop) - basic but works
A separate project with code to run Matrix Multiplication on GPU
All the wiring was done using Rust, the cust library and the external GPU kernel program

In a nutshell, I was ready to put the GPU performance tweak into my library and roll with speed.

Copy, Paste, and...

Rust Compiler's Full Blow

The moment I put the cust code into my library, the Rust compiler started screaming at me with multiple errors. The compiler correctly pointed out that DeviceCopy trait from cust library has not been implemented for my custom types.

Ah, the classic trait bound error I almost forgot after working in Python and JS for the last 14 months. Rust is so secure, it won't let me play with memory carelessly. Well, the cust library takes a step further and makes this even harder for any types which refer to raw pointers. Vec<u32> and Vec<T> obviously belong to that group, and they are the backbone of my Tensor.

I started checking the compiler errors and then began implementing the DeviceCopy for Tensor. I went through many error cycles and finally discarded all my changes.

Then it struck me: my Tensor type can't implement DeviceCopy because it holds a Vec, which owns a pointer to heap memory. However, my standalone GPU matrix multiplication program also used Vec. What was the difference? Curious, I looked through that code again and found that it passes a slice of the Vec<T> to the device buffer.

// Allocate device buffers and copy inputs
let mut d_a = DeviceBuffer::from_slice(&host_a)?;
let mut d_b = DeviceBuffer::from_slice(&host_b)?;
let mut d_c = DeviceBuffer::from_slice(&host_c)?;

There it was. If my element type T implements Copy and contains no references, then I can use the same code.

With this newfound learning, I went back to my Numeric trait and found that it is already bound by the Copy trait, and all known implementations of this trait are primitive types, which hold no references. I added the DeviceCopy trait bound to my Numeric trait as follows -

pub trait Numeric:
    Copy
    + DeviceCopy
    + std::ops::Add<Output = Self>
    + std::ops::Sub<Output = Self>
    + std::ops::Mul<Output = Self>
    + std::ops::Div<Output = Self>

That threw a new error on my Complex type, which implements Numeric but not DeviceCopy. This was simple to fix for my Complex struct: both of its members are f64, for which the cust library already implements DeviceCopy.

I simply added the DeviceCopy to my original list of derive implementations of Complex type.

#[derive(Debug, PartialEq, Copy, Clone, DeviceCopy)]

Voila, no error.

So, now all the trait bounds are implemented correctly. I just need to copy the kernel files and run it.

I wrote a separate GPU Matrix Multiplication function for this -

fn _gpu_mul(&self, rhs: &Self) -> Result<Self, String> {
        let rows = self.shape[0] as usize;
        let cols = rhs.shape[1] as usize;
        let common_dim = self.shape[1] as usize;

        let mut data = vec![T::zero(); rows * cols];

        // println!("Launching GPU Kernel for Matrix Multiplication...");
        // Keep the context alive for the whole function scope.
        let _ctx = cust::quick_init();

        let d_a = DeviceBuffer::from_slice(&self.data).unwrap();
        let d_b = DeviceBuffer::from_slice(&rhs.data).unwrap();
        let d_c = DeviceBuffer::from_slice(&data).unwrap();

        let _ctx = cust::quick_init();

        // PTX produced from kernels/matrix_mul.cu
        let ptx = include_str!("../kernels/matrix_mul.ptx");
        let module = Module::from_ptx(ptx, &[]).unwrap();
        let function = module.get_function("matrix_mul").unwrap();

        let stream = Stream::new(StreamFlags::NON_BLOCKING, None).unwrap();

        // Kernel launch params (must match TILE used in the .cu)
        const TILE: u32 = 16;
        let block = (TILE, TILE, 1);
        let grid_x = ((cols as u32) + TILE - 1) / TILE;
        let grid_y = ((rows as u32) + TILE - 1) / TILE;
        let grid = (grid_x, grid_y, 1);

        unsafe {
            let result = launch!(
                function<<<grid, block, 0, stream>>>(
                    d_a.as_device_ptr(),
                    d_b.as_device_ptr(),
                    d_c.as_device_ptr(),
                    rows as i32,
                    cols as i32,
                    common_dim as i32
                )
            );

            match result {
                Ok(_) => {}
                Err(e) => return Err(format!("CUDA Kernel Launch Error: {}", e)),
            }
        }

        let result = stream.synchronize();

        match result {
            Ok(_) => {}
            Err(e) => return Err(format!("CUDA Stream Synchronization Error: {}", e)),
        }

        let result = d_c.copy_to(&mut data);

        match result {
            Ok(_) => {}
            Err(e) => return Err(format!("CUDA Device to Host Copy Error: {}", e)),
        }

        Ok(Tensor {
            shape: vec![rows as u32, cols as u32],
            data,
        })
    }

I ran the program, and to my surprise, it took longer than the original CPU-based Matrix Multiplication program. It seemed like my GPU was blocked for quite some time. Something was amiss, for sure. I halted the program after waiting for approximately two minutes. It was definitely not the "Flash" performance I expected.

The bubble just burst...

Time for another debugging session.

The Debugging Phase

Looking through the program, I noticed I was initializing the context twice in the GPU multiplication function. I removed one instance and ran it again. Same issue. Then I thought: my Gradient Descent runs in a loop, and the multiplication function is called within that loop many times. That had to be the explanation: the context was being initialized again on every call, causing the delay. What if I moved the initialization somewhere else?

I moved the initialization into the entry point itself, the main function, and ran it again.

The result shattered me: It took 55 seconds - longer than CPU Multiplication, and I was back with NaN output in linear regression and 30% accuracy for logistic regression.

Debugging time...

First, I checked whether my CPU multiplication method still worked. My trustworthy CPU-based matrix multiplication function still worked and gave the same result as before.

It was definitely the GPU matrix multiplication program that was taking longer and was still computing inaccurate results.

The first change I made was to bring down the iterations to 10 and print each step in the GPU-based function.

For each multiplication, the hardware was completing calculations as follows:

GPU = ~750-800 µs
CPU = ~20 - 25 µs

Yeah, you read it right. CPU-based program rocks, GPU shocks...

I dove deeper into the results, adding more logs after each step. The logs unmasked the culprit: most of the time was taken by data copy (Host to Device and Device to Host), while only a fraction was actually spent on GPU kernel execution.

Okay, the first issue was nailed down: data transfer overhead. I chalked up a plan. If I ran the whole training loop inside CUDA, I should see a performance boost.

A new course of action

Copy both the input matrices from JSON to main memory
Copy the same into GPU
Run the training loop
Get the computed weight and bias matrix back from GPU
Store the results
Next time onwards use it for prediction (either through GPU or CPU)
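The benefit of this plan can be estimated by counting host/device copies under each strategy. The model below is a deliberate simplification: it assumes two matrix multiplications per training iteration, three copies per multiplication in the old approach, and two uploads plus two downloads in the new one. The function names and these counts are assumptions for illustration, and it ignores how large each copy is.

```python
def copies_per_call(iterations, multiplications_per_iteration):
    """Old approach: every multiplication uploads two inputs and downloads one result."""
    copies_per_multiplication = 3
    return iterations * multiplications_per_iteration * copies_per_multiplication


def copies_resident(inputs_uploaded=2, results_downloaded=2):
    """New plan: upload the inputs once, train on the device, download weights and bias once."""
    return inputs_uploaded + results_downloaded


old_10 = copies_per_call(10, 2)
old_5000 = copies_per_call(5000, 2)
new_any = copies_resident()
print(old_10, old_5000, new_any)
```

In this model the number of copies in the old approach grows with every iteration, while in the new plan it stays constant no matter how long the training loop runs.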

But what about the inaccuracy part?

Let's dive a little deeper into that.

I ran all the test cases using the GPU matrix multiplication function. Almost all the tests associated with matrix multiplication failed. I then switched back to the CPU-based matrix multiplication. This time every test ended with a green tick.

Accepting the Limits

I had already spent more than two days fixing things here and there, integrating CUDA code, and tackling related issues. I saw two major challenges: first, if I wanted to gain the speed boost, I'd need to move the whole regression module inside CUDA, otherwise time spent transporting data back and forth between CPU and GPU would eat up all the gains. To do that, I would have to write everything in C, which defeats my initial purpose of learning Rust and Machine Learning as a whole.

The second challenge was that I had become very rusty with the C language as I hadn't touched it in almost 16 years. This meant I would have to learn C thoroughly just to write the CUDA code, delaying my original learning journey with Rust and ML.

Hence, I accepted that I wouldn't work on the GPU at that stage. Down the line, if I really feel the need for it, I'll revisit the topic. Until then, I'd be content running small datasets with longer execution times.

Little did I know, the next day would bring one of my biggest facepalm moments in my programming journey...

Part 3: Accelerating Calculations using GPU

Palash Kanti Kundu — Thu, 04 Dec 2025 06:23:46 +0000

The success of the logistic regression gave me a confidence boost, but the program's performance felt like a punch to the gut. The naive matrix multiplication module was taking forever to complete even on a small dataset like (321 * 6). No programmer enjoys staring at the console for the program to finish its execution.

I have a CUDA-enabled GPU sitting in my machine gathering dust. I had avoided configuring it for development for two years out of sheer laziness. But, to scale up I had to conquer my laziness. I had to make my GPU perform the heavy math.

I spent some time searching for alternatives. Many search results pointed to ndarray, a high-performance Rust library for working with multi-dimensional arrays. It's a great library, but using it out of the box felt like cheating, like another black box of "magic" to me. My motive was to learn from scratch; any ready-made solution would have defeated that purpose. I stumbled upon rust-cuda, a toolkit for integrating GPU programming into Rust.

Another rabbit hole opened up...

The Setup

To unlock the potential of those CUDA cores, the first step was setting up the environment. And on Windows, that was not an easy task. The setup was the single most time-consuming part of the whole process.

I spent 3 to 4 hours just to see my GPU actually doing something.

I installed CUDA Toolkit
I installed cmake
I installed MSVC 19
I installed Visual C++ build tools
I installed Microsoft Visual Studio
Finally I was able to run the cuBLAS matrix multiplication program

Phew!!!

Initial setup completed. The effort was worth it. I spent an hour playing with the CUDA sample programs, just for the fun of it. Multiplying a 1280 * 960 matrix by a 960 * 640 one took just 0.619 msec, and the GPU did not even break a sweat. I tried to go even higher, multiplying a 12800 * 9600 matrix by a 9600 * 6400 matrix. I was amazed by the performance. It took only 392.912 msec.
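To put those timings in perspective, they can be converted into arithmetic throughput. The sketch below assumes the usual convention of counting 2 * m * k * n floating-point operations for an (m x k) by (k x n) product; the helper names are made up for illustration.

```rust
/// Floating-point operations for an (m x k) by (k x n) matrix product,
/// counted as k multiply-add pairs for each of the m * n outputs.
fn matmul_flops(m: u64, k: u64, n: u64) -> u64 {
    2 * m * k * n
}

/// Throughput in TFLOP/s for an operation count and a time in milliseconds.
fn tflops(flops: u64, millis: f64) -> f64 {
    flops as f64 / (millis / 1000.0) / 1e12
}

fn main() {
    let small = tflops(matmul_flops(1280, 960, 640), 0.619);
    let large = tflops(matmul_flops(12800, 9600, 6400), 392.912);
    println!("1280x960 by 960x640:     {small:.2} TFLOP/s");
    println!("12800x9600 by 9600x6400: {large:.2} TFLOP/s");
}
```

That works out to roughly 2.5 TFLOP/s for the smaller product and roughly 4 TFLOP/s for the larger one. Whether the sample's timing includes the host/device copies is not stated here, so treat these as rough figures.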

For the first time in two years, I finally used my NVIDIA GPU the exact way I always intended to use it.

Unveiling the Secret

I was curious to know how the GPU delivers that speed. I knew the high-level answer: parallelization, in the style of Single Instruction, Multiple Data. However, I was not sure of the mechanism. I dove deep into CUDA architecture resources, including the Programming Guide, the NVIDIA Developer YouTube channel and many other YouTube videos.

Disclaimer: I am not a hardware major and the following is a very vague understanding. I may be completely wrong on the specifics.

The GPU (device) is treated like an external device that communicates with the CPU (host). Before the host can use the device, it asks the device to allocate memory. Once memory is allocated on the device, the input data is copied over. The host then asks the device to perform the intended operation by passing a reference to the compiled kernel routine (the function that runs on the GPU). This is done via a kernel launch. Once the kernel has finished executing, the results are copied back from device to host.

+----------------+                                 +----------------+
|      HOST      |                                 |     DEVICE     |
|    (CPU Code)  |                                 |    (GPU Code)  |
+----------------+                                 +----------------+
        |                                                    ^
        | 1. Allocate Device Memory (cudaMalloc)             |
        |--------------------------------------------------->|
        |                                                    | (d_a, d_b, d_c pointers returned)
        | 2. Copy Data Host -> Device (cudaMemcpy)           |
        |      (e.g., a -> d_a, b -> d_b)                    |
        |--------------------------------------------------->|
        |                                                    |
        | 3. Launch Kernel                                   |
        |      (vectorSub<<<grid, block>>>(d_a, d_b, d_c, n))|
        |--------------------------------------------------->|
        |                                                    | 4. Kernel Execution
        | (Host can do other work here, if asynchronous)     |    - Multiple threads execute vectorSub
        |                                                    |    - Accessing d_a, d_b, writing to d_c
        |                                                    |<----------------------------------+
        |                                                    |                                   |
        | 5. Synchronize (cudaDeviceSynchronize)             |                                   |
        |<---------------------------------------------------|                                   |
        |                                                    | (Kernel completes, results in d_c)
        | 6. Copy Data Device -> Host (cudaMemcpy)           |
        |      (e.g., d_c -> c)                              |
        |<---------------------------------------------------|
        |                                                    |
        | 7. Free Device Memory (cudaFree)                   |
        |--------------------------------------------------->|
        |                                                    |
+----------------+                                 +----------------+

The GPU essentially manages a hierarchy of threads: Grids, Thread Blocks and Threads. Each thread executes the same kernel on different data. A single kernel launch can span thousands or even millions of threads. This is where the speed comes from. The GPU schedules those threads based on the launch parameters set by the programmer, such as the number of blocks and the number of threads per block.

The host launches a kernel onto this massive system of threads, and each thread runs the same operation. As long as each thread can work independently of the others, they can all run simultaneously.

+----------------------------+
|          Grid              |
|  +---------------------+   |
|  |   Thread Block 0    |   |
|  |  [T0][T1][T2][T3]   |   |
|  +---------------------+   |
|  +---------------------+   |
|  |   Thread Block 1    |   |
|  |  [T0][T1][T2][T3]   |   |
|  +---------------------+   |
|  ...                       |
+----------------------------+
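The way each thread finds its own piece of data can be emulated on the CPU. The sketch below is only an illustration: two nested loops play the role of blocks and threads, and the function name is made up. On a real GPU the loop bodies would run in parallel instead of one after another.

```rust
/// CPU emulation of a 1D CUDA launch: every (block, thread) pair computes
/// its own global index and handles at most one element.
fn emulate_vector_sub(a: &[f32], b: &[f32], grid_dim: usize, block_dim: usize) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let mut out = vec![0.0f32; n];
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Same formula a CUDA kernel uses:
            // blockIdx.x * blockDim.x + threadIdx.x
            let idx = block_idx * block_dim + thread_idx;
            // Threads past the end of the data do nothing.
            if idx < n {
                out[idx] = a[idx] - b[idx];
            }
        }
    }
    out
}

fn main() {
    // 2 blocks of 4 threads, as in the diagram, covering 7 elements.
    let a = [10.0f32, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0];
    let b = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    let out = emulate_vector_sub(&a, &b, 2, 4);
    println!("{out:?}");
}
```

Eight threads are launched for seven elements, so the last thread fails the bounds check and writes nothing. No thread reads another thread's output, which is what makes the work safe to run in parallel.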

The Rust Integration Attempt

When I got bored with playing around, I took a step forward. It was time to look at some more advanced Rust. NVIDIA does not provide an official Rust library for CUDA programming, so I had to wire up the bindings on my own. I spun up a small new project for the POC.

I started with rust-cuda. After spending hours on it without any success, I resorted to other libraries. A quick Google search gave me a few other options. I tried cust, rust-gpu and some other random internet suggestions. cust worked.

I wrote my first ever CUDA program - a simple vector subtraction.

#include <cuda.h>
#include <cuda_runtime.h>


extern "C" __global__
void vectorSub(const float* a, const float* b, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = a[idx] - b[idx];
    }
}

use cust::prelude::*;

// Load compiled CUDA kernels from PTX module
let ptx = include_str!("../kernels/vector_sub.ptx");
let module = Module::from_ptx(ptx, &[])?;

// Retrieve kernel function references
let sub = module.get_function("vectorSub")?;

// Allocate GPU memory buffers (every vector holds `rows` elements)
let d_a = DeviceBuffer::from_slice(&a)?;                  // First input vector
let d_b = DeviceBuffer::from_slice(&b)?;                  // Second input vector
let d_c = DeviceBuffer::from_slice(&vec![0.0f32; rows])?; // Result vector

// Launch one thread per element; the kernel's bounds check ignores any extras
unsafe {
    launch!(sub<<<(grid_rows, 1, 1), block1d, 0, stream>>>(
        d_a.as_device_ptr(),
        d_b.as_device_ptr(),
        d_c.as_device_ptr(),
        rows as i32
    ))?;
}

// Kernel launches are asynchronous: wait for the stream before reading back
stream.synchronize()?;
d_c.copy_to(&mut c)?;

The successful run and a correct result indicated that the Rust FFI to execute the hardware kernel actually worked. Again with another confidence boost, I started writing the next program - a naive matrix multiplication.
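The snippet above uses `grid_rows` and `block1d` without showing where they come from. One common way to choose them is ceiling division, sketched below under an assumed block size of 256 threads. Both the block size and the helper name are assumptions, not values from the original project.

```rust
/// Number of blocks needed so that blocks * block_size >= n (ceiling division).
/// Hypothetical helper for illustration.
fn blocks_for(n: u32, block_size: u32) -> u32 {
    (n + block_size - 1) / block_size
}

fn main() {
    // Assumed threads per block; a common choice, not from the original code.
    let block1d = 256u32;
    for rows in [1u32, 256, 257, 1000] {
        let grid_rows = blocks_for(rows, block1d);
        println!(
            "rows = {rows:4} -> grid_rows = {grid_rows}, threads launched = {}",
            grid_rows * block1d
        );
    }
}
```

Rounding up means the launch usually contains more threads than elements, which is why the kernel needs its `idx < n` guard.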

#include <cuda.h>
#include <cuda_runtime.h>

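// CUDA kernel for naive matrix multiplication: C = A * B (row-major).
// A is numARows x numAColumns, B is numAColumns x numBColumns.
extern "C" __global__ void matrix_mul(const float *A, const float *B, float *C, int numARows, int numAColumns, int numBColumns)
{
    // Each thread computes one element C[Row][Col]
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    // Skip threads that fall outside the result matrix
    if (Row < numARows && Col < numBColumns)
    {
        float Cvalue = 0.0f;

        // Dot product of row `Row` of A with column `Col` of B
        for (int k = 0; k < numAColumns; ++k)
        {
            Cvalue += A[Row * numAColumns + k] * B[k * numBColumns + Col];
        }

        C[Row * numBColumns + Col] = Cvalue;
    }
}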

I started small, multiplying a 2 * 1 matrix by a 1 * 2 matrix, and validated the results against CPU-computed results. Next, I multiplied two matrices filled with random numbers, validated those results too, and performed a few more test runs.
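A validation step like this can be sketched as a CPU reference product plus a tolerance-based comparison. The function names and the tolerance below are assumptions for illustration; in a real check the second slice would be the buffer copied back from the GPU.

```rust
/// Naive row-major CPU matrix product: (m x k) times (k x n) -> (m x n).
fn matmul_cpu(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            let mut sum = 0.0f32;
            for i in 0..k {
                sum += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
    c
}

/// True when both slices have the same length and every pair of elements
/// differs by at most `tol`, scaled by the larger magnitude (at least 1).
/// Any NaN makes the comparison fail.
fn approx_eq(x: &[f32], y: &[f32], tol: f32) -> bool {
    x.len() == y.len()
        && x.iter().zip(y).all(|(p, q)| {
            let scale = p.abs().max(q.abs()).max(1.0);
            (p - q).abs() <= tol * scale
        })
}

fn main() {
    // The 2 x 1 by 1 x 2 case from the text.
    let a = [2.0f32, 3.0];
    let b = [4.0f32, 5.0];
    let cpu = matmul_cpu(&a, &b, 2, 1, 2);
    // Stand-in for the result copied back from the device.
    let gpu_like = vec![8.0f32, 10.0, 12.0, 15.0];
    println!("cpu = {cpu:?}, matches = {}", approx_eq(&cpu, &gpu_like, 1e-5));
}
```

Comparing with a tolerance rather than exact equality matters because a GPU may accumulate the sums in a different order than the CPU, which changes the last few bits of a float result.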

Once these small matrices were done, it was time to go bigger. Multiplying a big matrix of 1024 * 1024 by another 1024 * 1024 took only a few milliseconds to complete.

Conclusion

Finally, I did it. I overcame my laziness, put in the effort and achieved what I planned 2 years ago. I was very excited to integrate this setup into my main Machine Learning repo.

If only I had known that a nightmare was waiting to happen...