Forem: zekcrates

Build a Deep Learning Library from Scratch Using NumPy (Part 5: Optimizers)

zekcrates — Sat, 17 Jan 2026 17:30:27 +0000

Introduction

In the previous post, we built the nn.Module, which gave us:

A clean way to define layers
Automatic parameter tracking
Training and evaluation modes

At this point, we can:

Build models
Compute losses
Compute gradients via backpropagation

But there’s one critical piece missing. We still don't know how to update the parameters.

Without parameter updates, our neural network is just a very expensive random number generator.

In this post, we’ll build:

A base Optimizer class
SGD (Stochastic Gradient Descent)
Adam

Want to skip the series and read the full book now?

Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/

How Do Gradients Update Weights?

Let's look at a simple MNIST training loop.

logits = model(x_batch)
loss = softmax_loss(logits, y_one_hot)

# clear old gradients
for p in model.parameters():
    p.grad = None

# compute gradients
loss.backward()

# update parameters
for p in model.parameters():
    p.data = p.data - lr * p.grad

What are we doing here?

loss.backward() computes gradients.
p.grad tells us which direction increases error.
We move in the opposite direction to reduce loss.
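
To make the "move in the opposite direction" idea concrete, here is a minimal sketch in plain NumPy, without babygrad. The function (w - 3)^2, the starting point, and the learning rate are all made up for illustration; the gradient is written by hand instead of coming from loss.backward().

```python
import numpy as np

def loss_fn(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_fn(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = np.float64(0.0)
lr = 0.1
losses = []
for _ in range(50):
    losses.append(loss_fn(w))
    # Step against the gradient, exactly like p.data = p.data - lr * p.grad
    w = w - lr * grad_fn(w)

print(w)  # very close to 3.0
```

Each step shrinks the distance to the minimum by a constant factor, so the loss goes down on every iteration.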

Every model needs a training loop, and every training loop needs a parameter update. Writing that update by hand each time is repetitive and easy to get wrong.
And what if we want a smarter update rule? We shouldn't have to rewrite the whole training loop for a single change.

Optimizer Base Class

What should every optimizer do?

Hold model parameters
Update parameters using gradients
Clear previous gradients at each step.

So we always do:

optimizer.zero_grad()
loss.backward()
optimizer.step()

class Optimizer:
    def __init__(self, params):
        self.params = params

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        raise NotImplementedError

Stochastic Gradient Descent (SGD)

SGD is the simplest weight update rule and is often used for simpler models.
The rule itself fits on one line:

param = param - lr * grad

Where:

lr is the learning rate


class SGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            p.data -= self.lr * p.grad

Example:

optimizer = SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
loss.backward()
optimizer.step()
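
If you want to watch the optimizer converge without the rest of the library, the sketch below runs the same zero_grad / step cycle on a small least-squares problem. Param is a hypothetical stand-in for a babygrad Parameter (just .data and .grad), the data is random, and the gradient is computed by hand where loss.backward() would normally go.

```python
import numpy as np

class Param:
    """Hypothetical stand-in for a babygrad Parameter: only .data and .grad."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = None

class SGD:
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            p.grad = None

    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            p.data -= self.lr * p.grad

# Toy problem: recover true_w from y = X @ true_w
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = Param(np.zeros(3))
opt = SGD([w], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    residual = X @ w.data - y
    # Hand-written gradient of the mean squared error (stands in for loss.backward())
    w.grad = 2.0 * X.T @ residual / len(y)
    opt.step()

print(w.data)  # close to [1.0, -2.0, 0.5]
```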

SGD works, but it has limitations.

Why We Need Better Optimizers

SGD has no memory: it does not remember how the weights were updated in the past. It also applies the same learning rate to every parameter, even though not all parameters behave the same.

Some need:

Big steps
Small updates
Momentum from past gradients

Adam

Adam tracks the gradient history.

Normal gradients for direction
Squared gradients for magnitude.

Adam uses this information to adapt each parameter’s learning rate, making weight updates smarter.

class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.
    """
    def __init__(
        self,
        params,
        lr=0.001,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps

        self.t = 0
        self.m = {}
        self.v = {}

    def step(self):
        self.t += 1

        for p in self.params:
            if p.grad is None:
                continue

            grad = p.grad

            # First moment
            m = self.m.get(p, 0) * self.beta1 + (1 - self.beta1) * grad
            self.m[p] = m

            # Second moment
            v = self.v.get(p, 0) * self.beta2 + (1 - self.beta2) * (grad ** 2)
            self.v[p] = v

            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)

            p.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)
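
It can help to run the first Adam step by hand. The sketch below uses the same formulas as step() on two made-up gradients of very different size. At t = 1 the bias correction exactly undoes the (1 - beta) shrinkage, so m_hat equals the gradient and v_hat equals its square, and the step has magnitude of roughly lr for both parameters.

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
grad = np.array([100.0, -0.01])   # two parameters with very different gradient scales

t = 1
m = (1 - beta1) * grad            # first moment, starting from 0
v = (1 - beta2) * grad ** 2       # second moment, starting from 0

m_hat = m / (1 - beta1 ** t)      # bias correction: recovers grad at t = 1
v_hat = v / (1 - beta2 ** t)      # bias correction: recovers grad ** 2 at t = 1

adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)
sgd_step = lr * grad              # what plain SGD would subtract instead

print(adam_step)  # roughly [ 0.001 -0.001]
print(sgd_step)   # [ 1.e-01 -1.e-05]
```

Plain SGD moves the first parameter ten thousand times further than the second one, while Adam moves both by about the same amount.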

Conclusion

In this post, we implemented:

Optimizer base class
SGD, the simplest optimizer
Adam, a powerful optimizer

Let's Build a Deep Learning Library from Scratch Using NumPy (Part 4: nn.Module)

zekcrates — Mon, 12 Jan 2026 12:24:12 +0000

Introduction

In the previous parts of this series, we built:

Part 1: The Tensor class and computation graph
Part 2: Automatic differentiation from scratch
Part 3: A simple neural network trained on MNIST

In Part 3, we manually defined each weight in our SimpleNN class. This works for small networks, but imagine building a 50-layer model: you'd have to track every single parameter by hand!

In this post, we’ll build the foundation of nn.Module, a system to organize layers, manage parameters, and support training and evaluation modes. This is the core of every modern deep learning library.

Missed Part 1?

Read it here: https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-1-32p9

Parameter class

A Parameter is just a Tensor that is marked as learnable. This makes it easy to distinguish weights from intermediate tensors.

In the Tensor class, requires_grad defaults to False.

from babygrad import Tensor
class Parameter(Tensor):
    def __init__(self, data, *args, **kwargs):
        kwargs['requires_grad'] = True
        super().__init__(data, *args, **kwargs)

# Example
a = Tensor([1, 2, 3])
print(a.requires_grad)  # False

b = Parameter(a)
print(b.requires_grad)  # True

Whenever you see self.weight = Parameter(...), you immediately know it’s a learnable parameter.

Finding Parameters

Now that we have the Parameter class, it would be nice to collect all the parameters of a model.

A model might store parameters in attributes, lists, or dictionaries. To collect them automatically, we define _get_parameters().

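def _get_parameters(data):
    params = []
    if isinstance(data, Parameter):
        return [data]
    # Recurse into nested modules (e.g. a Linear stored on a model).
    # Module is defined below; the name is only looked up at call time.
    if isinstance(data, Module):
        return data.parameters()
    if isinstance(data, dict):
        for value in data.values():
            params.extend(_get_parameters(value))
    if isinstance(data, (list, tuple)):
        for item in data:
            params.extend(_get_parameters(item))
    return params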

This helper method will be used in the Module class to get all the parameters.
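
Here is a small standalone sketch of the same recursive idea, so you can see what comes back for a nested structure. The Parameter class below is a hypothetical stand-in that only carries a name, and collect is an illustrative copy of the helper, not the library function itself.

```python
class Parameter:
    """Hypothetical stand-in for babygrad's Parameter (a learnable Tensor)."""
    def __init__(self, name):
        self.name = name

def collect(data):
    """Recursively gather Parameter objects from nested dicts, lists and tuples."""
    if isinstance(data, Parameter):
        return [data]
    found = []
    if isinstance(data, dict):
        for value in data.values():
            found.extend(collect(value))
    elif isinstance(data, (list, tuple)):
        for item in data:
            found.extend(collect(item))
    return found

w, b, g = Parameter("w"), Parameter("b"), Parameter("g")
attrs = {
    "weight": w,
    "extras": [b, {"gain": g}],
    "shared": w,          # the same object stored twice
    "in_features": 10,    # non-parameters are ignored
}

names = [p.name for p in collect(attrs)]
print(names)  # ['w', 'b', 'g', 'w']
```

Note that the shared parameter shows up twice. That is why Module.parameters() filters the result by id() before returning it.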

Module Base Class

Every layer (Linear, ReLU, BatchNorm) needs to:

Manage parameters: Find all weights inside itself
Define a forward pass: Process input data
Track training state: Know if it's training or evaluating


from typing import List

class Module:
    def __init__(self):
        self.training = True

    def parameters(self) -> List[Parameter]:
        params = _get_parameters(self.__dict__)
        unique_params = []
        seen_ids = set()
        for p in params:
            if id(p) not in seen_ids:
                unique_params.append(p)
                seen_ids.add(id(p))
        return unique_params 
    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

Now, once a model is defined, we can just call model.parameters().

We also need to toggle self.training. A helper, _get_modules(), finds all the modules inside a model so that train() and eval() can flip the flag on each of them.

def _get_modules(obj) -> list['Module']:
    modules = []
    if isinstance(obj, Module):
        # Include the module itself, then look inside it for nested modules.
        return [obj] + _get_modules(obj.__dict__)

    if isinstance(obj, dict):
        for value in obj.values():
            modules.extend(_get_modules(value))

    if isinstance(obj, (list, tuple)):
        for item in obj:
            modules.extend(_get_modules(item))

    return modules 

class Module:
    # code
    def train(self):
        self.training = True 
        for m in _get_modules(self.__dict__):
            m.training = True 
    def eval(self):
        self.training = False 
        for m in _get_modules(self.__dict__):
            m.training = False

We also added new methods (train,eval) inside the Module class.
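
Why does a layer care about the training flag? Dropout, one of the layers covered in the book, is the classic case: it randomly zeroes activations while training and does nothing during evaluation. The function below is only a NumPy sketch of that behavior under assumed names (dropout, p, training, rng); it is not the babygrad layer.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero a fraction p of entries while training, no-op otherwise."""
    if not training or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(x.shape) >= p
    # Scale the survivors so the expected value of the output matches the input
    return x * keep_mask / (1.0 - p)

x = np.ones((2, 4))
train_out = dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, p=0.5, training=False)

print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # unchanged: all ones
```

Inside a Module, the training argument would simply be self.training, which is exactly what train() and eval() flip.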

Now that the base class is done, we can finally create some layers.

Stateless Layers: ReLU, Sigmoid, Tanh, Flatten

Some layers don’t have learnable parameters. They just apply a function to the input.

from babygrad import ops

class ReLU(Module):
    def forward(self, x):
        return ops.relu(x)

class Sigmoid(Module):
    def forward(self, x):
        return ops.sigmoid(x)

class Tanh(Module):
    def forward(self, x):
        return ops.tanh(x)

class Flatten(Module):
    def forward(self, x):
        # Keep the batch dimension and flatten everything else
        batch_size = x.shape[0]
        return x.reshape(batch_size, -1)

NOTE: The ops functions used here (ops.relu, ops.sigmoid, ops.tanh) were covered in Part 2.

Linear Layer

The linear layer is the most basic building block, and it does a surprising amount of the work.
It is stateful: it has a weight and a bias that we need to learn.

class Linear(Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
             dtype: str = "float32"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(Tensor.randn(in_features, out_features))
        self.bias = None
        if bias:
            self.bias = Parameter(Tensor.zeros(1, out_features))

    def forward(self, x: Tensor) -> Tensor:
        # (bs,in) @ (in,out) -> (bs,out)
        out = x @ self.weight
        if self.bias is not None:
            # (1,out) -> (bs,out) #broadcasted
            out += self.bias.broadcast_to(out.shape)
        return out
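
The shape comments in forward are worth checking once by hand. This sketch repeats the same computation with raw NumPy arrays instead of Tensors; the sizes (a batch of 5, 784 inputs, 100 outputs) are just example values.

```python
import numpy as np

batch_size, in_features, out_features = 5, 784, 100
rng = np.random.default_rng(0)

x = rng.normal(size=(batch_size, in_features))
weight = rng.normal(size=(in_features, out_features))
bias = np.zeros((1, out_features))

out = x @ weight                             # (5, 784) @ (784, 100) -> (5, 100)
bias_b = np.broadcast_to(bias, out.shape)    # (1, 100) -> (5, 100)
out = out + bias_b

print(out.shape)  # (5, 100)
```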

Sequential: Stacking Layers

If we have multiple modules, calling forward on each of them by hand gets tedious:

class MyModel(Module):
    def __init__(self):
        super().__init__()
        self.w1 = Linear(10, 20)
        self.w2 = Linear(20, 30)
        self.relu = ReLU()
        self.final = Linear(30, 10)
    def forward(self, x):
        x = self.w1(x)
        x = self.relu(x)
        x = self.w2(x)
        x = self.relu(x)
        x = self.final(x)
        return x

Sequential solves this problem by chaining modules together automatically.
The output of one layer becomes the input to the next.

class Sequential(Module):
    def __init__(self, *modules):
        super().__init__()
        self.modules = modules
    def forward(self, x):
        for m in self.modules:
            x = m(x)
        return x

Now we can simply do:

model = Sequential(
    Linear(10, 20),
    ReLU(),
    Linear(20, 30),
    ReLU(),
    Linear(30, 10)
)
logits = model(x)

MSE Loss


class MSELoss(Module):
    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        """
        Calculates the Mean Squared Error.
        """
        diff = pred - target
        sq_diff = diff * diff
        return sq_diff.sum() / Tensor(target.data.size)
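
MSELoss just averages the squared differences over every element. As a quick sanity check, the sketch below follows the same three steps with plain NumPy arrays (made-up numbers) and compares the result against np.mean.

```python
import numpy as np

pred = np.array([[2.5, 0.0], [2.0, 8.0]])
target = np.array([[3.0, -0.5], [2.0, 7.0]])

diff = pred - target
mse_manual = (diff * diff).sum() / target.size   # same steps as MSELoss.forward
mse_numpy = np.mean((pred - target) ** 2)

print(mse_manual)  # 0.375
```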

Conclusion

In this post, we built the core nn.Module abstraction that lets us define layers, manage parameters automatically, and compose models cleanly.

With this foundation in place, we can now focus on training instead of bookkeeping.

In the next post, we’ll implement optimizers and use them to train models built with nn.Module.

More layers (BatchNorm, LayerNorm, Dropout) are covered in the book!

Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/

Let’s Build a Deep Learning Library from Scratch Using NumPy (Part 3: Training MNIST)

zekcrates — Fri, 09 Jan 2026 11:16:10 +0000

Introduction

In Part 1, we built the Tensor class and a computation graph.
In Part 2, we implemented automatic differentiation from scratch.

In this part, we will:

Use our custom autograd engine (babygrad)
Build a small neural network
Train it on the MNIST handwritten digits dataset

Missed Part 1?

Read it here: https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-1-32p9

Loading MNIST data

You can easily download the MNIST data. The files look like this:

# data/ 
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz

Parsing the images file should give an array of shape (num_images, 784), and the labels file an array of shape (num_images,), with labels in the range 0–9.

import struct
import gzip
import numpy as np

def parse_mnist(image_filename, label_filename):
    with gzip.open(image_filename, 'rb') as f:
        magic, num_images, rows, cols = struct.unpack('>IIII', f.read(16))
        image_data = np.frombuffer(f.read(), dtype=np.uint8)
        images = image_data.reshape(num_images, rows * cols)

    with gzip.open(label_filename, "rb") as f:
        magic, num_labels = struct.unpack('>II', f.read(8))
        labels = np.frombuffer(f.read(), dtype=np.uint8)

    images = images.astype(np.float32) / 255.0
    return images, labels
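
If you don't have the real files at hand, you can still test the parser. The sketch below writes a tiny fake dataset in the same IDX layout (a big-endian header followed by raw bytes) and parses it back. The file names and pixel values are invented; 2051 and 2049 are the magic numbers the IDX format uses for image and label files. The parser is repeated so the block runs on its own.

```python
import gzip
import os
import struct
import tempfile

import numpy as np

def parse_mnist(image_filename, label_filename):
    # Same logic as the parser above, repeated so this block is self-contained
    with gzip.open(image_filename, "rb") as f:
        magic, num_images, rows, cols = struct.unpack(">IIII", f.read(16))
        images = np.frombuffer(f.read(), dtype=np.uint8).reshape(num_images, rows * cols)
    with gzip.open(label_filename, "rb") as f:
        magic, num_labels = struct.unpack(">II", f.read(8))
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    return images.astype(np.float32) / 255.0, labels

# Fake dataset: 3 images of 28x28 pixels and 3 labels
fake_pixels = (np.arange(3 * 28 * 28) % 256).astype(np.uint8)
fake_labels = np.array([7, 0, 9], dtype=np.uint8)

tmp_dir = tempfile.mkdtemp()
image_path = os.path.join(tmp_dir, "fake-images-idx3-ubyte.gz")
label_path = os.path.join(tmp_dir, "fake-labels-idx1-ubyte.gz")

with gzip.open(image_path, "wb") as f:
    f.write(struct.pack(">IIII", 2051, 3, 28, 28))  # image header: magic, count, rows, cols
    f.write(fake_pixels.tobytes())
with gzip.open(label_path, "wb") as f:
    f.write(struct.pack(">II", 2049, 3))            # label header: magic, count
    f.write(fake_labels.tobytes())

images, labels = parse_mnist(image_path, label_path)
print(images.shape, labels.shape)  # (3, 784) (3,)
```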

Now that we have our data, let's create a simple model to train on it.

It will have only 2 weight matrices (W1, W2), with these shapes:

W1 = (784, 100)
W2 = (100, 10)

from babygrad import Tensor, ops

class SimpleNN:
    def __init__(self, input_size, hidden_size, num_classes):
        self.W1 = Tensor(
            np.random.randn(input_size, hidden_size).astype(np.float32)
            / np.sqrt(hidden_size),
            requires_grad=True
        )
        self.W2 = Tensor(
            np.random.randn(hidden_size, num_classes).astype(np.float32)
            / np.sqrt(num_classes),
            requires_grad=True
        )
    def forward(self, x):
        z1 = x @ self.W1
        a1 = ops.relu(z1)
        logits = a1 @ self.W2
        return logits
    def parameters(self):
        return [self.W1, self.W2]

The model takes flattened images of size (784,). With a batch of 5 images, the input has shape (5,784) and first goes through W1:

(5,784) @ (784,100) -> (5,100)
x @ self.W1

Note: The @ operator uses our custom matmul op implemented in Part 2.

After passing through W1, the batch has shape (5,100).

We only have 10 labels (digits 0–9), so the model must output 10 values per image, one score for each digit.

We will send the above result to W2.

(5,100) @ (100,10) = (5,10)
logits = a1 @ self.W2

Now we have (5,10).

The output (5, 10) contains raw class scores (logits) for each digit.

But logits alone aren’t enough: we need a loss function.

The loss is the quantity we will decrease by updating W1 and W2 using their gradients.

Loss function

def softmax_loss(logits: Tensor, y_true: Tensor) -> Tensor:
    batch_size = logits.shape[0]
    log_sum_exp = ops.log(ops.exp(logits).sum(axes=1))
    z_y = (logits * y_true).sum(axes=1)
    loss = log_sum_exp - z_y
    return loss.sum() / batch_size

This gives us a single loss value.
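
The formula log_sum_exp - z_y is the usual cross-entropy in disguise. The NumPy sketch below (not the babygrad version) checks it against an explicit softmax on made-up logits. It also subtracts the row maximum before calling exp, a common trick that keeps large logits from overflowing; the function above does not do this.

```python
import numpy as np

def softmax_loss_np(logits, y_one_hot):
    # log-sum-exp with the row max subtracted first to avoid overflow in exp
    row_max = logits.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.exp(logits - row_max).sum(axis=1)) + row_max[:, 0]
    z_y = (logits * y_one_hot).sum(axis=1)       # logit of the true class
    return (log_sum_exp - z_y).mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_one_hot = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# Reference: explicit softmax, then -log(probability of the true class)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
reference = -np.log((probs * y_one_hot).sum(axis=1)).mean()

loss = softmax_loss_np(logits, y_one_hot)
print(np.isclose(loss, reference))  # True
```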

We now have everything

Data
Model
Loss function

The only thing left is to train this model on the data. We can do this by adding a training loop.

Training loop

def train_epoch(model, X_train, y_train, lr, batch_size):
    for i in range(0, X_train.shape[0], batch_size):
        x_batch = Tensor(X_train[i:i+batch_size])
        y_batch_np = y_train[i:i+batch_size]

        logits = model.forward(x_batch)

        num_classes = logits.shape[1]
        y_one_hot = np.zeros((y_batch_np.shape[0], num_classes),
                             dtype=np.float32)
        y_one_hot[np.arange(y_batch_np.shape[0]), y_batch_np] = 1
        y_one_hot = Tensor(y_one_hot)

        loss = softmax_loss(logits, y_one_hot)

        # Zero gradients
        for p in model.parameters():
            p.grad = None

        # Backprop
        # Gradients are calculated.
        loss.backward()

        # Parameters (w1,w2) updated using gradients
        for p in model.parameters():
            p.data -= lr * p.grad

        preds = logits.data.argmax(axis=1)
        acc = np.mean(preds == y_batch_np)

        print(
            f"Loss: {loss.data:.4f}, Accuracy: {acc*100:.2f}%"
        )

This loop:

Builds the computation graph.
Calls backward().
Updates parameters using gradients.
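
One step in the loop that is easy to skim past is the one-hot encoding of the labels. Here it is in isolation, with three made-up labels, so you can see what the fancy indexing does.

```python
import numpy as np

y_batch = np.array([3, 0, 9])        # integer class labels
num_classes = 10

y_one_hot = np.zeros((y_batch.shape[0], num_classes), dtype=np.float32)
# Row i gets a 1 in column y_batch[i]
y_one_hot[np.arange(y_batch.shape[0]), y_batch] = 1

print(y_one_hot.argmax(axis=1))  # [3 0 9]
```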

After training for some time, the per-batch metrics look like this:

  Batch  13: Loss = 0.2163, Accuracy = 96.09%
  Batch  14: Loss = 0.1742, Accuracy = 96.09%
  Batch  15: Loss = 0.1630, Accuracy = 96.88%
  Batch  16: Loss = 0.1862, Accuracy = 95.31%
  Batch  17: Loss = 0.1637, Accuracy = 96.09%
  Batch  18: Loss = 0.1812, Accuracy = 95.31%
  Batch  19: Loss = 0.2156, Accuracy = 94.53%
  Batch  20: Loss = 0.1259, Accuracy = 99.22%

Conclusion

At this point, we’ve successfully trained a neural network on MNIST using an autograd engine built entirely from scratch.

This is the core of every modern deep learning library.
Everything that comes next (optimizers, deeper networks, CNNs) will be built on top of this same foundation.

Let’s Build a Deep Learning Library from Scratch Using NumPy (Part 2: Autograd)

zekcrates — Thu, 08 Jan 2026 15:50:36 +0000

Introduction

Welcome back! In Part 1, we built the foundation of our deep learning library babygrad:

Defined the Tensor class.
Tracked operations and inputs.

Now, it’s time to implement automatic differentiation, the core of deep learning. This is what allows libraries like PyTorch to compute gradients automatically so we can train models.

Missed Part 1?

Read it here: https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-1-32p9

What is a Gradient?

A gradient measures how much the final error changes when we change a specific input.

What is Automatic Differentiation?

Automatic differentiation works by:

Building a computation graph as we apply operations to Tensors.
Breaking complex functions into simpler operations.
Using the chain rule to propagate gradients from outputs back to inputs.

a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a * b

If we call c.backward() it will use the chain rule to calculate gradients with respect to the parents.

# derivative of a*b with respect to a is b
# derivative of a*b with respect to b is a
dc_da = b
dc_db = a
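
You don't have to take these derivatives on faith. A finite-difference check in plain Python nudges each input a little and measures how much the output moves. The step size h is an arbitrary small number chosen for this sketch.

```python
def f(a, b):
    return a * b

a, b = 2.0, 3.0
h = 1e-6

# Central differences approximate the partial derivatives numerically
dc_da = (f(a + h, b) - f(a - h, b)) / (2 * h)
dc_db = (f(a, b + h) - f(a, b - h)) / (2 * h)

print(round(dc_da, 4), round(dc_db, 4))  # 3.0 2.0
```

The same trick is a handy way to test any backward method you write later.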

Just like multiplication, we can have many different operations in a computation graph, each responsible for computing its own gradients using the chain rule.

Using just a few simple operations like this, and defining their backward passes, we can surprisingly build and train full models.

Functions: Forward and Backward

Each operation in babygrad is a class (like Add or Mul) that implements:

forward: calculates the result from inputs
backward: calculates the gradients of the inputs

Since every Function has the same methods (forward, backward), let's define a common base class called Function.

class Function:
    def __call__(self, *inputs):
        requires_grad = any(t.requires_grad for t in inputs)
        # "inputs" are Tensors; ".data" gives us the underlying NumPy arrays
        inputs_data = [t.data for t in inputs]
        # Call the "forward" method, which each subclass
        # implements on its own, with the raw arrays.
        output_data = self.forward(*inputs_data)
        # Wrap around in Tensor.
        output_tensor = Tensor(output_data, requires_grad=requires_grad)

        if requires_grad:
            # Who are its parents? 
            output_tensor._op = self
            output_tensor._inputs = inputs
        return output_tensor

    def forward(self, *args):
        raise NotImplementedError()

    def backward(self, out_grad, node):
        # node is the output Tensor this op produced; its inputs are node._inputs
        # out_grad is the upstream gradient
        raise NotImplementedError()

Any operation like Add or Mul will be a subclass of Function. When we call it, __call__ runs the subclass's forward() method under the hood and wraps the result in a new Tensor.

Let's look at an example of the Function class in action.

class Add(Function):
    def forward(self, a, b):
        return a + b
    def backward(self, out_grad, node):
        # The gradient of a + b with respect to a is 1,
        # and with respect to b it is also 1.
        # We pass the upstream gradient (out_grad) on to our parents
        # so they can compute their own gradients too.
        return out_grad * 1, out_grad * 1

def add(a, b):
    return Add()(a, b)

Each backward function takes out_grad, which is the upstream gradient. The derivative of an op's output with respect to one of its inputs is called the local gradient (for Add it is 1 for both inputs).
backward multiplies each local gradient by the upstream gradient (out_grad) and returns the results. That multiplication is the chain rule.

Each operation in the computation graph will calculate its local derivatives and that's it.
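
Add is almost too simple to show the pattern, because its local gradients are just 1. Multiplication makes it clearer. The two functions below are a standalone NumPy sketch with made-up names (mul_forward, mul_backward); in babygrad this logic would live in a Mul subclass of Function with the backward(out_grad, node) signature.

```python
import numpy as np

def mul_forward(a, b):
    return a * b

def mul_backward(out_grad, a, b):
    # Local gradients: d(a*b)/da = b and d(a*b)/db = a.
    # Chain rule: multiply each one by the upstream gradient.
    return out_grad * b, out_grad * a

a = np.array([2.0])
b = np.array([3.0])
c = mul_forward(a, b)

out_grad = np.ones_like(c)          # upstream gradient dc/dc = 1
grad_a, grad_b = mul_backward(out_grad, a, b)
print(grad_a, grad_b)  # [3.] [2.]
```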

With a small change inside the Tensor class, we can override Python's + operator so that it uses our add(a, b) function instead of the default behavior.

class Tensor:
    # overrides "+" with our function.
    def __add__(self, other):
        if not isinstance(other, Tensor):
            other = Tensor(other)
        return add(self, other)

a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a + b          # uses our add function
print(c.data)      # [5.]
print(c._op)       # <Add object>
print(c._inputs)   # (a, b)

You should implement all the other Functions too. They are all available in the book.

Sub
Div
Pow
Transpose
Reshape
BroadcastTo
Summation
MatMul
Negate
Log
Exp
Relu
Sigmoid
Tanh
Sqrt
Abs

Most of them have simple derivatives. For the forward pass, if a NumPy function is available, use it.

The most rewarding ones to implement are Reshape, BroadcastTo, MatMul and Summation, because they do most of the heavy lifting.

Now that we have written our Functions and hooked them into the Tensor class through operator overrides, one final method remains: backward.

When we call loss.backward(), this is the method that runs, so we will implement it inside the Tensor class.

Backward Pass

The backward pass computes gradients from outputs back to inputs using the chain rule.

Each backward method knows its local derivative.
Gradients are accumulated from all child nodes.

When we call loss.backward() (or c.backward() from the example above), what should the gradient of loss or c itself be? Each Function.backward() takes out_grad and node.
So what should the very first out_grad be?

Simple! It should just be 1, because the derivative of the output with respect to itself is 1. This is our first upstream gradient.

We have already mentioned that the backward pass uses the computation graph. By the time we call loss.backward(), the graph has already been constructed. We just need to walk from the output back to the inputs and call backward on each node.

So we need 2 operations:

Traverse the computation graph
Call node._op.backward for all nodes.

Traverse

We will use depth-first search (DFS) to traverse the computation graph.

class Tensor:
    def backward(self, grad=None):
        if not self.requires_grad:
            raise RuntimeError(
                "Cannot call backward on a tensor that does not require gradients."
            )

        # Store the nodes from the graph
        topo_order = []
        visited = set()
        def build_topo(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node._inputs:
                    build_topo(parent)
                topo_order.append(node)
        build_topo(self)

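import babygrad

# requires_grad=True is needed: otherwise no graph is recorded
# and e.backward() raises the RuntimeError above.
a = babygrad.Tensor([1, 2, 3], requires_grad=True)
b = babygrad.Tensor([2, 3, 4], requires_grad=True)
c = a + b
d = babygrad.Tensor([3, 4, 5], requires_grad=True)
e = c + d
e.backward()
# first traverses the computation graph
# topo_order = [a, b, c, d, e]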

For the above equations we get the topo_order=[a,b,c,d,e].
All the nodes participating in the graph are present in the topo_order.
Now we just need to use this topo_order in order to calculate gradients for all.
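
To try the traversal without the rest of the Tensor class, here is the same DFS on a tiny stand-in. Node is a hypothetical class that only stores a name and its inputs; the graph mirrors the a, b, c, d, e example.

```python
class Node:
    """Hypothetical stand-in for a Tensor: only a name and its inputs."""
    def __init__(self, name, inputs=()):
        self.name = name
        self._inputs = inputs

def topo_sort(root):
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent in node._inputs:
            visit(parent)
        order.append(node)   # appended only after all of its inputs

    visit(root)
    return order

a, b, d = Node("a"), Node("b"), Node("d")
c = Node("c", (a, b))
e = Node("e", (c, d))

order = [n.name for n in topo_sort(e)]
print(order)                  # ['a', 'b', 'c', 'd', 'e']
print(list(reversed(order)))  # ['e', 'd', 'c', 'b', 'a']
```

The reversed list is the order in which the backward pass visits the nodes: every node comes before the inputs it was built from.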

Calculating gradients

class Tensor:
    def backward(self, grad=None):
        if not self.requires_grad:
            raise RuntimeError(
                "Cannot call backward on a tensor that does not require gradients."
            )

        # Build the "Family Tree" in order (Topological Sort)
        topo_order = []
        visited = set()
        def build_topo(node):
            ...  # same body as in the Traverse section above
        build_topo(self)

        # Initialize the Ledger
        grads = {}
        if grad is None:
            # The "output" gradient: dL/dL = 1
            grads[id(self)] = Tensor(np.ones_like(self.data))
        else:
            grads[id(self)] = _ensure_tensor(grad)

        # Walk the Graph Backwards
        for node in reversed(topo_order):
            out_grad = grads.get(id(node))
            if out_grad is None:
                continue

            # Store the final result in the .grad attribute
            if node.grad is None:
                node.grad = np.array(out_grad.data, copy=True)

            else:
                node.grad += out_grad.data

            # Propagate to Parents
            if node._op:
                # finally calling node._op.backward()
                input_grads = node._op.backward(out_grad, node)
                if not isinstance(input_grads, tuple):
                    input_grads = (input_grads,)

                for i, parent in enumerate(node._inputs):
                    if parent.requires_grad:
                        parent_id = id(parent)
                        if parent_id not in grads:
                            # First time seeing this parent
                            grads[parent_id] = input_grads[i]
                        else:
                            #  Sum the gradients!
                            grads[parent_id] = grads[parent_id] + input_grads[i]

A few questions

Why traverse the nodes in reverse?

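In the backward pass we go from the output to the inputs, that is, from children to parents. A node's gradient is only complete once all of its children have been processed, so we calculate the gradients for the children first and then for the parents.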

Why use grads = {} ?

Inside the loop we already have out_grad, the gradient of the current node. We then call node._op.backward(out_grad, node) to get the gradients of the node's parents.

At that point we have calculated the gradients for the parents, but we haven't stored them in parent.grad yet.

So we need to keep them somewhere in the meantime, and that's where the ledger comes in. Later in the loop, when node is that parent, the line node.grad = np.array(out_grad.data, copy=True) is what stores parent.grad.
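
The ledger matters most when one tensor is used more than once, because its gradient is the sum over every path. The sketch below is a deliberately tiny scalar engine, separate from babygrad, with made-up names (Node, local_grads). It computes y = x*x + x at x = 3, where dy/dx = 2x + 1 = 7.

```python
class Node:
    """Hypothetical scalar node: value, inputs, and one local gradient per input."""
    def __init__(self, value, inputs=(), local_grads=()):
        self.value = value
        self._inputs = inputs
        self.local_grads = local_grads
        self.grad = None

def mul(x, y):
    return Node(x.value * y.value, (x, y), (y.value, x.value))

def add(x, y):
    return Node(x.value + y.value, (x, y), (1.0, 1.0))

def backward(root):
    order, visited = [], set()

    def visit(node):
        if id(node) not in visited:
            visited.add(id(node))
            for parent in node._inputs:
                visit(parent)
            order.append(node)

    visit(root)

    grads = {id(root): 1.0}     # the ledger, seeded with d(root)/d(root) = 1
    for node in reversed(order):
        out_grad = grads[id(node)]
        node.grad = out_grad
        for parent, local in zip(node._inputs, node.local_grads):
            # Contributions from different paths are summed in the ledger
            grads[id(parent)] = grads.get(id(parent), 0.0) + out_grad * local

x = Node(3.0)
y = add(mul(x, x), x)           # y = x*x + x
backward(y)
print(x.grad)  # 7.0
```

Here x receives three contributions (two from the multiplication, one from the addition), and they are only added up correctly because they wait in the ledger until x itself is visited.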

At this point, we’ve built a complete autograd engine from scratch.

Everything that comes next (optimizers, neural networks, training loops, and datasets) will be built on top of this exact system and will depend heavily on the code written here.

What's next?

In the next part we will use the code we have written to train MNIST.

Let’s Build a Deep Learning Library from Scratch Using NumPy (Part 1)

zekcrates — Mon, 05 Jan 2026 13:28:41 +0000

Introduction

We are going to build our own PyTorch-like deep learning library from scratch. We will call it babygrad. We are starting with a blank file and NumPy, and we won't stop until we have a functional autograd engine and train some decent models (MNIST,CNN) using it.

What This Series Is About

This is not a deep learning “how to use libraries” tutorial.

Instead, we’ll:

Start from a blank Python file
Wrap NumPy arrays
Track operations
Build a computation graph
Implement backpropagation ourselves

What is a Tensor?

In any deep learning library (such as PyTorch), the Tensor is the fundamental building block. Our Tensor is a wrapper around NumPy arrays. Think of the NumPy array as the raw data and the Tensor as the container that holds it and also remembers its history: the parents it was created from. This history becomes important when we do backpropagation.

a = babygrad.Tensor([1,2,3])
b = babygrad.Tensor([1,2,3])
c = a+b 
print(c._inputs)
>>> [Tensor(1,2,3), Tensor(1,2,3)]
print(c._op)
>>> <babygrad.ops.Add object at 0x7f0cfcfcc3a0>

Who created C? A and B. So they become the inputs of C.
What happened between A and B that led to C? The + operation.
This becomes the op of C. Together they are now part of the computation graph, which will be used for backpropagation.

What is a Computation graph?

A graph that shows

Numbers (Tensors) as nodes.
Operations (ops) as nodes.
Edges showing how data flows from inputs → operations → outputs.

Implementing the Tensor class.

Let's look at the backbone of our library, the Tensor class.
A Tensor needs to:

Store its data
Store its gradient (computed later)
Know whether it should track gradients
Remember how it was created


import numpy as np
NDArray = np.ndarray
def _ensure_tensor(val):
    return val if isinstance(val, Tensor) else Tensor(val,
     requires_grad=False)
class Tensor:
    def __init__(self, data, *, device=None, dtype="float32",
     requires_grad=False):
        if isinstance(data, Tensor):
            if dtype is None:
                dtype = data.dtype
            self.data = data.numpy().astype(dtype)
        elif isinstance(data, np.ndarray):
            self.data = data.astype(dtype if dtype is not None else data.dtype)
        else:
            self.data = np.array(data, dtype=dtype if dtype is not None
             else "float32")
        self.grad = None
        self.requires_grad = requires_grad
        self._op = None       
        self._inputs = []     
        self._device = device if device else  "cpu"    

    @property
    def shape(self):
        return self.data.shape
    @property
    def dtype(self):
        return self.data.dtype
    @property
    def ndim(self):
        return self.data.ndim
    @property
    def size(self):
        return self.data.size    
    @property
    def device(self):
        return self._device        
    def __repr__(self):
        return f"Tensor({self.data}, requires_grad={self.requires_grad})"    
    def __str__(self):
        return str(self.data)
    def backward(self):
        # We will implement this in the next part.
        raise NotImplementedError

No matter what form the input data takes, we always convert it into an NDArray.

The input data could be

A Tensor
An NDArray
A List

Whichever one we get, self.data always ends up as an NDArray.
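
The two non-Tensor cases come down to two NumPy calls. The helper below is a hypothetical, stripped-down mirror of those branches in Tensor.__init__ (the name to_ndarray is made up), just to show that a list and an integer array both come out as float32 arrays.

```python
import numpy as np

def to_ndarray(data, dtype="float32"):
    """Hypothetical helper mirroring the raw-input branches of Tensor.__init__."""
    if isinstance(data, np.ndarray):
        return data.astype(dtype)
    return np.array(data, dtype=dtype)

from_list = to_ndarray([1, 2, 3])
from_array = to_ndarray(np.array([1, 2, 3]))   # integer input is cast to float32

print(type(from_list).__name__, from_list.dtype)    # ndarray float32
print(type(from_array).__name__, from_array.dtype)  # ndarray float32
```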

What is requires_grad?
requires_grad controls whether a Tensor participates in the computation graph.

requires_grad=True → gradients will be tracked
requires_grad=False → no gradients will be computed

Simple Methods for Tensor class

We'd like to introduce some simple methods for the Tensor class that will come in handy later.

.numpy()

Once we have a Tensor, we sometimes want the raw NumPy array out of it without risking accidental changes to the Tensor's own data. Returning a copy takes care of that.

class Tensor: 
    def numpy(self):
        return self.data.copy()
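
Why copy() and not just return self.data? The NumPy-only sketch below shows the difference: writing to a copy leaves the original alone, while writing through a second name for the same array changes it.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0], dtype=np.float32)

copied = data.copy()    # what numpy() returns: independent memory
copied[0] = 100.0

alias = data            # returning self.data directly would hand out this instead
alias[1] = -1.0

print(data.tolist())    # [1.0, -1.0, 3.0]
```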

.detach()

Sometimes we want the same values in a Tensor that is not part of the computation graph. So we just create a new Tensor with the same data and requires_grad=False.

class Tensor:
    def detach(self):
        return Tensor(self.data, requires_grad=False)

What's Next?

We've laid the foundation!
In Part 2, we'll implement the heart of deep learning: Autograd
https://dev.to/zekcrates/lets-build-a-deep-learning-library-from-scratch-using-numpy-part-2-autograd-i17

Liked it?

Read it for free online: https://zekcrates.quarto.pub/deep-learning-library/

Learn to build a Deep Learning library from scratch in Python and NumPy (autograd, CNNs, ResNets) [free]

zekcrates — Thu, 01 Jan 2026 09:04:49 +0000

Read it for free here: https://zekcrates.quarto.pub/deep-learning-library/
I created this project to strip away the "black box" of modern frameworks and implement the core stuff from a blank file using only Python and NumPy.

What You’ll Build

Autograd Engine – automatic differentiation from scratch
Neural Network Modules – layers, activations, and loss functions
Optimizers – SGD, Adam
Model Persistence – save and load trained models
Training Loop – a clean, reusable trainer
Datasets & Dataloaders – batching, shuffling, iteration
Parameter Initialization – common initialization strategies
Convolutional Neural Networks (CNNs) – build and train conv nets

What You’ll Train (Using the Library)

MNIST – fully train a neural network from scratch
Simple CNN on MNIST
CNN on CIFAR-10
Simple ResNet on CIFAR-10

The project is intended as a conceptual and fun reference rather than a production framework.

Feedback on correctness, scope, or missing pieces would be very welcome.