Manas Patil
Understanding Backpropagation and Gradient Descent in Neural Networks

Suppose you're building a neural network, maybe even a deep and complex one. You’ve set up the layers, initialized weights, and defined activation functions. But here's a question:

Can this neural network make accurate predictions without tuning?
The short answer: No.

Building a neural network is just the start; the real magic lies in fine-tuning its parameters (the weights and biases) so it actually learns from the data.


Why Do We Need to Optimize Weights and Biases?

To make better predictions, we want the network to minimize the loss and maximize accuracy.

Fine-tuning means updating weights and biases over multiple epochs using feedback from the output (loss) to improve the next prediction. This continues until we reach a point of minimal loss.

But how exactly do we update the weights and biases?


Early (Inefficient) Ideas for Updating Weights

1. Random Weights & Biases

Try random values for every weight and bias, compute the loss, and repeat, keeping whichever set of values gives the lowest loss so far.

❌ This is inefficient and slow.
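A minimal sketch of this random-search idea on a toy one-parameter model (the quadratic loss and the search range are assumptions, just for illustration):

```python
import random

def loss(w):
    # Toy loss curve: lowest at w = 3.
    return (w - 3) ** 2

random.seed(0)
best_w, best_loss = None, float("inf")
for _ in range(1000):
    w = random.uniform(-10, 10)          # pick a completely random weight
    if loss(w) < best_loss:              # keep it only if it beats the best so far
        best_w, best_loss = w, loss(w)

print(best_w, best_loss)
```

Even on this one-dimensional toy it takes many tries to land near the minimum; with millions of parameters, pure random search is hopeless.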

2. Guided Random Tweaks

Set random weights → calculate loss → tweak the weights slightly → keep the tweak if the loss decreases, discard it if it doesn’t.

✅ Better than the first, but still not optimal.
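The same toy loss, but with guided tweaks: perturb the current weight slightly and keep the change only if it lowers the loss (a sketch under the same illustrative assumptions as before):

```python
import random

def loss(w):
    # Toy loss curve: lowest at w = 3.
    return (w - 3) ** 2

random.seed(0)
w = random.uniform(-10, 10)                # start from a random weight
for _ in range(1000):
    candidate = w + random.gauss(0, 0.1)   # small tweak near the current weight
    if loss(candidate) < loss(w):          # keep it only if the loss decreases
        w = candidate

print(w, loss(w))
```

This converges, but blindly: half the tweaks are wasted because we have no idea which direction is downhill.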

Now let’s go one level higher...


Enter: Backpropagation + Gradient Descent

Let’s say we want to go downhill on a loss curve to reach the lowest point (minimum loss). To do this efficiently, we need to know:

  • The direction to move in → determined by the slope (derivative)
  • How much to move → controlled by the learning rate

This is where calculus enters the picture.
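On a simple curve this looks like the following: the sign of the derivative tells us which way is downhill, and the learning rate scales the step (the curve and η = 0.1 are illustrative choices):

```python
def loss(w):
    # The "hill": a loss curve with its lowest point at w = 3.
    return (w - 3) ** 2

def slope(w):
    # Derivative of the loss: negative left of 3, positive right of 3.
    return 2 * (w - 3)

eta = 0.1            # learning rate: how far each step moves
w = 8.0              # starting point, high up on the curve
for _ in range(50):
    w = w - eta * slope(w)   # step against the slope, i.e. downhill

print(w)   # very close to 3, the bottom of the curve
```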


Gradient Descent — The Update Rule

To minimize the loss L, we update the weights and biases using:

$$w = w - \eta \cdot \frac{\partial L}{\partial w}$$

$$b = b - \eta \cdot \frac{\partial L}{\partial b}$$

Where:

  • $\frac{\partial L}{\partial w}$ = derivative of the loss with respect to the weight
  • $\frac{\partial L}{\partial b}$ = derivative of the loss with respect to the bias
  • $\eta$ = the learning rate, which controls the step size
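A one-line numeric check of the rule, using made-up values for the weight, gradient, and learning rate:

```python
# Hypothetical numbers, just to see the rule in action:
w, grad, eta = 0.5, 0.2, 0.05    # current weight, dL/dw, learning rate
w_new = w - eta * grad           # step opposite the gradient

print(w_new)   # ≈ 0.49: the weight moved slightly downhill
```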

But how do we get these derivatives?

That’s where Backpropagation comes in.

Backpropagation: Going Back to Learn Better

Let’s take an example:

  • A neural network with 3 neurons in the hidden layer
  • A single output neuron in the final layer
  • Loss function: Mean Squared Error, $L = (y_{pred} - y_{true})^2$


To update the weights, we apply the chain rule from calculus to compute the gradients.


Gradients for Hidden Layer Weights

Neuron 1

$$\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{11}}$$

$$\frac{\partial L}{\partial w_{12}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{12}}$$

$$\frac{\partial L}{\partial w_{13}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{13}}$$

$$\frac{\partial L}{\partial w_{14}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{14}}$$

Neuron 2

$$\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_{21}}$$

$$\frac{\partial L}{\partial w_{22}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_{22}}$$

$$\frac{\partial L}{\partial w_{23}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_{23}}$$

$$\frac{\partial L}{\partial w_{24}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_{24}}$$

Neuron 3

$$\frac{\partial L}{\partial w_{31}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{31}}$$

$$\frac{\partial L}{\partial w_{32}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{32}}$$

$$\frac{\partial L}{\partial w_{33}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{33}}$$

$$\frac{\partial L}{\partial w_{34}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{34}}$$

Note: a weight such as $w_{21}$ flows through neuron 2, so we apply the chain rule using neuron 2’s pre-activation $z_2$ and activation $a_2$.


Gradients for Biases

Bias of Neuron 1

$$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1}$$

Bias of Neuron 2

$$\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2}$$

Bias of Neuron 3

$$\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial b_3}$$

🎯 We calculate these gradients by moving backward through the network — and that’s why the algorithm is called Backpropagation.
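Putting the per-neuron formulas into code: a minimal NumPy sketch of the example network, assuming sigmoid hidden activations and a linear output neuron (the post leaves the activations unspecified, so those are choices made here). A finite-difference check at the end confirms the chain-rule gradient:

```python
import numpy as np

# Sketch of backprop for the example network: 4 inputs, 3 sigmoid hidden
# neurons, one linear output neuron, squared-error loss. The activations
# are assumptions; the post does not fix them.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one training example with 4 features
y_true = 1.0

W = rng.normal(size=(3, 4))       # hidden weights: W[i, j] plays the role of w_{ij}
b = rng.normal(size=3)            # hidden biases b_1..b_3
v = rng.normal(size=3)            # output-neuron weights on a_1..a_3
c = rng.normal()                  # output-neuron bias

# --- Forward pass ---
z = W @ x + b                     # z_i = sum_j w_ij * x_j + b_i
a = sigmoid(z)                    # a_i = sigma(z_i)
y = v @ a + c                     # linear output neuron
L = (y - y_true) ** 2             # loss

# --- Backward pass: the chain rule, factor by factor ---
dL_dy = 2 * (y - y_true)          # dL/dy
dy_da = v                         # dy/da_i = v_i (linear output)
da_dz = a * (1 - a)               # sigma'(z_i) = a_i * (1 - a_i)
delta = dL_dy * dy_da * da_dz     # dL/dz_i, shared by every w_ij and b_i

dL_dW = np.outer(delta, x)        # dL/dw_ij = delta_i * x_j, since dz_i/dw_ij = x_j
dL_db = delta                     # dz_i/db_i = 1

# Numerical check for one weight: nudge w_11 and compare slopes.
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
L2 = (v @ sigmoid(W2 @ x + b) + c - y_true) ** 2
print(dL_dW[0, 0], (L2 - L) / eps)   # the two slopes should agree
```

Note how `delta` is computed once per neuron and reused for all of that neuron’s weights and its bias, exactly as in the formulas above.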


Applying Gradient Descent

Once we compute the gradients, we use the gradient descent update rule:

$$w = w - \eta \cdot \frac{\partial L}{\partial w}$$

$$b = b - \eta \cdot \frac{\partial L}{\partial b}$$

Where $\eta$ is the learning rate (e.g., 0.05), which you can adjust.

After applying the update:

  • ✅ The loss decreases (for a well-chosen learning rate)
  • ✅ Predictions gradually improve

Repeat this process across epochs to gradually optimize the model.
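The epoch loop can be sketched on a toy model, here a single sigmoid neuron fitting one example (the data, learning rate, and epoch count are illustrative, not from the post):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y_true = 2.0, 0.8       # one training example
w, b, eta = 0.1, 0.0, 0.5  # initial weight, bias, and learning rate

losses = []
for epoch in range(200):
    z = w * x + b
    a = sigmoid(z)                  # forward pass: the prediction
    L = (a - y_true) ** 2
    losses.append(L)
    dL_da = 2 * (a - y_true)        # backprop through the loss...
    da_dz = a * (1 - a)             # ...the activation...
    w -= eta * dL_da * da_dz * x    # ...and the weighted sum (dz/dw = x)
    b -= eta * dL_da * da_dz        # dz/db = 1

print(losses[0], losses[-1])   # the loss shrinks epoch after epoch
```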


Before vs After Optimization

📍 Before Gradient Descent: You're sitting somewhere randomly on the loss curve.
📍 After One Update: You’ve moved closer to the local minimum.

This is the power of gradient descent — it helps your model learn how to learn.


Why This Matters

The heart of deep learning is optimization.
And the heart of optimization is:

Backpropagation + Gradient Descent

Without these, neural networks would just be complex calculators spitting out random values.

Thanks to backpropagation, networks can learn from their mistakes, and thanks to gradient descent, they can improve continuously.
