<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rijul Rajesh</title>
    <description>The latest articles on Forem by Rijul Rajesh (@rijultp).</description>
    <link>https://forem.com/rijultp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1207862%2Ff06197aa-d585-4225-94a6-86243238376f.png</url>
      <title>Forem: Rijul Rajesh</title>
      <link>https://forem.com/rijultp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rijultp"/>
    <language>en</language>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 5: Connecting Reward, Derivative, and Step Size</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 15 May 2026 20:33:05 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-5-connecting-reward-derivative-2dk</link>
      <guid>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-5-connecting-reward-derivative-2dk</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0"&gt;previous article&lt;/a&gt;, we explored the reward system in reinforcement learning&lt;/p&gt;

&lt;p&gt;In this article, we will begin calculating the &lt;strong&gt;step size&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Update
&lt;/h2&gt;

&lt;p&gt;In this example, the learning rate is &lt;strong&gt;1.0&lt;/strong&gt;. Multiplying the derivative by the learning rate gives the step size:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gaf2bcc6qnlfoo97obt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gaf2bcc6qnlfoo97obt.png" alt=" " width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, the step size is &lt;strong&gt;0.5&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next, we update the bias by subtracting the step size from the old bias value &lt;strong&gt;0.0&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzivto11xex0mqmywf4a5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzivto11xex0mqmywf4a5.png" alt=" " width="595" height="162"&gt;&lt;/a&gt;&lt;/p&gt;
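&lt;p&gt;The update above can be sketched in a few lines of Python. The derivative value of 0.5 is implied by the article's step size of 0.5 at a learning rate of 1.0:&lt;/p&gt;

```python
learning_rate = 1.0
derivative = 0.5        # implied: step size 0.5 at learning rate 1.0
old_bias = 0.0

step_size = learning_rate * derivative   # 0.5
new_bias = old_bias - step_size          # 0.0 - 0.5 = -0.5
print(new_bias)                          # -0.5
```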




&lt;h2&gt;
  
  
  After the Update
&lt;/h2&gt;

&lt;p&gt;Now that the bias has been updated, we run the model again.&lt;/p&gt;

&lt;p&gt;The new probability of going to &lt;strong&gt;Place B&lt;/strong&gt; becomes &lt;strong&gt;0.4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu35h5j6ol9yrzupnf2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu35h5j6ol9yrzupnf2l.png" alt=" " width="644" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means the probability of going to &lt;strong&gt;Place A&lt;/strong&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxr65rqf5vyhy634ii90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxr65rqf5vyhy634ii90.png" alt=" " width="391" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb7teytxtd16uruvt201.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb7teytxtd16uruvt201.png" alt=" " width="657" height="175"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing Again
&lt;/h2&gt;

&lt;p&gt;We now pick a random number between 0 and 1, and get &lt;strong&gt;0.9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd024wbn97kuxo2q5jr16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd024wbn97kuxo2q5jr16.png" alt=" " width="672" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;0.9&lt;/strong&gt; falls in the region representing &lt;strong&gt;Place B&lt;/strong&gt;, we choose Place B.&lt;/p&gt;
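&lt;p&gt;This selection step can be sketched in Python. The probabilities and the draw of 0.9 come from the article; using &lt;code&gt;bisect&lt;/code&gt; to locate the draw on the 0-to-1 line is just one convenient way to write it:&lt;/p&gt;

```python
import bisect

p_b = 0.4                       # updated probability of Place B
p_a = 1.0 - p_b                 # Place A occupies the segment from 0 to 0.6
draw = 0.9                      # the random number picked in the article

# bisect reports which segment of the 0-to-1 line the draw falls into
places = ["Place A", "Place B"]
choice = places[bisect.bisect([p_a], draw)]
print(choice)                   # Place B
```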

&lt;h2&gt;
  
  
  Computing the Gradient Again
&lt;/h2&gt;

&lt;p&gt;To update the bias, we again compute the derivative.&lt;/p&gt;

&lt;p&gt;First, we assume that choosing &lt;strong&gt;Place B&lt;/strong&gt; was the correct action.&lt;/p&gt;

&lt;p&gt;So ideally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ugtinbby8aquykebvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ugtinbby8aquykebvq.png" alt=" " width="328" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we compute the difference between the ideal value &lt;strong&gt;1.0&lt;/strong&gt; and the actual value &lt;strong&gt;0.4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using this, we calculate the derivative with respect to the bias, which gives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a4sdax4nf8qdpfaovud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a4sdax4nf8qdpfaovud.png" alt=" " width="479" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmm7f8258xmctg9gt30h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmm7f8258xmctg9gt30h.png" alt=" " width="579" height="365"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Checking the Reward
&lt;/h2&gt;

&lt;p&gt;Now we check whether this was actually a good decision.&lt;/p&gt;

&lt;p&gt;Place B gives a large portion of fries, but our hunger input is &lt;strong&gt;0.0&lt;/strong&gt;, meaning we are not very hungry.&lt;/p&gt;

&lt;p&gt;So this was not a good choice.&lt;/p&gt;

&lt;p&gt;Therefore, the reward is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reward = -1&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Updating with Reward
&lt;/h2&gt;

&lt;p&gt;We multiply the derivative by the reward:&lt;/p&gt;

&lt;p&gt;-0.6 × -1 = 0.6&lt;/p&gt;

&lt;p&gt;So the updated derivative becomes &lt;strong&gt;0.6&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Step Update
&lt;/h2&gt;

&lt;p&gt;Now we calculate the step size again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbgqh9tj72k0szf15b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbgqh9tj72k0szf15b1.png" alt=" " width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Result
&lt;/h2&gt;

&lt;p&gt;We plug the new bias back into the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ifab68esfkfjvpn7zm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ifab68esfkfjvpn7zm.png" alt=" " width="628" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the probability of going to &lt;strong&gt;Place B&lt;/strong&gt; has decreased.&lt;/p&gt;

&lt;p&gt;This means that when hunger is low, the model is more likely to choose &lt;strong&gt;Place A&lt;/strong&gt;, which is the correct behavior.&lt;/p&gt;

&lt;p&gt;This shows that the reinforcement learning algorithm, specifically &lt;strong&gt;policy gradients&lt;/strong&gt;, is working as expected.&lt;/p&gt;




&lt;p&gt;In the next article, we will explore how to further train the model using different input values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 4: Positive and Negative Rewards</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 13 May 2026 20:46:18 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0</link>
      <guid>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47"&gt;previous article&lt;/a&gt;, we began the process of guessing the ideal output.&lt;/p&gt;

&lt;p&gt;Let us continue with the same example.&lt;/p&gt;

&lt;p&gt;Suppose we receive a &lt;strong&gt;small number of fries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since our hunger level is &lt;strong&gt;0&lt;/strong&gt;, this is actually a good outcome.&lt;/p&gt;

&lt;p&gt;In this case, we should assign a &lt;strong&gt;reward of 1&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Now consider the opposite situation.&lt;/p&gt;

&lt;p&gt;Suppose we receive a &lt;strong&gt;large order of fries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since we are not hungry enough to eat all the fries, this means we made a poor decision.&lt;/p&gt;

&lt;p&gt;In that case, we assign a &lt;strong&gt;reward of -1&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;In general:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any &lt;strong&gt;positive reward&lt;/strong&gt; indicates a good decision&lt;/li&gt;
&lt;li&gt;Any &lt;strong&gt;negative reward&lt;/strong&gt; indicates a bad decision&lt;/li&gt;
&lt;/ul&gt;
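
&lt;p&gt;The fries scenario above can be sketched as a small reward function. The function name, inputs, and exact rule here are hypothetical, chosen only to mirror the example:&lt;/p&gt;

```python
def reward(hunger, portion):
    # A small portion suits low hunger; a large portion suits high hunger.
    good = (portion == "small" and hunger == 0.0) or (portion == "large" and hunger == 1.0)
    return 1 if good else -1

print(reward(0.0, "small"))   # 1: not hungry, small fries, a good outcome
print(reward(0.0, "large"))   # -1: not hungry, too many fries, a bad outcome
```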




&lt;h2&gt;
  
  
  Updating the Derivative with the Reward
&lt;/h2&gt;

&lt;p&gt;We now use this reward to update the derivative.&lt;/p&gt;

&lt;p&gt;To do this, we simply multiply the derivative by the reward.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 1: Correct Decision
&lt;/h3&gt;

&lt;p&gt;If the reward is &lt;strong&gt;1&lt;/strong&gt;, then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqooiniacv6ovw8q9i3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqooiniacv6ovw8q9i3e.png" alt=" " width="360" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The derivative remains unchanged.&lt;/p&gt;

&lt;p&gt;This means the derivative is already pointing in the correct direction.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 2: Incorrect Decision
&lt;/h3&gt;

&lt;p&gt;If the reward is &lt;strong&gt;-1&lt;/strong&gt;, then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld54pyb01v6hd9qdjetd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld54pyb01v6hd9qdjetd.png" alt=" " width="376" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the derivative changes sign.&lt;/p&gt;

&lt;p&gt;This causes the optimization process to move the bias in the opposite direction.&lt;/p&gt;

&lt;p&gt;In other words, the negative reward flips the direction of the update so the neural network can learn from the bad decision.&lt;/p&gt;
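&lt;p&gt;Both cases boil down to one multiplication. A minimal sketch, using a hypothetical derivative value of 0.5:&lt;/p&gt;

```python
derivative = 0.5               # hypothetical derivative value

# Case 1: a reward of 1 leaves the derivative unchanged
print(derivative * 1)          # 0.5

# Case 2: a reward of -1 flips its sign, reversing the update direction
print(derivative * -1)         # -0.5
```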

&lt;p&gt;In the next article, we will explore how to calculate the &lt;strong&gt;step size&lt;/strong&gt; for updating the parameters.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 3: Guessing the Ideal Output</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Mon, 11 May 2026 18:48:10 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47</link>
      <guid>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2"&gt;previous article&lt;/a&gt;, we explored the limitations of backpropagation and why it is not ideal when the correct output values are unknown.&lt;/p&gt;

&lt;p&gt;In this article, we will begin exploring the core ideas behind reinforcement learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Example
&lt;/h2&gt;

&lt;p&gt;Let us begin by assuming that we are &lt;strong&gt;not hungry&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will feed the value &lt;strong&gt;0.0&lt;/strong&gt; into the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrdbq7cqde9q2bzo575.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrdbq7cqde9q2bzo575.png" alt=" " width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The neural network outputs a probability of &lt;strong&gt;0.5&lt;/strong&gt; for going to &lt;strong&gt;Place B&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probability of going to Place B = &lt;strong&gt;p(B) = 0.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Probability of going to Place A = &lt;strong&gt;1 - p(B) = 0.5&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Visualizing the Probabilities
&lt;/h2&gt;

&lt;p&gt;We can represent these probabilities using a line.&lt;/p&gt;

&lt;p&gt;First, we draw a line segment with length &lt;strong&gt;0.5&lt;/strong&gt; to represent the probability of going to &lt;strong&gt;Place A&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Then, we append another line segment to represent the probability of going to &lt;strong&gt;Place B&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ez8dzdjhuk5z06ik5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ez8dzdjhuk5z06ik5z.png" alt=" " width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together, these form a line ranging from &lt;strong&gt;0 to 1&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing an Action
&lt;/h2&gt;

&lt;p&gt;To decide which place to go for a snack, we randomly pick a number between &lt;strong&gt;0 and 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let us pick &lt;strong&gt;0.2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckpjtnj21fhvfgecxti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckpjtnj21fhvfgecxti.png" alt=" " width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;0.2&lt;/strong&gt; falls inside the region representing &lt;strong&gt;Place A&lt;/strong&gt;, we choose to go to Place A.&lt;/p&gt;
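&lt;p&gt;This sampling step can be sketched in Python. The probabilities and the draw of 0.2 come from the example; &lt;code&gt;bisect&lt;/code&gt; simply locates the draw on the 0-to-1 line:&lt;/p&gt;

```python
import bisect

p_a = 0.5                       # Place A occupies the segment from 0 to 0.5
draw = 0.2                      # the number picked in the example

places = ["Place A", "Place B"]
choice = places[bisect.bisect([p_a], draw)]
print(choice)                   # Place A
```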




&lt;h2&gt;
  
  
  Making a Guess About the Correct Action
&lt;/h2&gt;

&lt;p&gt;Now, let us assume that going to &lt;strong&gt;Place A&lt;/strong&gt; when hunger = 0 was the correct decision.&lt;/p&gt;

&lt;p&gt;Ideally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The probability of going to Place A, &lt;strong&gt;p(A)&lt;/strong&gt;, should be &lt;strong&gt;1&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The probability of going to Place B, &lt;strong&gt;p(B)&lt;/strong&gt;, should be &lt;strong&gt;0&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These ideal values are based on our guess about what the correct action should have been.&lt;/p&gt;




&lt;h2&gt;
  
  
  Moving Toward Optimization
&lt;/h2&gt;

&lt;p&gt;Using these guessed ideal values, we can calculate the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ideal probability for &lt;strong&gt;p(A)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the actual probability produced by the neural network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows us to calculate the derivative of the difference with respect to the bias we want to optimize.&lt;/p&gt;




&lt;p&gt;In the next article, we will continue exploring how this optimization process works in reinforcement learning.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 2: Why Backpropagation Is Not Enough</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sun, 10 May 2026 19:53:16 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2</link>
      <guid>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld"&gt;previous article&lt;/a&gt;, we explored an example where reinforcement learning is required and standard methods do not work.&lt;/p&gt;

&lt;p&gt;In this article, we will understand why &lt;strong&gt;policy gradients&lt;/strong&gt; are needed, and why the standard &lt;strong&gt;backpropagation&lt;/strong&gt; method does not work in certain situations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Backpropagation Normally Works
&lt;/h2&gt;

&lt;p&gt;Assume we have the following training data, where the desired outputs are already known:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input (Hunger)&lt;/th&gt;
&lt;th&gt;Output p(B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this data, we can feed the input values into the neural network one at a time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8scwonpe9136c8nap1pi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8scwonpe9136c8nap1pi.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The neural network produces an output, and we compare it with the &lt;strong&gt;ideal output value&lt;/strong&gt; from the training data.&lt;/p&gt;

&lt;p&gt;Using this difference, we can measure how wrong the network is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using Derivatives to Update the Bias
&lt;/h2&gt;

&lt;p&gt;We can calculate these differences for different values of the bias and visualize how the error changes as the bias changes.&lt;/p&gt;

&lt;p&gt;From this graph, we can calculate the &lt;strong&gt;derivative&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the derivative is &lt;strong&gt;negative&lt;/strong&gt;, we shift the bias to the right&lt;/li&gt;
&lt;li&gt;If the derivative is &lt;strong&gt;positive&lt;/strong&gt;, we shift the bias to the left&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The derivative correctly tells us which direction to move because the training data already contains the ideal output values.&lt;/p&gt;

&lt;p&gt;This is the basic idea behind &lt;strong&gt;backpropagation&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem in Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;However, in reinforcement learning, we do not know the ideal output values in advance.&lt;/p&gt;

&lt;p&gt;For example, we do not already know whether choosing Place A or Place B is the correct action.&lt;/p&gt;

&lt;p&gt;Because of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we cannot calculate the difference between the neural network’s output and the ideal output&lt;/li&gt;
&lt;li&gt;without these differences, we cannot calculate derivatives in the normal way&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Different Approach
&lt;/h2&gt;

&lt;p&gt;Instead, we can &lt;strong&gt;guess&lt;/strong&gt; what the ideal outputs should be and use those guesses to estimate the derivatives.&lt;/p&gt;

&lt;p&gt;This idea forms the foundation of &lt;strong&gt;policy gradients&lt;/strong&gt; in reinforcement learning.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore how reinforcement learning and policy gradients help us solve this problem.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 1: Learning Without Correct Answers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 08 May 2026 18:50:27 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld</link>
      <guid>https://forem.com/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld</guid>
      <description>&lt;p&gt;In this article, we will explore &lt;strong&gt;reinforcement learning with neural networks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s start with a simple example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Between Two Snack Places
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazbdd5bnvfcgb508jaai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazbdd5bnvfcgb508jaai.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose it is snack time, and you have to choose between &lt;strong&gt;Place A&lt;/strong&gt; and &lt;strong&gt;Place B&lt;/strong&gt; for fries.&lt;/p&gt;

&lt;p&gt;To make a good decision, we also need to consider &lt;strong&gt;how hungry we are&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Some days we may be very hungry, while on other days we may only want a small snack.&lt;/p&gt;

&lt;p&gt;We also need to consider how many fries each place might serve.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Place B might give a &lt;strong&gt;large quantity of fries&lt;/strong&gt;, which would be great if we were very hungry&lt;/li&gt;
&lt;li&gt;But if we were not that hungry, getting too many fries might not be ideal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting a small amount of fries would not be good if we were extremely hungry&lt;/li&gt;
&lt;li&gt;But it could be perfectly fine if we only wanted a light snack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, it would be useful to have a system that helps decide which place to choose based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;our hunger level&lt;/li&gt;
&lt;li&gt;the possible quantity of fries we might receive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using a Neural Network
&lt;/h2&gt;

&lt;p&gt;To solve this problem, we will use a &lt;strong&gt;neural network&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The neural network takes our &lt;strong&gt;hunger level&lt;/strong&gt; as the input and outputs the probability of choosing &lt;strong&gt;Place B&lt;/strong&gt;, written as &lt;strong&gt;p(B)&lt;/strong&gt;.&lt;/p&gt;
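&lt;p&gt;As a minimal sketch, such a network can be a single neuron with a sigmoid output. The weight and bias values below are invented for illustration; a real network would learn them during training.&lt;/p&gt;

```python
import math

# A toy "network": one input (hunger level between 0 and 1), one neuron,
# and a sigmoid that squashes the result into a probability p(B).
def p_B(hunger, weight=2.0, bias=-1.0):
    # weight and bias are invented for illustration; training would learn them
    z = weight * hunger + bias
    return 1 / (1 + math.exp(-z))

print(p_B(0.1))  # low hunger  -> lower probability of choosing Place B
print(p_B(0.9))  # high hunger -> higher probability of choosing Place B
```

With a positive weight, hungrier days push the output toward Place B, which serves larger portions.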

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Normally, when training a neural network, we start with a training dataset that contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input values&lt;/li&gt;
&lt;li&gt;correct output values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using this data, we can train the network with standard &lt;strong&gt;backpropagation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, in this example, we do not know in advance whether Place A or Place B will serve a large or small quantity of fries.&lt;/p&gt;

&lt;p&gt;Because of this, we do not know what the correct output values should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;In situations where we do not have known output values, we can still train a model using &lt;strong&gt;reinforcement learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of learning from correct answers, the model learns by trying actions and receiving feedback based on how good the outcome was.&lt;/p&gt;
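&lt;p&gt;The feedback idea can be sketched as a toy update rule on a single probability p(B). This is only an illustration of learning from rewards, with a hand-picked learning rate, not a real reinforcement learning algorithm:&lt;/p&gt;

```python
def update(p_b, action, reward, lr=0.1):
    # Move p(B) toward actions that earned a positive reward and away
    # from actions that earned a negative one, then clamp to [0, 1].
    direction = 1.0 if action == "B" else -1.0
    p_b += lr * reward * direction
    return min(max(p_b, 0.0), 1.0)

p = 0.5                                  # start undecided
p = update(p, action="B", reward=1.0)    # B served a good portion -> p(B) rises
p = update(p, action="A", reward=-1.0)   # A disappointed -> p(B) rises again
print(p)
```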

&lt;p&gt;In the next article, we will explore a reinforcement learning algorithm called &lt;strong&gt;policy gradients&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Encoder-Only Transformers: The Foundation of BERT and RAG Retrieval</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Thu, 07 May 2026 18:50:05 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-encoder-only-transformers-the-foundation-of-bert-and-rag-retrieval-4bk8</link>
      <guid>https://forem.com/rijultp/understanding-encoder-only-transformers-the-foundation-of-bert-and-rag-retrieval-4bk8</guid>
      <description>&lt;p&gt;Back in 2017, the first transformer architecture introduced two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;encoder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;decoder&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two parts were connected so they could work together.&lt;/p&gt;

&lt;p&gt;This original design is known as an &lt;strong&gt;encoder–decoder transformer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoders Can Work on Their Own
&lt;/h2&gt;

&lt;p&gt;Over time, researchers realized that the decoder alone was powerful enough for many tasks.&lt;/p&gt;

&lt;p&gt;Using only a decoder, models could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate text&lt;/li&gt;
&lt;li&gt;continue sentences&lt;/li&gt;
&lt;li&gt;perform translation and other language tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we discussed in the article on &lt;a href="https://dev.to/rijultp/understanding-decoder-only-transformers-part-1-masked-self-attention-mf8"&gt;decoder only transformers&lt;/a&gt;, these models form the foundation of systems like ChatGPT.&lt;/p&gt;

&lt;p&gt;These are called &lt;strong&gt;decoder-only transformers&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoders Can Also Work Independently
&lt;/h2&gt;

&lt;p&gt;In a similar way, encoder-based models are also very useful on their own.&lt;/p&gt;

&lt;p&gt;This idea forms the foundation of models like &lt;strong&gt;BERT&lt;/strong&gt; and many others.&lt;/p&gt;

&lt;p&gt;These are called &lt;strong&gt;encoder-only transformers&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Blocks of Encoder-Only Transformers
&lt;/h2&gt;

&lt;p&gt;Encoder-only transformers use the same core components we explored earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Word embeddings&lt;/strong&gt; convert words into numbers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding&lt;/strong&gt; keeps track of word order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-attention&lt;/strong&gt; helps establish relationships between words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wqriy365gvonj5c3isu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wqriy365gvonj5c3isu.png" alt=" " width="549" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When these layers are combined, they create a new representation for each token that captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;meaning&lt;/li&gt;
&lt;li&gt;position&lt;/li&gt;
&lt;li&gt;relationships with other words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These representations are called &lt;strong&gt;context-aware embeddings&lt;/strong&gt; or &lt;strong&gt;contextualized embeddings&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Context-Aware Embeddings Are Useful
&lt;/h2&gt;

&lt;p&gt;Context-aware embeddings can help group together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;similar sentences&lt;/li&gt;
&lt;li&gt;similar paragraphs&lt;/li&gt;
&lt;li&gt;similar documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This capability is one of the foundations of &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG works by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Breaking documents into smaller chunks of text&lt;/li&gt;
&lt;li&gt;Using an encoder-only transformer to generate embeddings for each chunk&lt;/li&gt;
&lt;li&gt;Comparing embeddings to find the most relevant information&lt;/li&gt;
&lt;/ol&gt;
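&lt;p&gt;Step 3 can be sketched with cosine similarity over hypothetical embeddings. The chunk names and 3-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional embeddings for two stored chunks and one query.
chunks = {
    "chunk about pizza ovens": [0.9, 0.1, 0.0],
    "chunk about tax law":     [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Retrieval = pick the chunk whose embedding is closest to the query's.
best = max(chunks, key=lambda name: cosine(chunks[name], query))
print(best)
```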




&lt;h2&gt;
  
  
  Other Uses of Encoder-Only Transformers
&lt;/h2&gt;

&lt;p&gt;Context-aware embeddings can also be used as inputs for machine learning models.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural networks can use them for &lt;strong&gt;sentiment classification&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;logistic regression models can also use them for classification tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That wraps up encoder-only transformers.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore &lt;strong&gt;reinforcement learning in neural networks&lt;/strong&gt;.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Decoder-Only Transformers Part 2: Decoder-Only vs Regular Transformers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 06 May 2026 19:44:50 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-decoder-only-transformers-part-2-decoder-only-vs-regular-transformers-3200</link>
      <guid>https://forem.com/rijultp/understanding-decoder-only-transformers-part-2-decoder-only-vs-regular-transformers-3200</guid>
      <description>&lt;p&gt;In this article, we will look at the differences between a &lt;strong&gt;decoder-only transformer&lt;/strong&gt; and a &lt;strong&gt;standard (encoder–decoder) transformer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Decoder-Only Transformers Work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F248ixjxedsy2xwj0sd2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F248ixjxedsy2xwj0sd2g.png" alt=" " width="659" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A decoder-only transformer uses the &lt;strong&gt;same components&lt;/strong&gt; to process the input prompt and to generate the output.&lt;/p&gt;

&lt;p&gt;It relies on &lt;strong&gt;masked self-attention&lt;/strong&gt;, which considers only the &lt;strong&gt;current word and the words that came before it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Masked self-attention is applied to both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;input prompt&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;generated output&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the entire process is handled by a single stack of decoder layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Regular Transformers Work
&lt;/h2&gt;

&lt;p&gt;A regular transformer has two separate parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;encoder&lt;/strong&gt; to process the input&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;decoder&lt;/strong&gt; to generate the output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When encoding the input, it uses &lt;strong&gt;self-attention&lt;/strong&gt;, not masked self-attention.&lt;br&gt;
This allows each word to attend to &lt;strong&gt;all other words in the input&lt;/strong&gt;, not just the previous ones.&lt;/p&gt;

&lt;p&gt;The decoder then uses &lt;strong&gt;encoder–decoder attention&lt;/strong&gt; to stay connected to the input.&lt;/p&gt;

&lt;p&gt;In this mechanism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;queries&lt;/strong&gt; come from the decoder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;keys and values&lt;/strong&gt; come from the encoder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps the decoder focus on the most important parts of the input while generating output.&lt;/p&gt;
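&lt;p&gt;A minimal sketch of this mechanism, using tiny invented 2-dimensional vectors for one decoder token and two encoder tokens:&lt;/p&gt;

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Invented 2-dimensional vectors standing in for learned projections.
query  = [1.0, 0.0]                  # query comes from the decoder
keys   = [[1.0, 0.2], [0.0, 1.0]]    # keys come from the encoder
values = [[0.5, 0.5], [0.9, 0.1]]    # values come from the encoder

# Score each input word against the decoder's query, then mix the values.
scores   = [sum(q * k for q, k in zip(query, key)) for key in keys]
weights  = softmax(scores)
attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
print(weights)
print(attended)
```

The weights show how strongly the decoder attends to each input word while generating output.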
&lt;h2&gt;
  
  
  What Really Changes Between Them
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Decoder-only transformers use &lt;strong&gt;masked self-attention everywhere&lt;/strong&gt; (for both input and output)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Standard transformers use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;self-attention&lt;/strong&gt; in the encoder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;masked self-attention&lt;/strong&gt; in the decoder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;encoder–decoder attention&lt;/strong&gt; to connect input and output&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
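&lt;p&gt;The masking difference can be visualized as two attention masks for a 3-word sequence (a simplified illustration that ignores the actual attention scores):&lt;/p&gt;

```python
n = 3  # a 3-word sequence

# Encoder-style self-attention: every word may attend to every word.
full_mask = [[True] * n for _ in range(n)]

# Decoder-style masked self-attention: word i may attend only to
# itself and the words before it (positions j <= i).
causal_mask = [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask:
    print(row)
```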

&lt;p&gt;That wraps up decoder-only transformers.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore &lt;strong&gt;encoder-only transformers&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Decoder-Only Transformers Part 1: Masked Self-Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Tue, 05 May 2026 19:25:51 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-decoder-only-transformers-part-1-masked-self-attention-mf8</link>
      <guid>https://forem.com/rijultp/understanding-decoder-only-transformers-part-1-masked-self-attention-mf8</guid>
      <description>&lt;h2&gt;
  
  
  Decoder-Only Transformers
&lt;/h2&gt;

&lt;p&gt;In this article, we will explore &lt;strong&gt;decoder-only transformers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Decoder-only transformers are a specific type of transformer architecture used in systems like ChatGPT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Masked Self-Attention
&lt;/h2&gt;

&lt;p&gt;Decoder-only transformers use a mechanism called &lt;strong&gt;masked self-attention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Masked self-attention works by measuring how similar each word is to itself and to the words that come &lt;strong&gt;before it&lt;/strong&gt; in the sentence.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“The pizza came out of the oven and it tasted good.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When processing the word &lt;strong&gt;“pizza”&lt;/strong&gt;, masked self-attention considers only &lt;strong&gt;“pizza”&lt;/strong&gt; itself and the preceding word &lt;strong&gt;“The”&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Difference
&lt;/h2&gt;

&lt;p&gt;Unlike standard self-attention, masked self-attention &lt;strong&gt;does not allow a word to look at future words&lt;/strong&gt;. It can only attend to the current word and the words that come before it.&lt;/p&gt;

&lt;p&gt;Because of this, it is also called an &lt;strong&gt;auto-regressive method&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An auto-regressive method is a way of predicting values step by step, where each prediction depends on the previous outputs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model uses its past predictions as input to generate the next output&lt;/li&gt;
&lt;li&gt;It builds the final result one step at a time&lt;/li&gt;
&lt;li&gt;Each step depends on what was generated before it, not what comes after&lt;/li&gt;
&lt;/ul&gt;
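&lt;p&gt;The auto-regressive steps above can be sketched with a made-up lookup table standing in for the model's next-word prediction:&lt;/p&gt;

```python
# A made-up lookup table standing in for the model's next-word prediction.
next_word = {
    "<SOS>": "The",
    "The": "pizza",
    "pizza": "tasted",
    "tasted": "good",
    "good": "<EOS>",
}

tokens = ["<SOS>"]
while tokens[-1] != "<EOS>":
    # Each step depends only on what was generated before it.
    tokens.append(next_word[tokens[-1]])

print(" ".join(tokens[1:-1]))
```

Each new word is appended to the sequence and fed back in, so no step ever depends on words that come after it.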




&lt;p&gt;We will explore this concept in more detail in the next article.&lt;/p&gt;






</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>nlp</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Transformers Part 18: Completing the Decoding Process</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Mon, 04 May 2026 17:50:45 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-transformers-part-18-completing-the-decoding-process-p1n</link>
      <guid>https://forem.com/rijultp/understanding-transformers-part-18-completing-the-decoding-process-p1n</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-17-generating-the-output-word-35ol"&gt;previous article&lt;/a&gt;, we generated the first output word from the transformer.&lt;/p&gt;

&lt;p&gt;So far, the translation is correct. However, the decoder does not stop until it produces an &lt;strong&gt;&lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feeding the Output Back into the Decoder
&lt;/h2&gt;

&lt;p&gt;Now, we take the translated word &lt;strong&gt;“vamos”&lt;/strong&gt; and feed it back into a copy of the decoder’s embedding layer to continue the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1moagua4keypus1tzko7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1moagua4keypus1tzko7.png" alt=" " width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like before, we repeat the same steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the &lt;strong&gt;word embeddings&lt;/strong&gt; for &lt;em&gt;vamos&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;positional encoding&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Calculate &lt;strong&gt;self-attention values&lt;/strong&gt; using the same weights used for the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;residual connections&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Compute &lt;strong&gt;encoder–decoder attention&lt;/strong&gt; using the same set of weights&lt;/li&gt;
&lt;li&gt;Add another set of &lt;strong&gt;residual connections&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
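&lt;p&gt;The overall loop can be sketched as follows, with a made-up lookup table standing in for the full decoder pass; the tokens mirror this example's tiny vocabulary:&lt;/p&gt;

```python
def decoder_step(last_token):
    # Stand-in for one full decoder pass: embeddings, positional encoding,
    # attention, residual connections, fully connected layer, softmax.
    predictions = {"<EOS>": "vamos", "vamos": "<EOS>"}  # toy values for this example
    return predictions[last_token]

output = []
token = "<EOS>"                 # decoding starts from the <EOS> token
while True:
    token = decoder_step(token)
    if token == "<EOS>":        # a generated <EOS> ends the sentence
        break
    output.append(token)

print(output)
```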

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsis14fpo4qmcpbvkupv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsis14fpo4qmcpbvkupv2.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Generating the Next Word
&lt;/h2&gt;

&lt;p&gt;Next, we pass the values representing &lt;strong&gt;“vamos”&lt;/strong&gt; through the same &lt;strong&gt;fully connected layer&lt;/strong&gt; and &lt;strong&gt;softmax function&lt;/strong&gt; that we used earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw4pv2pn2u9oww0jeujc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw4pv2pn2u9oww0jeujc.png" alt=" " width="614" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, the decoder outputs the &lt;strong&gt;&lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/strong&gt;, which signals the end of the sentence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Output
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8stgpq826vrem4jw60e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8stgpq826vrem4jw60e.png" alt=" " width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the decoding process is complete.&lt;/p&gt;

&lt;p&gt;We have successfully translated the input phrase using the transformer.&lt;/p&gt;

&lt;p&gt;So, just to recap, the transformer works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Word embeddings&lt;/strong&gt; convert words into numerical representations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding&lt;/strong&gt; keeps track of word order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-attention&lt;/strong&gt; captures relationships within the input and output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoder–decoder attention&lt;/strong&gt; connects input and output, ensuring important information is preserved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual connections&lt;/strong&gt; help different components focus on specific tasks and improve training&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;In the next article, we will start exploring &lt;strong&gt;decoder-only transformers&lt;/strong&gt;.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Multi-Head Attention in Transformers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sun, 03 May 2026 20:08:43 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-multi-head-attention-in-transformers-gj1</link>
      <guid>https://forem.com/rijultp/understanding-multi-head-attention-in-transformers-gj1</guid>
      <description>&lt;p&gt;Self-attention already helps a transformer understand relationships between words using Query, Key, and Value. But there’s a problem.&lt;/p&gt;

&lt;p&gt;One attention mechanism usually ends up focusing on a limited kind of relationship at a time.&lt;/p&gt;

&lt;p&gt;Language doesn’t work like that. A sentence can have structure, meaning, and long-range links all at once.&lt;/p&gt;

&lt;p&gt;That’s why transformers use &lt;strong&gt;multi-head attention&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens in multi-head attention
&lt;/h2&gt;

&lt;p&gt;Instead of doing attention once, the model does it multiple times in parallel.&lt;/p&gt;

&lt;p&gt;Each run is called a head, and each head has its own learned weights for Query, Key, and Value.&lt;/p&gt;

&lt;p&gt;So every head looks at the same sentence, but in its own way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it flows
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The input embeddings are first prepared&lt;/li&gt;
&lt;li&gt;They are split into multiple heads using linear projections&lt;/li&gt;
&lt;li&gt;Each head runs its own self-attention&lt;/li&gt;
&lt;li&gt;Each head produces its own output&lt;/li&gt;
&lt;li&gt;All outputs are joined back together&lt;/li&gt;
&lt;li&gt;A final layer mixes them into one result&lt;/li&gt;
&lt;/ul&gt;
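&lt;p&gt;The flow above can be sketched with plain lists. The embedding values are invented, and a simple scaling stands in for each head's attention and linear projections:&lt;/p&gt;

```python
# An invented 4-dimensional embedding split across 2 heads of size 2.
embedding = [0.1, 0.7, 0.3, 0.9]
n_heads = 2
head_size = len(embedding) // n_heads

# Split: each head gets its own slice of the (projected) embedding.
heads = [embedding[i * head_size:(i + 1) * head_size] for i in range(n_heads)]

# Each head would run its own self-attention; plain scaling stands in here.
head_outputs = [[x * 2 for x in head] for head in heads]

# Join: concatenate the head outputs back into one vector, ready for the
# final linear layer that mixes them into one result.
combined = [x for head in head_outputs for x in head]
print(combined)
```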

&lt;h2&gt;
  
  
  Why this works better than the previous approach
&lt;/h2&gt;

&lt;p&gt;Different heads naturally pick up different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;word order and grammar&lt;/li&gt;
&lt;li&gt;nearby word relationships&lt;/li&gt;
&lt;li&gt;long-distance links&lt;/li&gt;
&lt;li&gt;meaning-based connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of forcing one attention mechanism to do everything, the model spreads the job across multiple perspectives.&lt;/p&gt;

&lt;p&gt;One head is like reading a sentence with one focus.&lt;/p&gt;

&lt;p&gt;Multiple heads is like reading it several times, each time noticing something different, then combining those notes.&lt;/p&gt;

&lt;p&gt;Multi-head attention doesn’t change the idea of self-attention. It just runs it multiple times in parallel so the model can understand language from different angles at once.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 17: Generating the Output Word</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 01 May 2026 20:53:08 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-transformers-part-17-generating-the-output-word-35ol</link>
      <guid>https://forem.com/rijultp/understanding-transformers-part-17-generating-the-output-word-35ol</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06"&gt;previous article&lt;/a&gt;, we set up the residual connections to get the final output values from the decoder.&lt;/p&gt;

&lt;p&gt;In this article, we begin by passing these two output values through a fully connected layer.&lt;/p&gt;

&lt;p&gt;This layer has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One input for each value representing the current token
(in this case, 2 inputs)&lt;/li&gt;
&lt;li&gt;One output for each word in the output vocabulary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since our vocabulary has 4 tokens, this gives us 4 output values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch2kyaj0ttufavtu4qe2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch2kyaj0ttufavtu4qe2.png" alt=" " width="680" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Selecting the Output Word
&lt;/h4&gt;

&lt;p&gt;Next, we pass these 4 output values through a softmax function.&lt;/p&gt;

&lt;p&gt;This allows us to select the most likely output word, which in this case is “vamos”.&lt;/p&gt;
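&lt;p&gt;This selection step can be sketched as follows; the logit values are invented for illustration, chosen so that “vamos” wins:&lt;/p&gt;

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["ir", "vamos", "y", "<EOS>"]
logits = [1.2, 3.5, 0.3, 0.8]   # invented fully-connected-layer outputs

# Softmax turns the raw outputs into probabilities; pick the largest.
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(predicted)
```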

&lt;p&gt;So far, the translation is correct. However, the process does not stop here.&lt;/p&gt;

&lt;h4&gt;
  
  
  Continuing the Decoding Process
&lt;/h4&gt;

&lt;p&gt;The decoder continues generating words until it produces an &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token, which indicates the end of the sentence.&lt;/p&gt;

&lt;p&gt;To generate the next word, we feed the predicted word back into the decoder.&lt;/p&gt;

&lt;p&gt;We will explore this step in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers – Part 16: Preparing for Output Prediction with Residual Connections</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 29 Apr 2026 21:26:20 +0000</pubDate>
      <link>https://forem.com/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06</link>
      <guid>https://forem.com/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-15-scaling-and-combining-values-in-encoder-decoder-attention-4dfm"&gt;previous article&lt;/a&gt;, we handled values in encoder-decoder attention, now we will simplify the diagram a bit add another set of residual connections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o38bbt86p8t40cvkss9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o38bbt86p8t40cvkss9.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows the encoder–decoder attention to focus on the relationships between the output words and the input words, without needing to preserve the self-attention and positional encoding from earlier.&lt;/p&gt;

&lt;p&gt;Lastly, we need a way to take these two values that represent the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token in the decoder and select one of the four output tokens: ir, vamos, y, or &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4u4mu2hg41fyec6edqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4u4mu2hg41fyec6edqc.png" alt=" " width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do this, we pass these two values through a fully connected layer.&lt;/p&gt;
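&lt;p&gt;A minimal sketch of that fully connected layer in plain Python: two input values go in, and one logit comes out per output token. The weights, biases, and inputs below are made-up numbers, not the article's actual values:&lt;/p&gt;

```python
# Fully connected layer: 2 decoder values mapped to 4 token logits.
def fully_connected(x, weights, biases):
    # One output per token: dot product of the inputs with that token's weights
    return [sum(xi * wi for xi, wi in zip(x, col)) + b
            for col, b in zip(weights, biases)]

x = [0.6, -1.2]                      # the two values from the decoder
weights = [[1.0, 0.5], [-0.3, 2.0],  # one weight pair per output token
           [0.7, -0.1], [0.2, 0.4]]
biases = [0.0, 0.1, -0.2, 0.05]

logits = fully_connected(x, weights, biases)
print(len(logits))  # 4 logits: one each for ir, vamos, y, and EOS
```

&lt;p&gt;The four resulting logits are what a softmax would then turn into probabilities over the output vocabulary.&lt;/p&gt;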

&lt;p&gt;We will explore this further in the next article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
