<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aagam</title>
    <description>The latest articles on Forem by Aagam (@mangaweeb340521).</description>
    <link>https://forem.com/mangaweeb340521</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3816001%2F3da28988-e822-49ce-9e04-101d238981e2.jpg</url>
      <title>Forem: Aagam</title>
      <link>https://forem.com/mangaweeb340521</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mangaweeb340521"/>
    <language>en</language>
    <item>
      <title>Neural Networks: The First Step Toward Understanding Transformers</title>
      <dc:creator>Aagam</dc:creator>
      <pubDate>Mon, 16 Mar 2026 19:04:51 +0000</pubDate>
      <link>https://forem.com/mangaweeb340521/neural-networks-the-first-step-toward-understanding-transformers-enh</link>
      <guid>https://forem.com/mangaweeb340521/neural-networks-the-first-step-toward-understanding-transformers-enh</guid>
      <description>&lt;h1&gt;
  
  
  Understanding Neural Networks — The Foundation You Need Before Transformers
&lt;/h1&gt;

&lt;p&gt;Whenever we try to start learning about LLMs, the first word we come across is "Transformer," which comes from the paper &lt;em&gt;Attention Is All You Need&lt;/em&gt;. But when we start reading about it, we find it confusing. Do you know why? Because while reading, we come across a popular term, "neural architecture," and we think, "Yeah, we understand it," but to be honest, we don't.&lt;/p&gt;

&lt;p&gt;My goal is to make all the basics required to understand Transformers clear. In today's post, we will discuss neural networks, divided into five sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is a neural network?&lt;/li&gt;
&lt;li&gt;Why do we need a neural network?&lt;/li&gt;
&lt;li&gt;How does it work?&lt;/li&gt;
&lt;li&gt;What advantages does it provide, and what are the use cases and limitations?&lt;/li&gt;
&lt;li&gt;What more can be explored in the future — research areas&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What Is a Neural Network?
&lt;/h2&gt;

&lt;p&gt;A neural network, put very simply, is a way to make our model mimic how the brain functions, or, you could say, how it processes information. It consists of internal units called neurons (nodes) that transform the input data through weighted connections and nonlinear functions to learn patterns from the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why Do We Need a Neural Network?
&lt;/h2&gt;

&lt;p&gt;Now another question arises: why do we even need a neural network? Before neural networks, we were heavily dependent on feature engineering and handcrafted rules for predictions. But this approach struggled with complex data, which gave rise to the need for something more capable: the neural network.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. How Does a Neural Network Work?
&lt;/h2&gt;

&lt;p&gt;Now for the hardest but most exciting part: how exactly does a neural network work? I think this is something a lot of people skip over, and in today's era it is what separates good engineers from the rest. To understand how it works, we will divide the process into six simple steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Input Layer
&lt;/h3&gt;

&lt;p&gt;A neural network does not understand raw images, text, or audio, so everything must first be converted into a numerical vector. As the name suggests, this layer receives the input data, which can be anything: a video, an image, or something else entirely. The relevant features of that input are extracted and packed into one single input vector. For example, to predict house prices: &lt;code&gt;Input = [size, rooms, location_score]&lt;/code&gt;.&lt;/p&gt;
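&lt;p&gt;As a rough sketch (the feature names and values here are made up for illustration), packing a raw house record into that input vector might look like:&lt;/p&gt;

```python
# Hypothetical raw record for one house (illustrative values only).
house = {"size_sqft": 1200, "rooms": 3, "location_score": 0.8}

# The network only sees numbers, so we flatten the features
# into one ordered vector: [size, rooms, location_score].
x = [house["size_sqft"], house["rooms"], house["location_score"]]

print(x)  # [1200, 3, 0.8]
```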

&lt;h3&gt;
  
  
  Step 2 — Linear Transformation
&lt;/h3&gt;

&lt;p&gt;Once the input is in numerical vector format, we perform a linear combination of the inputs. Confused? Yeah, I was too. Like, what exactly is a "linear combination of inputs," and why do we do it?&lt;/p&gt;

&lt;p&gt;I referred to blogs and GPT but still couldn't understand it. Let's figure it out together.&lt;/p&gt;

&lt;p&gt;So, let's first understand what we have: a vector from the input (image, video, or whatever), which will of course have multiple features. We don't know which features we need to give importance to — even our model doesn't. So we assign weights. How are the weights calculated? That's another big topic, but to simplify: we assign random weights at first, train the model on the input data, calculate the error with our weights, adjust the weights, and repeat the process thousands of times to arrive at valid weight values.&lt;/p&gt;

&lt;p&gt;We then multiply these weights by the input features and sum them up. Why? Because this tells the system how strongly to weigh each feature in its prediction. However, the result is still linear; we have not captured the true relationship yet. That is why we perform the next step.&lt;/p&gt;
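&lt;p&gt;A minimal sketch of that weighted sum (the weights and bias here are made-up illustrative numbers, not trained values):&lt;/p&gt;

```python
# Input features: size, rooms, location_score (illustrative values).
x = [1200, 3, 0.8]
# One weight per feature, plus a bias term (made-up numbers).
w = [0.5, 10.0, 40.0]
b = 2.0

# Linear combination: z = w1*x1 + w2*x2 + w3*x3 + b
z = sum(wi * xi for wi, xi in zip(w, x)) + b

print(z)  # 600.0 + 30.0 + 32.0 + 2.0 = 664.0
```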

&lt;h3&gt;
  
  
  Step 3 — Activation Function
&lt;/h3&gt;

&lt;p&gt;So what we currently have is a raw signal (or, for people like me, a "random value"), because on its own we don't really know what the number coming out of the linear transformation means.&lt;/p&gt;

&lt;p&gt;For this particular blog, I'm taking ReLU as the activation function, but there are other activation functions as well. ReLU converts this raw value by turning negative values to 0 and keeping positive values as they are. It basically means: if the answer is 0, don't fire (or use) this neuron; otherwise, fire it. Whatever output we get from the activation function becomes the input for the next step. Why? Let's see and understand together.&lt;/p&gt;
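&lt;p&gt;ReLU itself is tiny; a sketch in code:&lt;/p&gt;

```python
def relu(z):
    # ReLU: negative values become 0, positive values pass through unchanged.
    return max(0.0, z)

print(relu(-2.5))  # 0.0  (the neuron does not fire)
print(relu(3.7))   # 3.7  (the signal passes through as-is)
```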

&lt;h3&gt;
  
  
  Step 4 — Forward Propagation (Information Flows Through the Network)
&lt;/h3&gt;

&lt;p&gt;To understand forward propagation, we need to first understand the structure. Let's take a simple neural network architecture: it consists of multiple layers, each layer extracts meaningful information from the raw data, and each layer has multiple neurons. Each neuron has one job — to check if a particular pattern exists, and if it does, how strongly it exists.&lt;/p&gt;

&lt;p&gt;To make it simple, let's take an example. Suppose I have three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt; detects simple patterns: one neuron detects edges, another detects corners, and another detects shapes. It then sends the information to Layer 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt; detects complex shapes and patterns from the previous layer's output. It then forwards the information to Layer 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3&lt;/strong&gt; produces the final output. We apply a softmax function, which converts the raw outputs into probabilities, for example, the probability that something is a dog's ear, a cat's ear, or something else.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Kindly find the diagram below.)&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeg2szh9638qpkdg0e8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeg2szh9638qpkdg0e8t.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 — Loss Function
&lt;/h3&gt;

&lt;p&gt;Now, once our model makes its prediction after forward propagation, it checks whether its prediction was correct or incorrect. We use &lt;strong&gt;cross-entropy loss&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L = −∑ y log(ŷ)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cross-entropy measures how wrong the prediction is. Here, &lt;strong&gt;y&lt;/strong&gt; is the true label (1 for the correct class, 0 otherwise) and &lt;strong&gt;ŷ&lt;/strong&gt; is the predicted probability for that class. The log tells us that the loss is small when the correct class has a high probability and large when the correct class has a low probability.&lt;/p&gt;

&lt;p&gt;So our model's end goal is to minimize the loss. Once it gets the value of the loss, it moves to the next step: backward propagation.&lt;/p&gt;
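&lt;p&gt;A small sketch of cross-entropy with a one-hot label (the probabilities are made up) shows exactly this behavior:&lt;/p&gt;

```python
import math

def cross_entropy(y_true, y_pred):
    # L = -sum(y * log(y_hat)); with a one-hot label this picks out
    # the predicted probability of the correct class.
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))

y = [1, 0, 0]  # one-hot: the correct class is class 0

confident = cross_entropy(y, [0.9, 0.05, 0.05])  # correct class gets high probability
uncertain = cross_entropy(y, [0.1, 0.45, 0.45])  # correct class gets low probability

print(round(confident, 3))  # 0.105  (small loss)
print(round(uncertain, 3))  # 2.303  (large loss)
```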

&lt;h3&gt;
  
  
  Step 6 — Backward Propagation
&lt;/h3&gt;

&lt;p&gt;We all kind of get the gist of backpropagation: the network goes back, works out how much each layer contributed to the error, and adjusts the weights accordingly, right? But have you ever wondered &lt;em&gt;how&lt;/em&gt; it knows who contributed to the error, and by how much?&lt;/p&gt;

&lt;p&gt;Well, here comes the main part: backward propagation uses the &lt;strong&gt;chain rule&lt;/strong&gt;. Now, I know a lot of us get scared when we hear "chain rule," but let's simplify it together. In the chain rule, we basically ask: "If we change the weight for a particular neuron slightly, how much does it affect the loss function?" We get the derivative using the chain rule, and once that is done, we update the weight using the formula below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;new_weight = old_weight − learning_rate × gradient&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gradient here is simply the derivative we found for that particular weight using the chain rule. And yes, the learning rate is generally shared across the whole model, but in practice we often change it over the course of training with a learning-rate schedule.&lt;/p&gt;
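&lt;p&gt;Here is a toy sketch of that update rule on a one-weight model. To keep it short, the gradient comes from a numerical derivative rather than the chain rule, and the loss and numbers are made up:&lt;/p&gt;

```python
def loss(w):
    # Toy squared-error loss for a one-weight model on one example:
    # we want w * 2.0 to equal 10.0, so the best w is 5.0.
    x, y_true = 2.0, 10.0
    return (w * x - y_true) ** 2

w = 3.0              # start from an arbitrary weight
learning_rate = 0.05
eps = 1e-6

for step in range(20):
    # Numerical derivative: how much does the loss change if we nudge w slightly?
    gradient = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    # The update rule from the text: new_weight = old_weight - learning_rate * gradient
    w = w - learning_rate * gradient

print(round(w, 3))  # 5.0 -- repeated updates drive the loss toward its minimum
```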




&lt;h2&gt;
  
  
  4. Advantages and Limitations of Neural Networks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Because neural networks adjust their weights themselves and learn patterns from data, they can handle complex pattern recognition where traditional models used to fail.&lt;/li&gt;
&lt;li&gt;They can extract features automatically without the need for manual feature engineering.&lt;/li&gt;
&lt;li&gt;They became the foundation of modern AI.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;Although neural networks come with a lot of advantages, there are some limitations too:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;They need a huge amount of data; with a small dataset, there is a high chance the model will overfit.&lt;/li&gt;
&lt;li&gt;As we saw in the "how it works" section, they come with a high computational cost.&lt;/li&gt;
&lt;li&gt;Lack of interpretability: we often cannot explain why the model made a specific decision. The same problem persists in today's modern AI; even the CEOs of big AI companies say they don't fully know how it works.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  5. What More Can Be Explored — Research Areas
&lt;/h2&gt;

&lt;p&gt;For me personally, one thing that can really be researched is the &lt;strong&gt;mechanistic interpretability of neural networks&lt;/strong&gt;: understanding how neural networks internally represent concepts and make decisions. To frame it as a question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How can we automatically discover interpretable functional circuits inside large neural networks instead of relying on manual neuron-level analysis?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Got a Surprise API Bill. So I Built a Runtime That Enforces Agent Budgets.</title>
      <dc:creator>Aagam</dc:creator>
      <pubDate>Tue, 10 Mar 2026 05:35:58 +0000</pubDate>
      <link>https://forem.com/mangaweeb340521/i-got-a-surprise-api-bill-so-i-built-a-runtime-that-enforces-agent-budgets-51p</link>
      <guid>https://forem.com/mangaweeb340521/i-got-a-surprise-api-bill-so-i-built-a-runtime-that-enforces-agent-budgets-51p</guid>
      <description>&lt;p&gt;I was running an AI agent nothing fancy, just a research task. Left it running overnight.&lt;br&gt;
Woke up to a bill I didn't expect.&lt;br&gt;
The agent hadn't done anything malicious. It just... kept going. Looping, retrying, calling the model over and over, because nobody told it to stop. No token cap. No cost limit. No guardrails whatsoever.&lt;br&gt;
That's the thing nobody talks about with AI agents: they're eager. Give them a task and they'll spend whatever it takes to finish it — or to try to finish it. And if you're not watching, you find out the hard way.&lt;br&gt;
That's why I built Joule.&lt;/p&gt;

&lt;h2&gt;The Core Problem&lt;/h2&gt;

&lt;p&gt;Most agent frameworks (LangChain, CrewAI, AutoGen) are great at building agents: defining tools, chaining steps, routing between models. They solve the "how do I make the agent do things" problem well.&lt;br&gt;
But none of them solve the "how do I stop the agent from doing too much" problem.&lt;br&gt;
There's no built-in budget. No hard stop. No governance. You're basically handing your agent a credit card with no limit and hoping for the best.&lt;/p&gt;

&lt;h2&gt;What Joule Does Differently&lt;/h2&gt;

&lt;p&gt;Joule is an agent runtime where every task runs inside a budget envelope. You set limits before execution. The agent operates within them. When a limit is hit, it stops cleanly, with a structured result. Here's the simplest usage:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { Joule } from '@joule/core';

const answer = await Joule.simple("Summarize the top 3 HN stories today");
console.log(answer);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Auto-detects your API keys. Applies a sensible default budget. Returns a string. No surprises. For production, you take control:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;const joule = new Joule({
  providers: { anthropic: { enabled: true } },
  budget: {
    maxTokens: 50_000,
    maxCostUsd: 0.50,
  },
  governance: {
    constitution: 'default',
    requireApproval: ['shell_exec'],
  },
});
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;const result = await joule.execute({
  description: "Analyze our Q4 metrics and draft a summary",
  budget: 'medium',
});

console.log(result.result);
console.log(`Cost: $${result.budgetUsed.costUsd} | Tokens: ${result.budgetUsed.tokensUsed}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The agent runs. It hits a step where it would exceed $0.50. It stops. It returns what it has. You get a &lt;code&gt;budgetUsed&lt;/code&gt; breakdown with every execution: token count, dollar cost, latency, tool calls.&lt;/p&gt;

&lt;h2&gt;Budget Enforcement Across 7 Dimensions&lt;/h2&gt;

&lt;p&gt;One thing I wanted to get right was what gets budgeted. Token and cost limits are obvious. But agents can go wrong in other ways too. Joule tracks and enforces limits across:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;What it limits&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Tokens&lt;/td&gt;&lt;td&gt;Total LLM tokens consumed&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cost (USD)&lt;/td&gt;&lt;td&gt;Dollar spend on API calls&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Latency&lt;/td&gt;&lt;td&gt;Wall-clock time&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tool calls&lt;/td&gt;&lt;td&gt;Number of tool invocations&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Escalations&lt;/td&gt;&lt;td&gt;Upgrades from cheap → expensive models&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Energy (Wh)&lt;/td&gt;&lt;td&gt;Estimated compute energy&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Carbon (gCO₂)&lt;/td&gt;&lt;td&gt;Estimated emissions&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The energy and carbon tracking came from my other research in LLM inference optimization: it felt wrong to build a "cost-aware" runtime that ignored the environmental cost. So those dimensions are first-class.&lt;/p&gt;

&lt;h2&gt;Smart Routing: Start Cheap, Escalate Only When Needed&lt;/h2&gt;

&lt;p&gt;One of the key things that makes Joule efficient is the model router. By default, it picks the smallest model that can handle the task (local Ollama, gpt-4o-mini, Haiku) and only escalates to a larger model if the task genuinely needs it and the budget allows it.&lt;br&gt;
For simple tasks, the planning prompt drops from ~2900 tokens to ~50 by stripping tool descriptions entirely. It's not a trick; it's just: don't send information the model doesn't need.&lt;br&gt;
The result: in benchmarks against CrewAI across 30 tasks, Joule was 1.5x faster on average, with the biggest gap on generation tasks (1.8x). And that's before counting the budget enforcement and governance that CrewAI doesn't do at all.&lt;/p&gt;

&lt;h2&gt;Governance: What the Agent Is Allowed to Do&lt;/h2&gt;

&lt;p&gt;Budget is about how much. Governance is about what. Joule has a constitutional layer with three tiers of rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hard boundaries&lt;/strong&gt;: never violated, no override. ("Never expose PII.")&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Soft boundaries&lt;/strong&gt;: can be overridden with authority + audit trail. ("Prefer local models.")&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aspirational principles&lt;/strong&gt;: guide behavior, don't block execution. ("Minimize token usage.")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You configure it in YAML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;governance:
  constitution: default
  requireApproval:
    - shell_exec       # human-in-the-loop before running shell commands
    - file_delete      # prevent accidental data loss
  budget:
    maxCostUsd: 1.00   # hard stop
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And there's a trust scoring system: agents earn autonomy through clean behavior. New agents get watched closely. A clean track record means more tools unlocked and less oversight. A violation means demotion, increased scrutiny, or quarantine.&lt;br&gt;
It sounds elaborate, but it basically answers the question: if I'm running 10 agents in parallel, how do I know they're not doing something I didn't ask for?&lt;/p&gt;

&lt;h2&gt;Current Status&lt;/h2&gt;

&lt;p&gt;The runtime is in active development, with 1140 tests passing across 91 files. The core (task execution, budget enforcement, model routing, governance, multi-agent crews) is solid. The observability layer (React dashboard, Prometheus metrics, OTLP/Langfuse export) is working.&lt;br&gt;
Known rough edges: the computer/desktop automation agent is good for Office tasks but struggles with complex browser workflows. The governance layer is implemented but still maturing.&lt;/p&gt;

&lt;h2&gt;Why This Matters Now&lt;/h2&gt;

&lt;p&gt;Agent frameworks are everywhere. The "build an agent in 10 lines" demo is easy. Shipping an agent to production, one that doesn't blow your budget, doesn't do things you didn't authorize, and that you can debug after the fact, is still genuinely hard.&lt;br&gt;
That's the gap Joule is trying to close.&lt;br&gt;
If you've ever woken up to a surprise API bill, you already understand the problem. The solution shouldn't be "just remember to add a token limit." It should be baked into the runtime.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Aagam-Bothara/Joule&lt;br&gt;
Feedback, issues, and contributions very welcome. Still early — but the foundation is there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
