<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Renato D. Prado</title>
    <description>The latest articles on Forem by Renato D. Prado (@renato_dprado_bbf0901ec).</description>
    <link>https://forem.com/renato_dprado_bbf0901ec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3203878%2Fa42acaad-6826-4c55-b46f-9741d0c7f306.png</url>
      <title>Forem: Renato D. Prado</title>
      <link>https://forem.com/renato_dprado_bbf0901ec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/renato_dprado_bbf0901ec"/>
    <language>en</language>
    <item>
      <title>Agentic AI - Part 1: foundations</title>
      <dc:creator>Renato D. Prado</dc:creator>
      <pubDate>Wed, 13 May 2026 20:01:23 +0000</pubDate>
      <link>https://forem.com/renato_dprado_bbf0901ec/agentic-ai-part-1-foundations-40ga</link>
      <guid>https://forem.com/renato_dprado_bbf0901ec/agentic-ai-part-1-foundations-40ga</guid>
      <description>&lt;h1&gt;
  
  
  Agentic AI: a tech lead's glossary
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Study notes from courses like Pluralsight on agentic AI and other references, organized as a glossary I wish I'd had on day one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every dev I know is using AI tools, and most of us are fuzzy on the words behind them. Where does a transformer fit in? What does MCP actually solve? Is "agentic AI" a real thing or just rebranded chatbots?&lt;/p&gt;

&lt;p&gt;This is my map of the territory: machine learning at the bottom, agents and MCP at the top, and the concepts in between — tokens, memory, tools, RAG, vector databases. Built for lookup, not for a single read-through. If a term is fuzzy in your head, jump to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Foundations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Machine Learning
&lt;/h3&gt;

&lt;p&gt;Normal coding: you write the rules. &lt;em&gt;"If the email subject says 'free money,' mark it as spam."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Machine learning flips that. You don't write the rules — the program &lt;em&gt;finds&lt;/em&gt; them by trial and error. You give it thousands of emails labeled "spam" or "not spam." For each one, the program guesses the label, compares its guess to the correct answer, and nudges a bunch of internal numbers (its &lt;strong&gt;weights&lt;/strong&gt;) to be a little less wrong. Do that millions of times across thousands of examples, and those weights settle into something useful.&lt;/p&gt;

&lt;p&gt;What does a weight actually &lt;em&gt;look&lt;/em&gt; like? Picture the spam detector keeping a running score for each email. See the word "viagra"? Add 0.8 to the score. Unknown sender? Add 0.5. Lots of exclamation marks? Add 0.2. If the total clears a threshold, the email gets tagged as spam. Those numbers — 0.8, 0.5, 0.2 — are weights. Training is the process of finding what each one should be.&lt;/p&gt;
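&lt;p&gt;A minimal sketch of that score-and-threshold idea, with the weights hard-coded. (Training's whole job is replacing these hand-set numbers with learned ones; the features and values below are just the ones from the example.)&lt;/p&gt;

```python
# Toy spam scorer. The weights are hand-set here; a trained model
# would have learned these numbers from labeled examples.
WEIGHTS = {"viagra": 0.8, "unknown_sender": 0.5, "exclamations": 0.2}
THRESHOLD = 1.0

def spam_score(features):
    """Sum the weight of every known feature present in the email."""
    return sum(WEIGHTS.get(f, 0.0) for f in features)

def is_spam(features):
    return spam_score(features) > THRESHOLD

print(is_spam(["viagra", "unknown_sender"]))  # 0.8 + 0.5 = 1.3, clears 1.0: True
print(is_spam(["exclamations"]))              # 0.2: False
```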

&lt;p&gt;&lt;strong&gt;What comes out of training isn't code. It's the model — a file holding all those weights. A simple spam filter might have thousands. An LLM has billions. Hand the model a new email, and it produces a verdict: spam or not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Want to feel it for yourself? Try Google's &lt;strong&gt;Teachable Machine&lt;/strong&gt; (&lt;a href="https://teachablemachine.withgoogle.com" rel="noopener noreferrer"&gt;teachablemachine.withgoogle.com&lt;/a&gt;). Train an image classifier in your browser with your webcam — show it a few photos of "happy face" vs "sad face" and watch it learn to tell them apart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where LLMs fit
&lt;/h3&gt;

&lt;p&gt;A few useful distinctions before going further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI&lt;/strong&gt; is the broad field — anything where a machine does something we'd call intelligent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine learning&lt;/strong&gt; is a subset of AI: systems that learn patterns from data instead of being explicitly programmed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep learning&lt;/strong&gt; is a subset of ML that uses multi-layer neural networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; are a specific application of deep learning, built on the transformer architecture (introduced in 2017).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So: &lt;strong&gt;LLM ⊂ deep learning ⊂ ML ⊂ AI.&lt;/strong&gt; Claude, ChatGPT, Gemini, Llama — all LLMs.&lt;/p&gt;

&lt;p&gt;These terms get used interchangeably, but they're not the same thing. A spam filter and ChatGPT are both AI, both ML, but only one is an LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neural networks and deep learning
&lt;/h3&gt;

&lt;p&gt;A neural network is one specific kind of ML algorithm. Loosely inspired by the brain: layers of nodes connected to each other, every connection carrying one of those weights from before. Input enters on one side, flows through the layers, a prediction comes out. The training loop is the same as before — guess, check, nudge the weights — there are just far more weights to nudge.&lt;/p&gt;

&lt;p&gt;"Deep learning" is just a neural network with many layers — the &lt;em&gt;deep&lt;/em&gt; refers to the layer count.&lt;/p&gt;

&lt;p&gt;Why depth matters: each layer learns a more abstract pattern than the one below it. For image recognition, the first layer might detect edges; the next, shapes; the next, eyes and noses; the next, full faces. For text: letters, then words, then phrases, then meaning. Stack enough layers and the model captures complex patterns.&lt;/p&gt;

&lt;p&gt;The deep learning boom of the 2010s happened because GPUs made training big networks practical, and there was enough digital data to feed them.&lt;/p&gt;

&lt;p&gt;LLMs are deep neural networks. Specifically, ones built on an architecture called the transformer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transformers — and what makes an LLM different
&lt;/h3&gt;

&lt;p&gt;Earlier language models processed text word by word, in order — slow to train, and they tended to lose track of context across long sentences. The &lt;strong&gt;transformer&lt;/strong&gt;, introduced in a 2017 Google paper called &lt;em&gt;Attention Is All You Need&lt;/em&gt;, fixed both problems with two changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It looks at the whole input at once.&lt;/strong&gt; No more reading sequentially. The model sees the full sequence and processes it in parallel — which also made training scale across GPU clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention pairs every word with the others that matter.&lt;/strong&gt; Think of it like reading with a highlighter, where the highlighting is automatic — learned from training data. For each word in the input, attention scores the others by relevance, weights them, and blends the most important ones into how that word is processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concrete example: in &lt;em&gt;"The cat sat on the mat. It was tired,"&lt;/em&gt; when the model reaches &lt;em&gt;"it,"&lt;/em&gt; attention scores &lt;em&gt;"cat"&lt;/em&gt; high and &lt;em&gt;"mat"&lt;/em&gt; low. That's how the model knows &lt;em&gt;"it"&lt;/em&gt; refers back to &lt;em&gt;"cat."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every well-known LLM since runs on this architecture: GPT, Claude, Gemini, Llama. Same core design, just bigger training data and more weights.&lt;/p&gt;

&lt;p&gt;So what's actually different about an LLM compared to the spam filter from earlier?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generality.&lt;/strong&gt; Old ML: one model per task. An LLM is a single model that handles summarization, code, translation, and reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation vs. classification.&lt;/strong&gt; Old ML predicts a label or a number ("spam" / "not spam"). LLMs produce text, which lets them do almost anything we can describe in words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale.&lt;/strong&gt; Old ML: thousands to millions of weights, single GPU. LLMs: billions of weights, GPU clusters, weeks of training, millions of dollars per training run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tokens
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;token&lt;/strong&gt; is a chunk of text the model reads — sometimes a full word, sometimes a fragment. &lt;em&gt;"Hello, world"&lt;/em&gt; might become three tokens: &lt;code&gt;Hello&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;world&lt;/code&gt;. The word &lt;em&gt;"strawberry"&lt;/em&gt; might split into two: &lt;code&gt;straw&lt;/code&gt;, &lt;code&gt;berry&lt;/code&gt;. Each token is then mapped to a number, and those numbers are what the model actually processes.&lt;/p&gt;

&lt;p&gt;Every model has its own tokenizer (the piece that does the chopping), so the same sentence in GPT, Claude, and Gemini can produce different token counts.&lt;/p&gt;

&lt;p&gt;The practical consequence: &lt;strong&gt;context limits and pricing are measured in tokens, not words.&lt;/strong&gt; A model's &lt;strong&gt;context window&lt;/strong&gt; is how much text it can work with at once — its working memory. Rule of thumb: 1 English word ≈ 1.3 tokens.&lt;/p&gt;

&lt;p&gt;Every API charges two things separately: &lt;strong&gt;input tokens&lt;/strong&gt; (everything the model reads — system prompt + conversation history + retrieved memory + tool definitions + the new user message) and &lt;strong&gt;output tokens&lt;/strong&gt; (what the model generates, typically priced higher per token than input).&lt;/p&gt;

&lt;p&gt;Every turn, the entire conversation history gets sent as input. By turn 10, the input includes 9 prior turns plus the new message. Token usage grows with chat length.&lt;/p&gt;
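&lt;p&gt;A back-of-the-envelope sketch of that growth, using the 1.3 tokens-per-word rule of thumb from above (illustrative only; a real tokenizer counts differently):&lt;/p&gt;

```python
# Rough cost model: every turn resends the full history as input tokens.
def estimate_tokens(text):
    return round(len(text.split()) * 1.3)   # rule-of-thumb estimate

history = []
total_input = 0
for turn in range(1, 6):
    message = f"user message number {turn} with a few words"
    history.append(message)
    # Input for this turn: the new message PLUS every prior message.
    total_input += sum(estimate_tokens(m) for m in history)

print(total_input)  # 150: input grows quadratically with chat length
```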




&lt;h2&gt;
  
  
  Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Agentic AI
&lt;/h3&gt;

&lt;p&gt;A regular LLM chat is one prompt in, one response out. &lt;strong&gt;Agentic AI&lt;/strong&gt; turns that into a loop: the LLM picks an action, the system runs it, the result feeds back, the LLM picks again. Actions come from &lt;strong&gt;tools&lt;/strong&gt; — APIs, code execution, file edits, database queries.&lt;/p&gt;

&lt;p&gt;There's a spectrum between plain chat and a full agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow&lt;/strong&gt; — a fixed pipeline of LLM calls. The path is hard-coded; the LLM just fills in the steps. &lt;em&gt;Example: classify ticket → summarize → draft reply.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; — no fixed path. The LLM is given a goal and a set of tools, picks the next action, observes the result, and picks again. The loop continues until the goal is reached or a stop condition fires (step limit, token limit, failure).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference: &lt;em&gt;who decided the next step — a human writing the pipeline, or the LLM at runtime?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The agent loop
&lt;/h3&gt;

&lt;p&gt;An agent is an &lt;strong&gt;LLM in a loop with tools and memory&lt;/strong&gt;. One cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent receives a task.&lt;/li&gt;
&lt;li&gt;The LLM picks what to do next.&lt;/li&gt;
&lt;li&gt;The agent runs a tool, calls an API, writes a file, etc.&lt;/li&gt;
&lt;li&gt;The agent observes the result.&lt;/li&gt;
&lt;li&gt;Back to step 2 with the new information.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll see this loop written in different ways — &lt;em&gt;perceive → reason → act → learn&lt;/em&gt;, or &lt;em&gt;plan → act → adapt&lt;/em&gt;. Same loop.&lt;/p&gt;
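&lt;p&gt;The loop above as code. Everything here is a stand-in (the &lt;code&gt;llm()&lt;/code&gt; function is a canned fake and the tools are stubs), but the control flow is the real shape:&lt;/p&gt;

```python
# Minimal agent loop sketch. A real implementation would call a provider
# SDK for llm() and parse a structured tool-call reply.
TOOLS = {
    "search": lambda query: f"results for {query}",
    "done":   lambda answer: answer,
}

def llm(task, observations):
    # Pretend model: search once, then finish. A real LLM decides this itself.
    if observations:
        return ("done", f"answer based on {observations[-1]}")
    return ("search", task)

def run_agent(task, max_steps=5):
    observations = []
    for _ in range(max_steps):                 # stop condition: step limit
        tool, arg = llm(task, observations)    # 2. LLM picks the next action
        result = TOOLS[tool](arg)              # 3. the system runs the tool
        if tool == "done":
            return result
        observations.append(result)            # 4. observe, then back to 2
    return "step limit reached"

print(run_agent("find the F1 winner"))
```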

&lt;h3&gt;
  
  
  Three components
&lt;/h3&gt;

&lt;p&gt;An agent has three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt; — the reasoning engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — what the agent remembers across steps and sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — what the agent can actually do (call APIs, run code, search files, edit data).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strip any one and you're back to plain chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four characteristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous execution.&lt;/strong&gt; Runs without step-by-step human input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal-oriented.&lt;/strong&gt; Works toward an objective, not a fixed script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactivity.&lt;/strong&gt; Initiates actions on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration.&lt;/strong&gt; Can work with other agents (multi-agent systems).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;An LLM has no memory of its own.&lt;/strong&gt; Each call to the model is independent — it doesn't remember anything from the previous call. Any continuity you experience in ChatGPT or Claude comes from the conversation history being passed back in every turn.&lt;/p&gt;

&lt;p&gt;Memory is storage that gets auto-injected into the prompt. When the agent needs context, the system pulls relevant entries from storage and adds them to the prompt before the next LLM call.&lt;/p&gt;

&lt;p&gt;How it works in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent stores facts somewhere — JSON file, relational database, vector database, a dedicated memory service.&lt;/li&gt;
&lt;li&gt;Each turn, the system retrieves whatever matches the current question (recent messages, facts about the user, prior decisions).&lt;/li&gt;
&lt;li&gt;The retrieved content gets prepended or inserted into the prompt.&lt;/li&gt;
&lt;li&gt;The LLM sees it as context, no different from anything else in the prompt.&lt;/li&gt;
&lt;/ol&gt;
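&lt;p&gt;A sketch of that injection step. The keyword matching here is deliberately naive (real systems usually retrieve by embedding similarity), and the stored facts are made up:&lt;/p&gt;

```python
# Long-term memory as a plain list; retrieval by crude word overlap.
MEMORY = [
    "user's favorite dish is feijoada",
    "user works at Acme",
    "user prefers concise answers",
]

def retrieve(question, top_k=2):
    words = set(question.lower().split())
    scored = [(len(words.intersection(fact.lower().split())), fact) for fact in MEMORY]
    scored.sort(reverse=True)
    return [fact for score, fact in scored[:top_k] if score > 0]

def build_prompt(question):
    # Step 3: prepend whatever matched; the LLM sees it as ordinary context.
    facts = retrieve(question)
    return "Known facts: " + "; ".join(facts) + "\nQuestion: " + question

print(build_prompt("what is my favorite dish?"))
```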

&lt;p&gt;Two kinds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; — the conversation so far. The running list of messages passed each turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; — facts kept across sessions. ChatGPT's memory feature is the consumer example: tell it your favorite dish, come back next week, and it remembers — because &lt;em&gt;"user's favorite dish is feijoada"&lt;/em&gt; got stored and gets injected the next time you ask.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compressing history
&lt;/h3&gt;

&lt;p&gt;Memory has a hard limit: the model's context window. Once a chat grows past that, something has to give — either older messages get dropped (the agent forgets) or they get compressed.&lt;/p&gt;

&lt;p&gt;Agent systems handle this automatically. As the conversation approaches the context limit, the system makes a separate LLM call to summarize older messages, then carries the summary forward in place of the raw history. The LLM keeps working without hitting the wall.&lt;/p&gt;

&lt;p&gt;Side effect: every future turn sends fewer tokens, which lowers cost.&lt;/p&gt;
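&lt;p&gt;A compaction sketch. Here &lt;code&gt;summarize()&lt;/code&gt; is a stub standing in for the separate LLM call, and the limits are arbitrary:&lt;/p&gt;

```python
# Fold older messages into one summary once the history gets too long.
def summarize(messages):
    return f"summary of {len(messages)} earlier messages"   # stub LLM call

def compact(history, limit=6, keep_recent=2):
    # Keep the recent tail verbatim; replace everything older with a summary.
    if len(history) > limit:
        older, recent = history[:-keep_recent], history[-keep_recent:]
        return [summarize(older)] + recent
    return history

history = [f"message {i}" for i in range(10)]
print(compact(history))
# ['summary of 8 earlier messages', 'message 8', 'message 9']
```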

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;tool&lt;/strong&gt; is something the agent can do beyond producing text. The LLM outputs &lt;em&gt;"call tool X with these arguments,"&lt;/em&gt; the system runs it, and the result feeds back into the next LLM call.&lt;/p&gt;

&lt;p&gt;Common tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File operations (create, edit, delete, search)&lt;/li&gt;
&lt;li&gt;Web search&lt;/li&gt;
&lt;li&gt;API calls&lt;/li&gt;
&lt;li&gt;PDF parsing&lt;/li&gt;
&lt;li&gt;Code execution&lt;/li&gt;
&lt;li&gt;Database queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In API terms, this is called &lt;strong&gt;tool use&lt;/strong&gt; (Anthropic) or &lt;strong&gt;function calling&lt;/strong&gt; (OpenAI). You declare the available tools to the LLM up front; the model decides when to call them.&lt;/p&gt;

&lt;p&gt;Without tools, the agent only produces text.&lt;/p&gt;
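&lt;p&gt;The mechanic looks roughly like this. The declaration format below is a generic sketch (Anthropic calls the schema field &lt;code&gt;input_schema&lt;/code&gt;, OpenAI uses &lt;code&gt;parameters&lt;/code&gt;, and both expect JSON Schema), and &lt;code&gt;get_weather&lt;/code&gt; is a made-up tool:&lt;/p&gt;

```python
import json

# A generic tool declaration, sent to the model up front.
TOOL_DEFS = [{
    "name": "get_weather",
    "description": "Return the current weather for a city",
    "parameters": {"city": "string"},
}]

def get_weather(city):
    return {"city": city, "temp_c": 21}   # stubbed result

DISPATCH = {"get_weather": get_weather}

# Pretend the model replied with a tool call instead of plain text:
model_output = json.dumps({"tool": "get_weather", "arguments": {"city": "Lisbon"}})

call = json.loads(model_output)
result = DISPATCH[call["tool"]](**call["arguments"])
print(result)  # this result is fed back into the next LLM call
```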

&lt;h3&gt;
  
  
  MCP (Model Context Protocol)
&lt;/h3&gt;

&lt;p&gt;LLMs use tools to take action — file operations, database queries, web searches. Historically, each LLM provider defined its own format for declaring tools (OpenAI function calling, Anthropic tool use, Google function declarations), so the same integration had to be rewritten for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; is an open protocol that standardizes how applications expose tools to LLMs. Anthropic released it in November 2024; OpenAI added support in early 2025.&lt;/p&gt;

&lt;p&gt;Three pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server&lt;/strong&gt; — exposes a set of tools. Examples: a file system server, a Postgres server, a GitHub server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Client&lt;/strong&gt; — sits inside the LLM application, forwards tool calls to servers, and returns the results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Host&lt;/strong&gt; — the application the user interacts with (Claude Desktop, Cursor, etc.). It runs the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flow: the user talks to the host → the LLM decides to call a tool → the client routes the call to the right server → the server runs it → the result feeds back into the LLM.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;modelcontextprotocol.io&lt;/a&gt;. Community-built servers are listed on GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation)
&lt;/h3&gt;

&lt;p&gt;An LLM only knows what it was trained on, up to a cutoff date. Events from last week, your company's internal docs, today's stock prices — none of it is in the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; is the pattern of fetching relevant information from outside the model and injecting it into the prompt before the LLM generates its answer.&lt;/p&gt;

&lt;p&gt;How it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The system detects that the question needs information the model doesn't have.&lt;/li&gt;
&lt;li&gt;It queries an external source — web search, a vector database, a file system, an API.&lt;/li&gt;
&lt;li&gt;The relevant chunks of content come back.&lt;/li&gt;
&lt;li&gt;Those chunks get inserted into the prompt as context.&lt;/li&gt;
&lt;li&gt;The LLM answers using the fresh content.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: asking Claude &lt;em&gt;"who won last weekend's F1 race?"&lt;/em&gt; — the model doesn't know. Enable web search, the system retrieves a result page, passes the content to the model, and the model answers from the retrieved text.&lt;/p&gt;
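&lt;p&gt;The five steps in miniature. &lt;code&gt;search()&lt;/code&gt; is a stub standing in for web search or a vector-database lookup, and the retrieved sentence is invented:&lt;/p&gt;

```python
# RAG in miniature: retrieve, inject, then hand the prompt to the LLM.
def search(question):
    return ["Driver X won the Grand Prix last weekend."]   # stub retrieval

def build_rag_prompt(question):
    chunks = search(question)          # steps 2-3: fetch relevant chunks
    context = "\n".join(chunks)        # step 4: inject them as context
    return f"Use this context to answer: {context}\nQuestion: {question}"

# Step 5 would pass this prompt to the model.
print(build_rag_prompt("Who won last weekend's F1 race?"))
```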

&lt;h3&gt;
  
  
  RAG challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevant retrieval&lt;/strong&gt; — the system pulls back content that doesn't help, and the LLM answers off-topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt; — long documents have to be split into pieces small enough to fit the prompt while preserving meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control&lt;/strong&gt; — when the source contains private data, the system has to respect permissions. The model shouldn't see what the user shouldn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Vector databases
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;vector database&lt;/strong&gt; stores text (or images, or other data) as arrays of numbers, then lets you search by similarity instead of exact match.&lt;/p&gt;

&lt;p&gt;The numbers come from an &lt;strong&gt;embedding model&lt;/strong&gt;. Feed it text (or an image, or audio) and it returns a list of numbers — typically hundreds to thousands of them — that captures the meaning of the input. Think of each embedding as a coordinate in a high-dimensional space where similar things sit close together.&lt;/p&gt;

&lt;p&gt;Two examples to make this concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The embeddings for &lt;em&gt;"cat"&lt;/em&gt; and &lt;em&gt;"dog"&lt;/em&gt; sit close together. &lt;em&gt;"Helicopter"&lt;/em&gt; is far from both.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"How do I reset my password?"&lt;/em&gt; and &lt;em&gt;"I forgot my login"&lt;/em&gt; sit close together, even though they share no words.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A vector database stores those coordinates and answers queries by returning the entries closest to your input.&lt;/p&gt;

&lt;p&gt;Concrete use case: a company chatbot that answers questions from internal docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (one time):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Break the company wiki into chunks (paragraphs, sections, whatever fits).&lt;/li&gt;
&lt;li&gt;Run each chunk through an embedding model. Get back one vector per chunk.&lt;/li&gt;
&lt;li&gt;Store the chunks alongside their vectors in a vector database (Pinecone, Chroma, pgvector).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Query time (every user message):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user asks: &lt;em&gt;"How do I file expense reports?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;You embed the question with the same embedding model.&lt;/li&gt;
&lt;li&gt;The database returns the chunks whose vectors are closest to the question vector — say, the top 5.&lt;/li&gt;
&lt;li&gt;Those chunks get injected into the LLM prompt: &lt;em&gt;"Use this context to answer: [chunks]. Question: How do I file expense reports?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The LLM answers using the retrieved content.&lt;/li&gt;
&lt;/ol&gt;
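&lt;p&gt;The similarity search itself, in miniature. These hand-made 3-dimensional vectors stand in for real embeddings (which have hundreds to thousands of dimensions), and cosine similarity is one common closeness measure:&lt;/p&gt;

```python
import math

# Toy "vector database": document text mapped to tiny hand-made vectors.
DOCS = {
    "How do I reset my password?":  [0.9, 0.1, 0.0],
    "Expense report filing guide":  [0.1, 0.9, 0.1],
    "Office helicopter policy":     [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, top_k=1):
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:top_k]

# "I forgot my login" would embed near the password question:
print(nearest([0.8, 0.2, 0.1]))  # ['How do I reset my password?']
```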

&lt;p&gt;Common vector databases: Pinecone, Weaviate, Chroma, Milvus, pgvector (a Postgres extension).&lt;/p&gt;




&lt;h2&gt;
  
  
  To be continued
&lt;/h2&gt;

&lt;p&gt;Part 1 covers the foundations: ML, transformers, tokens, agents, memory, tools, MCP, RAG, vector databases.&lt;/p&gt;

&lt;p&gt;Part 2 will cover the applied side — agentic coding, global rules and CLAUDE.md, vibe coding vs context engineering, multi-agent systems, guardrails, observability, and CI/CD with agents.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>TechMag 1 - May 1</title>
      <dc:creator>Renato D. Prado</dc:creator>
      <pubDate>Tue, 12 May 2026 15:43:08 +0000</pubDate>
      <link>https://forem.com/renato_dprado_bbf0901ec/techmag-1-may-26-5anl</link>
      <guid>https://forem.com/renato_dprado_bbf0901ec/techmag-1-may-26-5anl</guid>
      <description>&lt;p&gt;&lt;em&gt;The work moved. The craft didn't.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A weekly read for devs, tech leads, and anyone keeping up with where AI is taking software engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try This Week
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How to keep up&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The New Senior Engineer
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Chris Parsons — chrismdp.com&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Less reviewing, more training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're still tied to your IDE, you're working a year behind. The 2026 toolchain recommendation: install Claude Code or Codex CLI, not Copilot. The senior engineer's job has shifted from writing code or reviewing diffs to training the AI to write better code.&lt;/p&gt;

&lt;p&gt;The central point: the harness around the model matters as much as the model itself. Leverage lives in standing instructions in AGENTS.md or CLAUDE.md and in skill files that teach the agent your conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How fast you can tell whether the output is right separates teams that ship from teams that stay stuck&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As Chris puts it: "Coding with AI is now the default. The question is whether you are doing it as a reviewer, a prompter, or a trainer. The trainer role compounds."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.chrismdp.com/coding-with-ai/" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Harness
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Henrique Bastos — blog post&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some teams keep getting better. Others run in place forever.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference is whether they've built a harness — something that turns repetitive work into a loop.&lt;/p&gt;

&lt;p&gt;Most processes are designed from scratch every time, with context living in someone's head. Each run costs the same as the first. &lt;strong&gt;In a loop, each run leaves things ready for the next. Starting again is almost free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI made this personal. When an agent produces garbage, the instinct is to fix the prompt. But that fix dies with the session. The real move is to fix the environment — add a test, tighten a boundary, write a rule. That fix sticks. Every future run inherits it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/pulse/making-sense-harness-engineering-henrique-bastos-ezotf/" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality &amp;amp; Craft
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How to think well in the AI era.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Below the Illusion of Progress
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Kent Beck — LinkedIn&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The new failure mode: when the code looks fine but isn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most teams get through problems in a messy way — without really solving them. The result is &lt;strong&gt;software that mostly works, but is hard to change. Not great, but it holds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's a worse state: when it becomes possible to &lt;strong&gt;claim things are working when they aren't&lt;/strong&gt;, which AI tools make easy. Complexity builds, and progress slows down.&lt;/p&gt;

&lt;p&gt;No clear solution yet for working with AI here. Better data, tests, prompting — all might help, but none is enough. The first step is awareness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/posts/kentbeck_most-teams-muddle-software-that-mostly-works-activity-7455790619098275840-uAW6" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principles Don't Change
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Robert C. Martin — X/Twitter&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI raises the level of abstraction — it doesn't change software engineering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Uncle Bob Martin has been on a roll lately, and his core message is consistent: &lt;strong&gt;AI doesn't change software engineering — it just raises the level of abstraction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"AI is just another step up the semantic expression ladder. We initially expressed our semantics in binary, then assembler, then Fortran, then C, then Java, then Python. AI is just the next step up that same old ladder."&lt;/p&gt;

&lt;p&gt;What we're losing is syntax — semicolons, braces — and good riddance. What stays is everything that actually matters: design, architecture, formalism, behavioral and structural semantics. Objects still matter (the AI is literally generating them). Principles from decades ago still apply.&lt;/p&gt;

&lt;p&gt;His warning to engineers: a few disciplines and tools aren't enough. You still need a mental model of what the AI is doing, the engineering insight to correct it, and the instinct to form suspicions and verify them — without falling back on exhaustive code reviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/unclebobmartin" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creation Cost Approaching Zero
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Kent Beck — LinkedIn&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If anyone can clone your software with an AI, what happens to scale?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Software used to work like this: one person spends a long time building something great, and then millions of people use that same thing. The work pays off because it's spread across a huge number of users.&lt;/p&gt;

&lt;p&gt;AI may break this. Maybe people will be able to ask an AI to clone your software. &lt;strong&gt;So instead of everyone using your version, you get multiple near-identical copies floating around — and you never reach the scale you used to.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But here's the twist: building software is also cheaper now. You didn't spend as much time on it either. So maybe it doesn't matter that you have fewer users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/posts/kentbeck_ever-since-data-communication-became-free-activity-7453970638148370432-4wW4" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;On the radar this week&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy Code Isn't Going Anywhere
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Vikas Pujar — LinkedIn&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even when AI nails the headline task, the work around it doesn't disappear.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vikas Pujar asked Claude how long it would actually take to convert the world's 800 billion lines of COBOL into Java, now that Anthropic claims AI can handle the translation. &lt;strong&gt;The answer: in the best case, 844 years.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The interesting part wasn't the timeline — it was the breakdown. Code translation is only about 20% of the effort. The other 80% is parallel testing, business approvals, regulatory compliance, and deployment — work where AI moves the needle far less.&lt;/p&gt;

&lt;p&gt;A useful reminder that even when AI is genuinely good at the headline task, the work around it doesn't disappear. It seems there's still a long career waiting for anyone willing to do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/posts/vikaspujar_mainframehumor-activity-7441857962270228481-77r_" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured-Prompt-Driven Development
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Wei Zhang — martinfowler.com&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Treating prompts as code — but is it too heavy for most teams?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thoughtworks recently published Structured Prompt-Driven Development (SPDD) on Martin Fowler's site — a method that treats prompts as first-class artifacts: version-controlled, reviewed, and reused. &lt;strong&gt;The core rule: when reality diverges, fix the prompt first, then update the code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's a serious framework, with its own seven-part canvas (REASONS) and a CLI tool (openspdd) to run the workflow.&lt;/p&gt;

&lt;p&gt;But here's the thing: a lot of ways to work with AI are emerging right now, and many of them come from the AI tools themselves — CLAUDE.md, skill files, AGENTS.md. These are lightweight, already integrated into the tools developers actually use, and easy to adopt one piece at a time. Heavyweight methodologies like SPDD might be the right fit for big consultancies and regulated environments, but will most teams actually adopt it — or will the simpler conventions baked into Claude Code or Codex win on adoption alone?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://martinfowler.com/articles/structured-prompt-driven/" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Opted Everyone Into Training
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Gergely Orosz — LinkedIn&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your private code is training Microsoft's AI — unless you opt out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Gergely Orosz raised the alarm: &lt;strong&gt;GitHub quietly opted every user — paying customers included — into letting their private code train Microsoft's AI models.&lt;/strong&gt; Pro subscribers, Copilot subscribers, all of it. The opt-out lives under Settings → Privacy, and Orosz's point is sharp: a true platform for code wouldn't do this.&lt;/p&gt;

&lt;p&gt;Worth checking your settings if you haven't already.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/gergelyorosz/" rel="noopener noreferrer"&gt;Read more...&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>career</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
