<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Austin Starks</title>
    <description>The latest articles on Forem by Austin Starks (@austin_starks).</description>
    <link>https://forem.com/austin_starks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868224%2Ff44fdedf-9d03-4516-a586-41994beb8483.png</url>
      <title>Forem: Austin Starks</title>
      <link>https://forem.com/austin_starks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/austin_starks"/>
    <language>en</language>
    <item>
      <title>Your AI trading bot will fail because it's optimizing the wrong thing. Here's how to fix it.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:06:40 +0000</pubDate>
      <link>https://forem.com/austin_starks/your-ai-trading-bot-will-fail-because-its-optimizing-the-wrong-thing-heres-how-to-fix-it-p3k</link>
      <guid>https://forem.com/austin_starks/your-ai-trading-bot-will-fail-because-its-optimizing-the-wrong-thing-heres-how-to-fix-it-p3k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note from the author:&lt;/strong&gt; You're reading a Dev.to adaptation. The original on &lt;a href="https://nexustrade.io/blog/ai-trading-bot-optimization-trap-evaluation-loop-20260413" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;You built the agent. You gave it tools. You hooked up the memory. You ran it overnight on five years of historical data and it produced a 140% annual return.&lt;/p&gt;

&lt;p&gt;You deploy it with $25,000 on Monday. By Friday you've lost 30%.&lt;/p&gt;

&lt;p&gt;This is not a bug in your code. This is overfitting. Your agent didn't find a durable market edge. It memorized the historical data and learned to exploit noise that doesn't exist in live markets. The AI succeeded at the goal you gave it. The goal was wrong.&lt;/p&gt;

&lt;p&gt;Evaluation is the part of agent development that nobody talks about, because most people building AI demos have never run a system long enough to see it fail. This article covers the engineering that prevents that failure: traces, LLM judges, and a feedback loop that makes agents actually improve over time.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/BSTYu6My6mY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Problem&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The optimization trap: when the agent succeeds at the wrong thing.
&lt;/h2&gt;

&lt;p&gt;Every machine learning practitioner knows about overfitting. You train a model on historical data. It learns the data perfectly, including all the noise and anomalies specific to that dataset. When you expose it to new data, it falls apart because what it learned wasn't a real pattern.&lt;/p&gt;

&lt;p&gt;AI agents have the same problem, and it's harder to catch. When you tell an agent to "build a trading strategy with the highest possible backtest return," it will do exactly that. It will explore every combination of indicators, time windows, and position sizes until it finds something that maximizes the metric you asked for.&lt;/p&gt;

&lt;p&gt;The result looks impressive: a 126% return in 2024. You deploy it with $25,000 on Monday. By Friday you've lost 30%. Run the same strategy through the 2022 bear market and it is destroyed completely, either because the agent never tested 2022, or because the evaluation criteria didn't penalize drawdown, so the agent ignored it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The optimization trap is an evaluation failure, not a model failure.&lt;/strong&gt; The model did what you asked. You asked for the wrong thing.&lt;/p&gt;

&lt;p&gt;The fix is not a better model. It's a better evaluator: one that grades the agent on what actually matters, penalizes single-year outlier returns, requires multi-regime evidence, and gets stricter every round.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Wrong objective&lt;/th&gt;
&lt;th&gt;Right objective&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Highest backtest return&lt;/td&gt;
&lt;td&gt;Consistent returns across regimes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-year Sharpe ratio&lt;/td&gt;
&lt;td&gt;Positive 2022 bear market performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win rate on in-sample data&lt;/td&gt;
&lt;td&gt;Max drawdown below 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Score goes up each round&lt;/td&gt;
&lt;td&gt;Multi-year evidence, not one outlier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I know this because I ran the experiment. Five rounds of hill climbing on a live trading agent. $676 spent. The first round scored 71 and produced an Iron Condor with a 54% average annual return. By Round 5, the score had dropped to 27 and the agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Observability&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The flight recorder: what a trace actually is.
&lt;/h2&gt;

&lt;p&gt;When a traditional app crashes, you read the stack trace. When an agent fails on iteration 7 of a 12-step ReAct loop, you're guessing — unless you have a trace.&lt;/p&gt;

&lt;p&gt;A trace is a structured log of every step in the agent's execution. Every input the model saw, every decision it made, every tool it called, every result it got back, every token it spent, every millisecond it waited. If something goes wrong at 3 AM, you can reconstruct exactly what happened.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📊 &lt;strong&gt;Interactive trace viewer&lt;/strong&gt; — &lt;a href="https://nexustrade.io/blog/ai-trading-bot-optimization-trap-evaluation-loop-20260413" rel="noopener noreferrer"&gt;view on NexusTrade&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without traces, a failed agent run is a black box. You see the final answer (or the error), and you guess at what went wrong. With traces, you can pinpoint the exact iteration where the model made a bad assumption, called the wrong tool, or misread a result.&lt;/p&gt;
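&lt;p&gt;As a rough sketch, capturing one trace entry per ReAct iteration can look like this. The field names here are illustrative, not NexusTrade's actual schema:&lt;/p&gt;

```javascript
// One trace entry per ReAct iteration. Field names are illustrative,
// not NexusTrade's actual schema.
function recordStep(trace, step) {
  trace.push({
    iteration: trace.length + 1,
    thought: step.thought,       // the model's reasoning text
    toolCall: step.toolCall,     // e.g. { name: "backtest", args: { year: 2022 } }
    toolResult: step.toolResult, // the raw result returned to the model
    tokens: step.tokens,         // tokens spent on this iteration
    latencyMs: step.latencyMs,   // wall-clock wait for model plus tool
    timestamp: new Date().toISOString(),
  });
}

// A failed run can then be replayed step by step instead of guessed at.
const trace = [];
recordStep(trace, {
  thought: "Backtest the strategy over 2022 to check the bear market.",
  toolCall: { name: "backtest", args: { year: 2022 } },
  toolResult: { annualReturn: -0.12 },
  tokens: 1840,
  latencyMs: 2300,
});
```

&lt;p&gt;The exact fields matter less than the discipline: if it isn't in the trace, you can't debug it and the judge can't grade it.&lt;/p&gt;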




&lt;p&gt;&lt;em&gt;Evaluation&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to grade an agent: algorithmic metrics and LLM judges.
&lt;/h2&gt;

&lt;p&gt;Not all evaluation is the same. Some things you can measure with code. Some things you need a second AI to grade.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Algorithmic&lt;/td&gt;
&lt;td&gt;Objective, countable things. No model needed.&lt;/td&gt;
&lt;td&gt;Total cost, iteration count, latency, Sharpe ratio, max drawdown, whether a strategy was deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Judge&lt;/td&gt;
&lt;td&gt;Subjective quality that requires reasoning.&lt;/td&gt;
&lt;td&gt;Did it explain the strategy logic clearly? Did it test enough different structures? Is the recommendation realistic?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For trading agents specifically, overfitting is the failure mode that pure algorithmic metrics miss. A high backtest return is an objective number. But whether that return is trustworthy — whether it comes from a durable edge or from memorized noise — requires judgment. That's where the LLM judge comes in.&lt;/p&gt;
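&lt;p&gt;In code, the split looks something like this. This is a sketch: &lt;code&gt;callJudge&lt;/code&gt; is a placeholder for whatever model API you use, and the metric field names are assumptions, not an actual evaluator interface:&lt;/p&gt;

```javascript
// Algorithmic metrics: deterministic, free, computed in plain code.
function algorithmicMetrics(trace, backtest) {
  return {
    totalCostUsd: trace.reduce((sum, step) => sum + step.costUsd, 0),
    iterations: trace.length,
    maxDrawdown: backtest.maxDrawdown,
    sharpe: backtest.sharpe,
  };
}

// LLM judge: reserved for subjective dimensions, since every call costs tokens.
async function evaluateRun(trace, backtest, callJudge) {
  const metrics = algorithmicMetrics(trace, backtest);
  const judged = await callJudge({ trace, metrics }); // e.g. clarity, coverage
  return { ...metrics, ...judged };
}
```

&lt;p&gt;The design choice to note: anything countable goes through the algorithmic path first, and the judge receives those numbers as evidence rather than re-deriving them.&lt;/p&gt;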

&lt;p&gt;Here's the central question embedded in the NexusTrade Agent Run Evaluator's system prompt (Gemini 3 Pro, temp 0):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The central question you must answer is: &lt;strong&gt;If the user deployed the recommended strategy on Monday with their $25,000 live account, how confident are we that it will achieve 100% annual return?&lt;/strong&gt; Everything else is secondary."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hard caps are the anti-overfitting mechanism:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Score cap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Below 30%/yr&lt;/td&gt;
&lt;td&gt;deployedStrategyFitness cannot exceed 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30–59%/yr&lt;/td&gt;
&lt;td&gt;Cannot exceed 6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60–89%/yr&lt;/td&gt;
&lt;td&gt;Cannot exceed 8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90%+/yr, survived bear, drawdown below 50%&lt;/td&gt;
&lt;td&gt;Full range available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nothing deployed&lt;/td&gt;
&lt;td&gt;Capped at 2, regardless of exploration quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An agent that found 126% returns in 2024 but only tested one year cannot score above 4, because single-year outlier performance is exactly what overfitting looks like. The evaluator enforces multi-year evidence as a precondition for a high score.&lt;/p&gt;
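&lt;p&gt;The caps are simple enough to express directly. Here's a sketch of that logic, using the thresholds from the table above; the function name and the handling of unlisted edge cases are mine, not the evaluator's actual code:&lt;/p&gt;

```javascript
// Hard caps on deployedStrategyFitness (1-10). Thresholds follow the
// article's table; edge-case handling is an assumption.
function fitnessCap(run) {
  if (!run.deployed) return 2;          // nothing deployed: capped at 2
  if (run.yearsTested < 2) return 4;    // single-year outlier = overfitting signature
  const r = run.avgAnnualReturnPct;
  if (r >= 90) {
    // 90%+ only unlocks the full range with bear-market survival
    // and a max drawdown under the 50% threshold.
    return (run.survivedBear && run.maxDrawdownPct < 50) ? 10 : 8;
  }
  if (r >= 60) return 8;
  if (r >= 30) return 6;
  return 3;
}
```

&lt;p&gt;Giving the judge these caps as non-negotiable rules, rather than hoping it intuits them, is what keeps a 126%-in-one-year strategy from ever scoring well.&lt;/p&gt;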

&lt;p&gt;Here's what the evaluator looks like running against a real Aurora agent run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmthnhhvp4crm1fz1ap9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmthnhhvp4crm1fz1ap9k.png" alt="NexusTrade Agent Run Evaluator output showing dimension scores and nextIteration note" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the structured JSON it returned for Round 1 of the hill climbing experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deployed a robust Iron Condor across 4 regimes. Consistent positive years including 2022 bear. Return profile (54% avg) won't reach 100% goal."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deployedStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Always-On Iron Condors (SPY/QQQ)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deployedStrategyAvgReturn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"+54.34% avg (2022: +31.2%, 2024: +72.1%)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deploymentVerdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iterate_first"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deployedStrategyFitness"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evidenceStrength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"explorationCoverage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"riskRealism"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overallScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"good"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failures"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"59% max drawdown exceeds safe threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"54% avg won't reach 100% annual goal"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nextIteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Push for higher return while maintaining the 2022 floor."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;nextIteration&lt;/code&gt; field is what makes the loop work. It becomes the seed for the next agent run. The evaluator writes the coach's notes.&lt;/p&gt;

&lt;p&gt;You call it from anywhere via MCP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;run_agent_run_evaluator&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;69d49c51d06eee7b51cf5f68&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-3-pro-preview&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// Returns: scores, verdict, nextIteration, deploymentVerdict&lt;/span&gt;
&lt;span class="c1"&gt;// Inject nextIteration into the next run's context.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Going Deeper&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond hill climbing: four approaches to agent optimization.
&lt;/h2&gt;

&lt;p&gt;Once you have traces and an evaluator, the natural next step is to close the loop: run the agent, score the output, use that score to improve the next run. The simplest version of this is hill climbing — run, grade, seed the next run with the feedback.&lt;/p&gt;

&lt;p&gt;But hill climbing is a local search. It follows the gradient of whatever metric you give it. Point it at the wrong objective and it will confidently optimize you into a cliff. Here's the full landscape:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01. Hill Climbing&lt;/strong&gt; — Run → grade → seed the next run with the feedback. Simple, cheap, and effective for small prompt spaces. Gets stuck in local maxima when the feedback signal points in the wrong direction. Use it as a baseline before investing in more sophisticated approaches — but watch your rubric carefully.&lt;/p&gt;
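&lt;p&gt;The run → grade → seed cycle fits in a few lines. A minimal sketch, assuming &lt;code&gt;runAgent&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; are your own infrastructure:&lt;/p&gt;

```javascript
// Minimal hill-climbing loop: run, grade, seed the next run with the
// evaluator's feedback. runAgent and evaluate are placeholders.
async function hillClimb(runAgent, evaluate, rounds) {
  let seed = null;                          // nextIteration note from the prior round
  let best = { score: -Infinity, run: null };
  for (let i = 0; i < rounds; i++) {
    const run = await runAgent(seed);
    const verdict = await evaluate(run);
    if (verdict.overallScore > best.score) {
      best = { score: verdict.overallScore, run };
    }
    seed = verdict.nextIteration;           // the coach's notes for the next round
  }
  return best;                              // keep the best round, not the last one
}
```

&lt;p&gt;Returning the best-scoring round rather than the latest one is deliberate: as the experiment below shows, five rounds of "improvement" can end worse than round one.&lt;/p&gt;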

&lt;p&gt;&lt;strong&gt;02. NSGA-II Multi-Objective Optimization&lt;/strong&gt; — When you have competing objectives (high return AND low drawdown), hill climbing optimizes one at the expense of the other. NSGA-II produces a Pareto front instead: every efficient tradeoff simultaneously, so you pick from real data rather than assumptions. The &lt;a href="https://arxiv.org/abs/2510.04135" rel="noopener noreferrer"&gt;GA4GC paper&lt;/a&gt; achieved 37.7% runtime reduction while improving code quality, with 135x hypervolume improvement over defaults.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03. Meta-Harness (Stanford)&lt;/strong&gt; — Most prompt optimization focuses on what you say to the model. Meta-Harness optimizes what information the model sees: which context to store, what to retrieve, how to structure inputs. An agentic proposer reads up to 10 million tokens of diagnostic context per iteration and rewrites the harness itself. The &lt;a href="https://arxiv.org/html/2603.28052v1" rel="noopener noreferrer"&gt;Stanford paper&lt;/a&gt; shows a 7.7-point improvement over state-of-the-art while using 4x fewer tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04. AutoResearch-RL&lt;/strong&gt; — An RL agent that proposes code modifications to the training script, executes them under fixed time budgets, observes validation metrics, and updates its policy via PPO. No human in the loop. A related study analyzing 10,469 experiments found architectural choices explain 94% of performance variance. The &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch framework&lt;/a&gt; demonstrates a 2.4x boost in experiment throughput by aborting poor-performing runs early.&lt;/p&gt;

&lt;p&gt;For most teams: start with hill climbing. Run it until it stalls. If you hit a genuine multi-objective tradeoff, consider NSGA-II — but only with automated infrastructure and a clear budget. The rubric is still load-bearing in all four cases.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Real-World Evidence&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I ran this loop five times. The first round was still the best.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;War story — $676 spent:&lt;/strong&gt; Five complete agent runs. Full evaluator traces for each round. Round 1 scored 71 — a robust Iron Condor on SPY/QQQ averaging 54% across all regimes including 2022. By Round 5, score was 27. The agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline. The &lt;code&gt;nextIteration&lt;/code&gt; note from Round 1 said "push for higher return while maintaining the 2022 floor." The agent pushed for higher return. It forgot the floor. Each round the evaluator rewarded the attempt at higher returns and the agent drifted further from the Iron Condor structure that actually worked. By Round 5, it had completely abandoned a 54%-average strategy in search of 100%, and found a disaster instead. The rubric didn't penalize abandoning what worked. So the agent abandoned it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/blog/hill-climbing-automated-agent-optimization-20260406" rel="noopener noreferrer"&gt;Read the full hill climbing experiment&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Production Architecture&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The feedback loop: how evaluation connects to memory.
&lt;/h2&gt;

&lt;p&gt;Evaluation doesn't mean much if the agent can't learn from it. The feedback loop is what turns a one-time grade into compounding improvement.&lt;/p&gt;

&lt;p&gt;The full pipeline in five steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Agent Run:&lt;/strong&gt; The agent receives its task, reads injected memory from past runs, and works through the ReAct loop. Every decision, every tool call, every result is captured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Trace Captured:&lt;/strong&gt; Every iteration is logged: the model's thought, the tool it called, the result, token cost, and latency. The trace is stored and ready to be read by the judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — LLM Judge Scores:&lt;/strong&gt; The evaluator reads the full trace and returns a structured verdict: scores on 4 dimensions, an overall score, a deployment verdict, and a &lt;code&gt;nextIteration&lt;/code&gt; note.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 — Written to Memory:&lt;/strong&gt; The score, verdict, and lessons are written into an &lt;code&gt;AgentSummary&lt;/code&gt; document in MongoDB. The next run retrieves this via the memory system from Module 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 — Next Run Improves:&lt;/strong&gt; Before the next run starts, the memory system retrieves matching &lt;code&gt;AgentSummary&lt;/code&gt; records and injects them into the planner. Here's what that injection looks like at the start of Round 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Previous Run Context&lt;/span&gt;
Score: 71/100 · Verdict: good

Lessons learned:
&lt;span class="p"&gt;-&lt;/span&gt; Iron Condors on SPY/QQQ survived the 2022 bear market
&lt;span class="p"&gt;-&lt;/span&gt; Multi-year average 54.3% — consistent across all regimes
&lt;span class="p"&gt;-&lt;/span&gt; Max drawdown 59% exceeded the 50% threshold

Next iteration focus:
  Enhance return profile while holding the 2022 floor.
  Do not abandon the Iron Condor structure.
&lt;span class="p"&gt;
---&lt;/span&gt;
&lt;span class="gu"&gt;## Your Task&lt;/span&gt;
Build an options strategy for a $25,000 account...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run → Trace → Judge → Memory → Improve → repeat. Each loop produces a labeled training example: what the agent did, how it scored, and what to do differently next time.&lt;/p&gt;
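&lt;p&gt;Steps 4 and 5 can be sketched in a few lines. An in-memory array stands in for the MongoDB collection here, and the field names follow the Round 2 injection format shown above; none of this is the production &lt;code&gt;AgentSummary&lt;/code&gt; schema:&lt;/p&gt;

```javascript
// Stand-in for the AgentSummary collection (MongoDB in production).
const summaries = [];

// Step 4: persist the evaluator's verdict so the next run can read it.
function writeSummary(verdict) {
  summaries.push({
    score: verdict.overallScore,
    verdict: verdict.verdict,
    lessons: verdict.failures,            // what went wrong this round
    nextIteration: verdict.nextIteration, // the seed for the next run
    createdAt: Date.now(),
  });
}

// Step 5: inject the latest summary ahead of the task prompt.
function injectMemory(task) {
  if (summaries.length === 0) return task; // cold start: no memory yet
  const last = summaries[summaries.length - 1];
  return [
    "## Previous Run Context",
    `Score: ${last.score}/100 · Verdict: ${last.verdict}`,
    "Lessons learned:",
    ...last.lessons.map((l) => `- ${l}`),
    `Next iteration focus: ${last.nextIteration}`,
    "---",
    "## Your Task",
    task,
  ].join("\n");
}
```

&lt;p&gt;The important property is that the injected block is assembled from stored, graded history, not from whatever the model happens to recall.&lt;/p&gt;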

&lt;p&gt;A demo runs once. A system runs, gets graded, and improves — or degrades, depending on what you told it to optimize for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Connect&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Expose your agent to the world: the NexusTrade MCP server.
&lt;/h2&gt;

&lt;p&gt;Everything in this article — traces, LLM judges, feedback loops — is built on Aurora's infrastructure and accessible via MCP from Claude Desktop or Cursor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nexustrade"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://nexustrade.io/api/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer &amp;lt;your-api-key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From Claude Desktop or Cursor, you can ask: &lt;em&gt;"Use NexusTrade to screen for stocks with RSI below 40 trading above their 200-day moving average."&lt;/em&gt; Claude calls &lt;code&gt;screen_stocks&lt;/code&gt; on the NexusTrade MCP server. The server returns live results from the same screener Aurora uses internally. You can also run the evaluator directly — &lt;code&gt;run_agent_run_evaluator(agent_id)&lt;/code&gt; grades any completed agent run from any client.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Check Your Understanding&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pop quiz.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is a trace in the context of AI agents?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; A trace is a structured log of every step the agent took: inputs, outputs, tool calls, results, costs, and timing. Think of it like a flight recorder — you can reconstruct exactly what happened, step by step, to debug issues or evaluate performance. Without it, a failed run is a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; An agent produces a 126% backtest return in 2024, but you only have one year of data. The evaluator gives it a high score. What's wrong?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Single-year outlier returns are the textbook signature of overfitting. The agent may have memorized 2024-specific anomalies that won't repeat in live trading. A trustworthy evaluator caps the score based on multi-year average return and requires evidence across at least two distinct market regimes — including a bear market. One year of 126% is not evidence of a durable edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; When would you use an LLM judge instead of an algorithmic evaluator?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; When you need to evaluate subjective criteria that are hard to measure with code — did the agent explain its reasoning clearly? Did it test fundamentally different strategy types? Objective metrics like cost, iteration count, and Sharpe ratio should use algorithmic evaluation. Code is faster, cheaper, and deterministic for anything you can count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; True or false: the evaluation feedback loop requires both evaluation AND memory to work. Having only one of the two is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; True. Without evaluation, the agent has no signal for what "better" means. Without memory, the agent can't retain the lessons it learned — every run starts from scratch. Evaluation produces the signal. Memory carries it forward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The End&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You've built the complete agent. This is where most people stop. It's where you start.
&lt;/h2&gt;

&lt;p&gt;Five articles. Five modules. You started with a leaked source file and ended with a production evaluation loop — router, ReAct engine, long-term memory, rubric design, feedback loop. The only thing left is to run it.&lt;/p&gt;

&lt;p&gt;The $676 hill climbing experiment is a better argument for evaluation than anything else in this article. The agent wasn't broken. The rubric was. Build the right rubric and the loop compounds toward something real. Build the wrong one and a perfectly obedient agent will follow it off a cliff, five rounds in a row, at $135 per run.&lt;/p&gt;

&lt;p&gt;Before the capstone, here's the full series in five minutes:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/F1fW7Hns5Vc"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Module 6 is where the capstone lives. Aurora runs inline: screens the market, validates with news, builds a watchlist, then wires up a scheduled agent to manage it every week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n34qs3guc6gidmwp5rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n34qs3guc6gidmwp5rd.png" alt="AI Agents from Scratch capstone lesson — Aurora embedded inline, running the full research pipeline" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch/lesson/ai-m6-activity-capstone" rel="noopener noreferrer"&gt;Run the Capstone — free, no credit card&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/agent" rel="noopener noreferrer"&gt;Or open Aurora directly&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 5 of 5 in the &lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch" rel="noopener noreferrer"&gt;AI Agents from Scratch&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Try NexusTrade's AI trading agent free: &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;https://nexustrade.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Cursor beats Claude Code. Here's the memory architecture that proves it.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:06:35 +0000</pubDate>
      <link>https://forem.com/austin_starks/cursor-beats-claude-code-heres-the-memory-architecture-that-proves-it-e1</link>
      <guid>https://forem.com/austin_starks/cursor-beats-claude-code-heres-the-memory-architecture-that-proves-it-e1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note from the author:&lt;/strong&gt; You're reading a Dev.to adaptation. The original on &lt;a href="https://nexustrade.io/blog/cursor-vs-claude-code-memory-architecture-20260413" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you are a serious AI practitioner, you know that Cursor is better than Claude Code.&lt;/p&gt;

&lt;p&gt;The surveys disagree for now, but they won't for long. &lt;a href="https://newsletter.pragmaticengineer.com/p/ai-tooling-2026" rel="noopener noreferrer"&gt;The Pragmatic Engineer's March 2026 survey&lt;/a&gt; found Claude Code has 46% developer love vs Cursor's 19%. Among the 55% of developers who regularly use AI agents, Claude Code is the clear leader at 71% usage.&lt;/p&gt;

&lt;p&gt;Claude Code has real advantages. The Max plan gives you parallel agents with high usage limits. It's token-efficient, using 50-75% fewer tokens than older Anthropic models. Its SWE-bench scores are genuinely better.&lt;/p&gt;

&lt;p&gt;But I use Cursor. The UX is in a different league. Composer 2 is fast, maintains coherence across hundreds of actions, plans across files before writing a single line. It lets me switch models per task. And most importantly, it does a better job managing memory. Cursor's memory is your codebase, and your codebase doesn't lie.&lt;/p&gt;

&lt;p&gt;That last point is the one most people argue about without understanding the engineering underneath. This article is about the engineering underneath.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem every agent builder hits
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/3MdotFRr_NY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The base model has no memory. You knew that. But the implications for agents are worse than they sound.&lt;/p&gt;

&lt;p&gt;Every time you start a new session, the agent starts from zero. It doesn't know what it built last week. It doesn't know which strategies it already tested. It doesn't know that you told it three sessions ago to always use 3-year backtests instead of 1-year. It has to rediscover everything.&lt;/p&gt;

&lt;p&gt;In practice, that means the agent wastes your iterations on exploration that already happened. When Aurora launched without memory, every single run took 20+ iterations just to reach a half-decent strategy. The agent was smart. It just couldn't remember. Every run was a cold start.&lt;/p&gt;

&lt;p&gt;Memory is the engineering layer that fixes this. The question is how.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Why does an AI agent forget everything between sessions, even if it performed well the last time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Large language models are stateless by design. They process the tokens in the current context window and produce output. When the session ends, nothing persists inside the model. Anything the agent "learned" during the run only existed in the conversation context, which is gone. Memory is an external system built on top of the stateless model, not something the model itself provides.&lt;/p&gt;
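&lt;p&gt;A minimal sketch of that statelessness. &lt;code&gt;call_model&lt;/code&gt; is a hypothetical stand-in for any LLM API: it sees only the messages you pass in and keeps no state between calls, so "memory" is just re-injecting persisted context.&lt;/p&gt;

```python
# Hypothetical stand-in for an LLM API call: the model only "knows"
# the messages it receives in this one call. Nothing persists.
def call_model(messages):
    return f"I can see {len(messages)} message(s). Nothing else exists for me."

# Session 1: the user states a preference.
session_1 = [{"role": "user", "content": "Always use 3-year backtests."}]
print(call_model(session_1))

# Session 2: a fresh call. The instruction from session 1 is gone
# unless an external memory layer re-injects it.
session_2 = [{"role": "user", "content": "Backtest AAPL."}]
print(call_model(session_2))

# External memory = persist the old context, then prepend it.
memory = session_1  # in a real system, saved to disk or a DB between sessions
print(call_model(memory + session_2))
```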




&lt;h2&gt;
  
  
  Four ways to give an agent memory. Only one scales.
&lt;/h2&gt;

&lt;p&gt;There's a spectrum of approaches, from naive to production-grade. Most tutorials only cover the first two. Here are all four, including what Cursor and Claude Code actually do.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;th&gt;Who uses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;In-context (file dump)&lt;/td&gt;
&lt;td&gt;Copy the entire conversation into a file. Prepend it to the next session's prompt.&lt;/td&gt;
&lt;td&gt;Toy projects, short sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-maintained notes (Dream Mode)&lt;/td&gt;
&lt;td&gt;Model reads the session and writes structured notes to organized markdown files. Consolidates during idle time.&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector / RAG&lt;/td&gt;
&lt;td&gt;Documents converted to embeddings. Semantic similarity search retrieves the most relevant chunks.&lt;/td&gt;
&lt;td&gt;Cursor (codebase), many production apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured DB + typed queries&lt;/td&gt;
&lt;td&gt;Typed memory records with domain fields. LLM generates DB query fields. Retrieval is targeted, not fuzzy.&lt;/td&gt;
&lt;td&gt;NexusTrade (Aurora)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
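&lt;p&gt;The first row of the table is worth seeing in code, because its failure mode is the whole reason the other three approaches exist. A toy sketch of the in-context file dump, assuming a hypothetical transcript file: it works until the transcript outgrows the context budget, and then you truncate blindly.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Hypothetical transcript file for the demo.
HISTORY = Path(tempfile.gettempdir()) / "agent_history.md"
HISTORY.write_text("")  # start fresh

def save_session(transcript: str) -> None:
    # Append the raw conversation; no consolidation, no dedup.
    with HISTORY.open("a") as f:
        f.write(transcript + "\n")

def build_prompt(new_question: str, context_budget_chars: int = 8000) -> str:
    history = HISTORY.read_text()
    # The failure mode: once history exceeds the budget, you truncate
    # blindly, and whatever falls outside the window is simply gone.
    history = history[-context_budget_chars:]
    return history + "\nUser: " + new_question

save_session("User: always use 3-year backtests.\nAgent: noted.")
print(build_prompt("Backtest NVDA."))
```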




&lt;h2&gt;
  
  
  How Claude Code actually handles memory: no vector database
&lt;/h2&gt;

&lt;p&gt;When Anthropic accidentally shipped Claude Code's source code in March 2026, the most surprising finding wasn't the hidden virtual pet system or the "undercover mode." It was the memory architecture.&lt;/p&gt;

&lt;p&gt;It doesn't use RAG or Pinecone. It's plain markdown files in a directory with a 25KB index cap. Anthropic invested in the maintenance loop instead of the storage layer.&lt;/p&gt;

&lt;p&gt;The interesting part is &lt;strong&gt;Dream Mode&lt;/strong&gt;: a 4-phase consolidation loop that runs during idle time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Orient:&lt;/strong&gt; Reads the &lt;code&gt;ENTRYPOINT.md&lt;/code&gt; index to understand what's already stored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Memory Index&lt;/span&gt;

&lt;span class="gu"&gt;## user-preferences.md&lt;/span&gt;
Prefers TypeScript. Strict mode always.
Dislikes verbose comments.

&lt;span class="gu"&gt;## project-context.md&lt;/span&gt;
Working on NexusTrade agent memory.
Stack: Node, MongoDB, Redis.

&lt;span class="gu"&gt;## corrections.md&lt;/span&gt;
2026-03-14: No any-type in TS.
2026-04-01: Short paragraphs preferred.

[index: 18.4 KB / 25 KB cap]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 2 — Gather:&lt;/strong&gt; Greps transcripts narrowly. The prompt is explicit: don't read everything. Look only for things you already suspect matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"prefer|always|never|told"&lt;/span&gt; ~/.claude/logs/2026-04-12.md

→ line 47:  &lt;span class="s2"&gt;"I prefer snake_case for vars"&lt;/span&gt;
→ line 203: &lt;span class="s2"&gt;"never use console.log in prod"&lt;/span&gt;
→ line 891: &lt;span class="s2"&gt;"always add JSDoc to exports"&lt;/span&gt;

&lt;span class="c"&gt;# Full transcript: ignored.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 3 — Consolidate:&lt;/strong&gt; Merges new findings into existing notes. Converts relative dates to absolute. Deletes contradictions when new information overrides old.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE:
  "Yesterday: prefers short paragraphs"
  "Told to avoid console.log"
  "Prefers verbose comments"  ← stale

AFTER:
  "2026-04-11: short paragraphs"
  "never use console.log in prod"
# "verbose comments" deleted — contradicted by newer entry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 4 — Prune:&lt;/strong&gt; Enforces the 25KB cap. Removes stale pointers and compresses low-priority entries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;measure(ENTRYPOINT.md)  → 27.2 KB  (over cap)
remove_stale_pointers() → 25.8 KB
compress_entries(priority="low", n=3) → 24.1 KB  ✓
# Loop complete. Next run: next idle window.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt literally says: "Don't exhaustively read transcripts. Look only for things you already suspect matter."&lt;/p&gt;

&lt;p&gt;LLMs are already good at reading and writing text. The hard part of memory isn't storage. It's maintenance: keeping notes accurate, consolidated, and bounded. That's what Dream Mode solves, and it works.&lt;/p&gt;
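&lt;p&gt;The four phases above can be condensed into one sketch. Notes live in plain dicts here instead of markdown files, and the merge and prune policies are illustrative, not Anthropic's actual code; the point is the maintenance loop: newer facts override older ones, and a hard size cap forces pruning.&lt;/p&gt;

```python
CAP_BYTES = 25 * 1024  # the 25KB index cap from the article

def consolidate(notes: dict, findings: dict) -> dict:
    # Phase 3: newer entries override older, contradictory ones
    # because dict.update keeps the later value for a shared key.
    merged = dict(notes)
    merged.update(findings)
    return merged

def prune(notes: dict, cap: int = CAP_BYTES) -> dict:
    # Phase 4: enforce the cap by dropping the longest entries first
    # (a crude stand-in for "compress low-priority entries").
    def size(entries):
        return sum(len(k) + len(v) for k, v in entries)
    entries = sorted(notes.items(), key=lambda kv: len(kv[1]))
    while entries and size(entries) > cap:
        entries.pop()  # drop the longest remaining entry
    return dict(entries)

notes = {"comments": "Prefers verbose comments"}         # stale
findings = {"comments": "2026-04-11: short paragraphs"}  # newer wins
notes = prune(consolidate(notes, findings))
print(notes["comments"])  # the stale preference is gone
```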

&lt;p&gt;There's a deeper limitation too. Dream Mode memory is an LLM writing notes about what it thinks happened. Those notes can drift. They can miss things. And as you'll see, they can be poisoned when the underlying data is wrong.&lt;/p&gt;

&lt;p&gt;Cursor's memory is your codebase. It doesn't interpret. It doesn't summarize. It indexes what's actually there. For software development, that's the more trustworthy foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Cursor handles memory: a vector index of your codebase
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/gredF4J7OYA"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Cursor solves a different problem than Claude Code. It's not trying to remember what you told it last week. It's trying to navigate a codebase it has never seen, at a scale where reading every file would be prohibitively expensive.&lt;/p&gt;

&lt;p&gt;The architecture: Cursor uses tree-sitter to parse code into Abstract Syntax Trees, creating semantic chunks (classes, methods, functions). Those chunks get converted to vector embeddings and stored in Turbopuffer, a specialized vector database. Change detection uses Merkle trees to identify exactly which files changed without re-indexing the whole codebase.&lt;/p&gt;

&lt;p&gt;If you haven't worked with vector embeddings before: an embedding model converts any piece of content into a list of numbers that captures its meaning. Things that mean similar things end up close together in that space. At query time, your question gets embedded the same way, and the database returns whichever stored chunks are mathematically nearest to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo77t62wh3vmtc7o7oqzu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo77t62wh3vmtc7o7oqzu.jpg" alt="RAG pipeline: embedding model converts content to vectors, similarity search retrieves closest matches" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor runs this same pipeline across your entire codebase. The indexing pipeline in five steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Parse:&lt;/strong&gt; tree-sitter walks every source file and produces an Abstract Syntax Tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function_definition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class_definition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method_definition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Chunk:&lt;/strong&gt; Each AST node becomes an independent chunk with metadata. One function = one chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Embed:&lt;/strong&gt; Each chunk's text gets converted to a high-dimensional vector stored in Turbopuffer. Two chunks doing similar things will be close in vector space even if they share no words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# vector = [0.021, -0.14, 0.88, ...]  (1536 dimensions)
&lt;/span&gt;
&lt;span class="n"&gt;turbopuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 — Diff:&lt;/strong&gt; When you save a file, Cursor hashes it and compares against a Merkle tree. Only changed files get re-chunked and re-embedded.&lt;/p&gt;
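&lt;p&gt;A simplified sketch of that change detection, with file contents held in dicts. A real Merkle tree also hashes directories of hashes, so whole unchanged subtrees can be skipped with a single comparison; this flat version only shows the per-file idea.&lt;/p&gt;

```python
import hashlib

def file_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

def changed_files(old_index: dict, new_files: dict) -> list:
    # Only files whose hash differs (or that are brand new) need
    # re-chunking and re-embedding. Everything else is skipped.
    return [path for path, content in new_files.items()
            if file_hash(content) != old_index.get(path)]

old_index = {"auth.ts": file_hash("export function login() {}")}
new_files = {
    "auth.ts": "export function login() {}",    # unchanged: skipped
    "chart.ts": "export function render() {}",  # new: re-embedded
}

print(changed_files(old_index, new_files))  # ['chart.ts']
```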

&lt;p&gt;&lt;strong&gt;Query time — Retrieve:&lt;/strong&gt; Your question gets embedded the same way. Turbopuffer finds the top-K closest chunks and injects them into the model context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how does auth middleware work?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;turbopuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# inject results into model prompt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Cursor wins for active development. If you ask it about a function defined three directories away, it finds it. The memory is your codebase, indexed semantically, updated incrementally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hte4y6bmmfeoxhualdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hte4y6bmmfeoxhualdf.png" alt="Cursor Settings — Rules panel showing No Rules Yet" width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "rules" system (&lt;code&gt;.cursor/rules/&lt;/code&gt;) is the persistent context layer on top: project conventions, coding standards, architectural decisions. Unlike Dream Mode, Cursor doesn't write these for you. You write them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code vs. Cursor — memory comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage format&lt;/td&gt;
&lt;td&gt;Markdown files (25KB index cap)&lt;/td&gt;
&lt;td&gt;Vector embeddings (Turbopuffer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-writes memory&lt;/td&gt;
&lt;td&gt;Yes (Dream Mode, idle consolidation)&lt;/td&gt;
&lt;td&gt;No (you write the rules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codebase navigation&lt;/td&gt;
&lt;td&gt;Limited (text search, no semantic index)&lt;/td&gt;
&lt;td&gt;Excellent (AST parsing + semantic retrieval)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-session learning&lt;/td&gt;
&lt;td&gt;Yes (Dream Mode carries forward)&lt;/td&gt;
&lt;td&gt;Partial (rules persist, embeddings persist)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory poisoning risk&lt;/td&gt;
&lt;td&gt;Yes (bad sessions corrupt future context)&lt;/td&gt;
&lt;td&gt;Low (codebase is source of truth)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is retrieval augmented generation (RAG) and when do you actually need it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; RAG is the pattern of storing documents as vector embeddings, then at query time retrieving the most semantically similar chunks and injecting them into the model context. You need it when your knowledge base is too large to fit in a context window, and when you need semantic similarity search across unstructured text. You don't need it when your knowledge is structured (a regular database query is faster and more precise), or when your documents are small enough to fit in the prompt directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What NexusTrade does differently: structured memory with semantic queries
&lt;/h2&gt;

&lt;p&gt;Vector databases are great for fuzzy text matching: find me code that does authentication. They are terrible at structured constraints: find me only backtests for NVDA where the Sharpe ratio was above 1.5. For trading, memory isn't fuzzy. It's math.&lt;/p&gt;

&lt;p&gt;NexusTrade stores memory as typed &lt;code&gt;AgentSummary&lt;/code&gt; records in MongoDB. After each run (or at iteration 20, when the context window gets summarized), the agent writes a structured document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndud92vb3mnk4wfpugn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndud92vb3mnk4wfpugn4.png" alt="MongoDB AgentSummary document showing semanticInsights and proceduralLessons arrays" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each document has: &lt;strong&gt;semanticInsights&lt;/strong&gt; (up to 24 deduplicated patterns from this run), &lt;strong&gt;proceduralLessons&lt;/strong&gt; (up to 12 meta-lessons about how to run better next time), and structured &lt;strong&gt;portfolio&lt;/strong&gt; records with tickers, strategy type, instrument type, and backtest metrics per period.&lt;/p&gt;

&lt;p&gt;Before a new agent run, a separate fast LLM call reads the current conversation and extracts structured MongoDB query fields: which tickers are relevant, which strategy types, equity or options, any keywords. Then targeted retrieval injects only matching past summaries into the planner.&lt;/p&gt;

&lt;p&gt;If you're asking about NVDA options, it pulls past NVDA options runs. If you're asking about momentum strategies, it pulls past momentum runs. The retrieval is typed, not fuzzy.&lt;/p&gt;
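&lt;p&gt;A sketch of what "typed, not fuzzy" means in practice, with summaries as plain dicts and exact field matches in place of a MongoDB query. The field names are illustrative, not the exact NexusTrade schema.&lt;/p&gt;

```python
summaries = [
    {"tickers": ["NVDA"], "strategyType": "momentum", "instrument": "options",
     "semanticInsights": ["RSI entries whipsawed in 2022"]},
    {"tickers": ["META"], "strategyType": "mean-reversion", "instrument": "equity",
     "semanticInsights": ["Bollinger re-entries worked post-earnings"]},
]

def retrieve(summaries, tickers=None, instrument=None):
    # Exact field matches, not embedding similarity: a summary either
    # satisfies the constraint or it is excluded entirely.
    hits = []
    for s in summaries:
        if tickers and not set(tickers) & set(s["tickers"]):
            continue
        if instrument and s["instrument"] != instrument:
            continue
        hits.append(s)
    return hits

# "Asking about NVDA options" pulls only past NVDA options runs.
print(retrieve(summaries, tickers=["NVDA"], instrument="options"))
```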

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwooye4bbgw2b6h9bnie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwooye4bbgw2b6h9bnie.png" alt="Aurora querying its recent memory — synthesized findings from past agent runs injected into a new session" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result: agents that start sessions already knowing what worked, what failed, and what to avoid. Cold starts went from 20+ iterations to a fraction of that.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part nobody talks about: memory poisoning
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;War story — the options backtest bug:&lt;/strong&gt; During the options trading beta, I had a bug in the spread backtest engine. Credit spreads were being calculated incorrectly — sometimes showing catastrophic losses on strategies that would have been fine, sometimes showing phantom gains on strategies that would have lost money.&lt;/p&gt;
&lt;p&gt;The agent ran. It learned from the results. It wrote those lessons into memory. "Bull call spreads on META led to catastrophic losses." "Mean-reversion spreads on NVDA are unreliable." None of it was true. The strategies weren't bad. The backtester was wrong. The next runs came in already poisoned. The agent avoided spreads entirely — it was confident about it. It had "evidence."&lt;/p&gt;
&lt;p&gt;The fix was &lt;code&gt;insightsPipelineVersion&lt;/code&gt;. Every AgentSummary document is written with the current pipeline version number. Retrieval only returns documents matching the current version. Bump the version and every old document goes silent instantly — still in the database for analytics, but they stop matching the filter. We're now on version 6. Each bump corresponds to a backtesting bug fix that would have corrupted agent memory if old summaries kept being injected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybrj1gmx6ljcu6d3k2f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybrj1gmx6ljcu6d3k2f4.png" alt="Four spread backtest results — two obvious losses, two fake gains from the spread calculation bug" width="800" height="975"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the risk neither Cursor nor Claude Code faces at the same level. When your memory is downstream of a system that can be wrong, you need a versioning mechanism. A flat markdown file doesn't give you one.&lt;/p&gt;
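&lt;p&gt;The versioning mechanism itself is simple. A sketch, assuming each summary records the pipeline version that produced it; bumping the current version silences every document written by a buggy backtester without deleting anything.&lt;/p&gt;

```python
CURRENT_VERSION = 6

summaries = [
    # Written by the buggy spread engine: poisoned, must not be injected.
    {"insightsPipelineVersion": 5, "lesson": "Bull call spreads on META lost big"},
    # Written after the fix: safe to inject.
    {"insightsPipelineVersion": 6, "lesson": "3-year windows beat 1-year"},
]

def retrievable(summaries, version=CURRENT_VERSION):
    # Old documents stay in the database for analytics; they simply
    # stop matching the retrieval filter once the version is bumped.
    return [s for s in summaries if s["insightsPipelineVersion"] == version]

for s in retrievable(summaries):
    print(s["lesson"])  # only the version-6 lesson survives
```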




&lt;h2&gt;
  
  
  Memory that makes the agent better over time
&lt;/h2&gt;

&lt;p&gt;Storing memory is one thing. Using it to improve is another. NexusTrade runs a background worker that scores every completed agent run, extracts what worked, and injects those patterns into the next run automatically.&lt;/p&gt;

&lt;p&gt;A prompt enhancer queries top performers and extracts their successful tool sequences. A planner enhancer finds few-shot examples relevant to the current task by keyword. Both cache results and inject learned patterns into future prompts automatically. Most agents plateau at run 1: the same output on run 50 as on run 1, because nothing carries forward. This is the gap between a demo and a system.&lt;/p&gt;
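&lt;p&gt;A minimal sketch of that enhancer pattern, with hypothetical run records and a toy scoring scheme: rank completed runs, extract the tool sequences of the best ones, and append them to the next run's prompt.&lt;/p&gt;

```python
# Hypothetical completed-run records: a quality score plus the
# sequence of tools the run used. Names are illustrative.
runs = [
    {"score": 0.91, "tools": ["research", "backtest", "optimize", "backtest"]},
    {"score": 0.40, "tools": ["backtest"]},
    {"score": 0.78, "tools": ["research", "backtest"]},
]

def top_patterns(runs, k=2):
    # Rank by score, keep the k best, render their tool sequences.
    best = sorted(runs, key=lambda r: r["score"], reverse=True)[:k]
    return [" -> ".join(r["tools"]) for r in best]

def enhance_prompt(base_prompt, runs):
    # Inject learned patterns so run 50 starts ahead of run 1.
    patterns = "\n".join(top_patterns(runs))
    return base_prompt + "\n\nTool sequences that scored well previously:\n" + patterns

print(enhance_prompt("Build a momentum strategy for AAPL.", runs))
```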




&lt;h2&gt;
  
  
  Reading about memory isn't the same as using it.
&lt;/h2&gt;

&lt;p&gt;The concepts in this article are straightforward on paper. File dumps, LLM summarization, vector embeddings, structured queries. But the questions that actually matter only come up when you run it: what does Aurora remember from your past sessions? What gets injected, and what gets filtered out?&lt;/p&gt;

&lt;p&gt;Module 4 of AI Agents from Scratch puts you in the system directly. You query Aurora's memory and see what it has stored from your past runs. You watch the LLM read your prompt, generate query fields, pull matching summaries, and inject them before the agent starts.&lt;/p&gt;

&lt;p&gt;You'll see exactly what the agent knows before it knows what you're about to ask. This is not a simulation. It's the live production system, and what it has stored is specific to you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch/lesson/ai-m4-video-memory" rel="noopener noreferrer"&gt;Start Module 4 — free, no credit card&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/agent" rel="noopener noreferrer"&gt;Or open Aurora directly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part 4 of 5 in the &lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch" rel="noopener noreferrer"&gt;AI Agents from Scratch&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Try NexusTrade's AI trading agent free: &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;https://nexustrade.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Coinbase calls their chatbot an agent. I got fired for building a real one.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:01:01 +0000</pubDate>
      <link>https://forem.com/austin_starks/coinbase-calls-their-chatbot-an-agent-i-got-fired-for-building-a-real-one-1ael</link>
      <guid>https://forem.com/austin_starks/coinbase-calls-their-chatbot-an-agent-i-got-fired-for-building-a-real-one-1ael</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note from the author:&lt;/strong&gt; You're reading a Dev.to adaptation. The original on &lt;a href="https://nexustrade.io/blog/coinbase-fired-real-ai-agent-react-loop-20260413" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Coinbase fired me for building something they couldn't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fexgtur2sepwxfe4f8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fexgtur2sepwxfe4f8i.png" alt="Coinbase termination email — fired for building NexusTrade" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.coinbase.com/en-gb/advisor" rel="noopener noreferrer"&gt;Coinbase Advisor's&lt;/a&gt; own FAQ asks it directly: "Does the AI make trades on my behalf?" The answer: &lt;em&gt;No.&lt;/em&gt; It answers your questions. It suggests a portfolio. Then it waits for you to click. That's the architecture of a chatbot: one question, one answer, you do the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxc70lk6nqr2hu97tyxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxc70lk6nqr2hu97tyxs.png" alt="Coinbase Advisor FAQ — Does the AI make trades on my behalf? No." width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NexusTrade is fundamentally different. It runs a loop. You send one message. Aurora figures out what it needs to know, calls tools, reads the results, and keeps going until the task is done. No clicking through each step. No waiting for your approval on every action unless you want it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q37gnurvb7uxns4bxa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q37gnurvb7uxns4bxa9.png" alt="User sends 'Build an options trading bot' — Aurora responds with a full planning breakdown before calling a single tool" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tell Aurora to research a stock, build a strategy, backtest it across three market regimes, and stage it for live trading. It chains all of that together on its own. Whether trades execute automatically or require your sign-off is up to you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Actual Difference&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Coinbase built a 24/7 AI advisor. I built an autonomous agent. Here's the gap.
&lt;/h2&gt;

&lt;p&gt;Coinbase's marketing for Advisor is well-written. "Elite financial advice, democratized." "Turn your questions into actionable financial plans." Strong copy. The product behind it is a chatbot that generates recommendations you then manually execute.&lt;/p&gt;

&lt;p&gt;A team of engineers and millions of dollars, and they built a chatbot that waits for you to click. Here's what that buys you versus an actual agent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Coinbase Advisor&lt;/th&gt;
&lt;th&gt;NexusTrade (Aurora)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answers financial questions&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Executes trades autonomously&lt;/td&gt;
&lt;td&gt;No, requires explicit approval&lt;/td&gt;
&lt;td&gt;Yes, fully automated mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step task chaining&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes, up to 50 iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Builds and backtests a strategy&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes, in a single agent run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spawns subagents for parallel work&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human-in-the-loop approval controls&lt;/td&gt;
&lt;td&gt;Required for all actions&lt;/td&gt;
&lt;td&gt;Optional, per-action toggle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bottom row is the one that matters. Coinbase Advisor requires approval for everything because it has no loop. It answers, then stops. NexusTrade makes approval optional because Aurora has a loop. It can keep going on its own, and the approval controls let you decide how much of that autonomy you want.&lt;/p&gt;

&lt;p&gt;The loop is the product. Everything else is a feature.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Problem&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A chatbot answers once. That's not enough.
&lt;/h2&gt;

&lt;p&gt;You ask a chatbot: "Build me a momentum strategy and backtest it." It generates a paragraph describing what a momentum strategy should look like. Then it stops. If you want the backtest, you take what it said and do the work yourself.&lt;/p&gt;

&lt;p&gt;That's the one-shot problem. A language model answers the question it's given. It doesn't ask the next question, run the next tool, or check if the answer it gave was actually correct. It responds and waits.&lt;/p&gt;

&lt;p&gt;For a lot of tasks, that's fine. Answering a question is useful. But there's a whole class of work that requires chaining actions together: research a stock, pull its indicators, build a strategy based on what you find, backtest it, check if it survived 2022, adjust the parameters, backtest again. No single response handles that. You need something that runs until the task is done.&lt;/p&gt;

&lt;p&gt;You need a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the difference between a chatbot and an AI agent?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; A chatbot responds once and waits for you. An AI agent runs a loop: think, act, observe, repeat. It chains multiple tool calls together until the task is done without you directing each step.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Loop&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Thought. Action. Observation. Repeat.
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4fufGnXU25s"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In 2022, researchers from Princeton and Google Brain published a paper called &lt;em&gt;ReAct: Synergizing Reasoning and Acting in Language Models&lt;/em&gt;. The core idea was simple: instead of asking a model to produce a final answer, ask it to produce a thought, then an action, then read the result, then think again. Repeat until done.&lt;/p&gt;

&lt;p&gt;Thought → Action → Observation is the pattern every agent today runs on. Cursor uses it when you ask it to refactor a file and it reads the file, makes an edit, checks the diff, and continues. Claude Code uses it when it plans a multi-step task, runs each step, observes the output, and adjusts. Aurora uses it every time you send it a complex request.&lt;/p&gt;

&lt;p&gt;That loop is not a metaphor. It's a &lt;code&gt;while&lt;/code&gt; loop in production code. Aurora runs it on a background worker that polls every 500ms. Each iteration increments a counter, calls the model, executes the tool, saves the result, and goes again until the model sets &lt;code&gt;finalAnswer&lt;/code&gt; or the iteration limit is hit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# the ReAct loop, stripped to its core
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# full conversation history
&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Thought → Action → Input
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;           &lt;span class="c1"&gt;# task complete
&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# execute tool
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;       &lt;span class="c1"&gt;# model sees what it did + what happened
&lt;/span&gt;    &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hit the limit — compress + return best answer so far
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangChain runs that same while loop internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentExecutor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stock_screener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backtest_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_strategy&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ReAct prompt tells the model to output Thought / Action / Observation
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hwchase17/react&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research NVDA and build a momentum strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# verbose=True prints each Thought / Action / Observation to the terminal.
# You'll see the while loop running in real time.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora implements the same loop directly, without LangChain, on a TypeScript background worker with a state machine that handles summarization, parallel subagents, and iteration limits. The pattern is identical. The infrastructure is purpose-built for trading. LangChain abstracts the loop but not the cost controls: per-iteration token metering, configurable summarization thresholds, and circuit breakers that halt a runaway agent are concerns it was never designed to handle.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Production Reality&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually looks like when Aurora runs.
&lt;/h2&gt;

&lt;p&gt;The screenshots below are from a real Aurora session. The task: build a fully autonomous 0DTE SPY options bot on a $25,000 account.&lt;/p&gt;

&lt;p&gt;Before the ReAct loop starts, a separate step runs first: the planner. This is a specialized prompt, distinct from the main loop, that takes the user's request, reasons about what it needs to know, and generates a structured plan. It's not iteration 1. It's the step before iteration 1. Aurora asks clarifying questions here, not because the loop told it to, but because the planner is designed to gather everything the loop will need before the first tool call fires:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha18kw8s9u8w61lwsqnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha18kw8s9u8w61lwsqnz.png" alt="Aurora planning phase — asking clarifying questions before taking action" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the planner has what it needs, it produces a structured plan and hands it to the main agent. The ReAct loop starts. This is the Thought + Action visible in the approval modal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l6e74hvz4wr2rjq6dr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l6e74hvz4wr2rjq6dr8.png" alt="Aurora agent — Thought and Action visible in the approval modal, showing the ReAct loop structure" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that action runs, the result comes back as an observation. The agent reads it, generates the next Thought, and continues. Here's the SPY regime data Aurora used to decide which options hypothesis to test first:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o8llqufesspk1vrw5pt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o8llqufesspk1vrw5pt.png" alt="Aurora observation step — SPY price, SMA, and VIX chart result after tool call" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the plan Aurora generated for the full strategy before calling any tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfu637l72x62g4vkcw2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfu637l72x62g4vkcw2m.png" alt="Aurora fully automated — options strategy plan with subagent spawn" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;War story — $300 in one day:&lt;/strong&gt; When I launched strategy-triggered agents, I made one mistake: I didn't charge tokens for the planning phase. The planner is a separate LLM call that happens before the loop. I had token checks on the iterations. The planner, at launch, was free. A user configured a strategy with conditions that were almost always true, set to trigger as frequently as possible. Every time the condition fired: plan, stop (out of tokens), fire again. Plan. Stop. Fire. &lt;strong&gt;$300 in API costs. One day.&lt;/strong&gt; Hundreds of planning calls. None made it past iteration 1. The fix was two lines: deduct tokens before the planner runs, and deactivate the strategy if the user can't afford it. The step before the loop is just as real as the loop. Gate them all.&lt;/p&gt;
&lt;/blockquote&gt;
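&lt;p&gt;The fix can be sketched in a few lines. Everything here is illustrative, not NexusTrade's real API: &lt;code&gt;PLANNER_COST&lt;/code&gt;, &lt;code&gt;Strategy&lt;/code&gt;, and &lt;code&gt;run_triggered_agent&lt;/code&gt; are assumed names. The point is that the balance check fires before the planning call, not after:&lt;/p&gt;

```python
# Sketch of the fix from the war story above. PLANNER_COST, Strategy, and
# run_triggered_agent are illustrative names, not NexusTrade's actual code.
PLANNER_COST = 500  # estimated tokens one planning call consumes

class Strategy:
    def __init__(self, name):
        self.name = name
        self.active = True

    def deactivate(self, reason):
        self.active = False
        self.reason = reason

def run_triggered_agent(user_tokens, strategy, plan, loop):
    # Gate the planner exactly like a loop iteration: pay first, run second.
    if user_tokens["balance"] - PLANNER_COST >= 0:
        user_tokens["balance"] -= PLANNER_COST
    else:
        # Can't afford even the step before the loop: stop re-triggering.
        strategy.deactivate("insufficient tokens for planning phase")
        return None
    steps = plan()           # the planning LLM call, now paid for
    return loop(steps)       # the ReAct loop, gated per iteration as before
```

&lt;p&gt;With this gate in place, a strategy that fires constantly burns its owner's token balance down and then deactivates itself, instead of burning the operator's API budget.&lt;/p&gt;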




&lt;p&gt;&lt;em&gt;Autonomy Controls&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fully automated or semi-automated. You choose how much to trust it.
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/y3D1MkX8Ucw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Every agent system has to answer one question: how much should the agent do on its own before checking in with a human? The naive answer is "let it run." The production answer is: every tool your agent can call should be explicitly approved for autonomous execution, or it should require human sign-off first.&lt;/p&gt;

&lt;p&gt;Think of it as a whitelist. Some tools are cheap, fast, and reversible: reading market data, running a screener, generating a plan. Those can run without asking. Others are expensive, slow, or irreversible: submitting a live trade, deploying a strategy, deleting something. Those should pause and wait. The toggle in Aurora's UI is the implementation of that concept.&lt;/p&gt;
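&lt;p&gt;As a sketch, with illustrative tool names and an &lt;code&gt;ask_human&lt;/code&gt; callback standing in for the approval modal (none of this is Aurora's actual configuration):&lt;/p&gt;

```python
# Cheap, fast, reversible tools that may run without asking. The names and
# the ask_human callback are illustrative, not Aurora's real configuration.
AUTO_APPROVED = {"get_market_data", "run_screener", "generate_plan"}

def execute_action(action, tool_input, tools, ask_human):
    # Anything not explicitly whitelisted, including unknown tools,
    # defaults to the safe path: pause the loop and wait for sign-off.
    if action not in AUTO_APPROVED:
        if not ask_human(action, tool_input):
            return {"status": "rejected", "action": action}
    return tools[action](tool_input)
```

&lt;p&gt;In fully automated mode, the same gate runs with an &lt;code&gt;ask_human&lt;/code&gt; that always approves; the whitelist machinery doesn't change, only the callback does.&lt;/p&gt;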

&lt;p&gt;Aurora has two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully Automated&lt;/strong&gt; — runs the entire loop without stopping. Every tool call executes immediately. You see the results when it's done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-Automated&lt;/strong&gt; — pauses before every action for your approval. You see the Thought and the proposed Action before anything executes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65x366nkn1i5vqnb78go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65x366nkn1i5vqnb78go.png" alt="Aurora in Semi-Automated mode — pauses for approval at each action" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In semi-automated mode, Aurora shows you its Thought and proposed Action before running any tool. You can approve, reject with feedback, or switch to fully automated if you've seen enough to trust it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir9ckbh4wtrtikpzsdkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir9ckbh4wtrtikpzsdkj.png" alt="Agent Approval Required — plan modal showing strategy, Approve and Reject buttons" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what "human-in-the-loop" actually means in production. It's not a philosophical stance about AI safety. It's a checkbox in the UI. Most experienced users start in semi-automated mode to verify the plan, then switch to fully automated once they trust the direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Why use subagents instead of giving one agent all the tools and letting it run everything?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Context window limits. By iteration 15-20, thousands of tokens of Thought/Action/Observation history have accumulated and reasoning quality drops. Subagents keep each context small and focused on one task.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Engineering Problem Nobody Talks About&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Long loops make agents dumb. Here's how Aurora solves it.
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xMwMtIvxNmg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The ReAct loop has a problem that shows up around iteration 15-20: the context window fills up. Every Thought, Action, and Observation gets appended to the conversation. By the time you're 20 iterations deep, the model is attending to thousands of tokens just to decide what to do next. Reasoning quality drops. The agent starts making worse decisions.&lt;/p&gt;

&lt;p&gt;The standard advice is "use subagents to keep contexts small." That's true and Aurora does it. But there's a second mechanism that's less talked about: conversation summarization.&lt;/p&gt;

&lt;p&gt;At iteration 20, Aurora doesn't stop. It summarizes. The model compresses everything it has learned: findings, portfolios created, what worked, what didn't. That summary becomes the context for a new conversation. The loop restarts with a clean window and the knowledge of everything that came before.&lt;/p&gt;

&lt;p&gt;The hard cap is separate: &lt;code&gt;totalIterations&lt;/code&gt; accumulates across all conversation resets. When you hit your configured maximum (default: 20 total, up to 50 on premium), the agent stops and delivers a final answer with whatever it accomplished. The summarization is how it stays sharp. The hard cap is how you control cost.&lt;/p&gt;
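&lt;p&gt;The interplay of the two mechanisms can be sketched with assumed interfaces: &lt;code&gt;step()&lt;/code&gt; stands in for one model-plus-tool iteration and &lt;code&gt;summarize()&lt;/code&gt; for the compression call. Neither is Aurora's real internals:&lt;/p&gt;

```python
# Sketch of per-conversation summarization plus a hard cap that accumulates
# across resets. step() and summarize() are assumed stand-ins, not real APIs.
def run_with_resets(task, step, summarize, summarize_at=20, max_total=50):
    messages = [task]
    conversation_iterations = 0        # resets after each summarization
    for _ in range(max_total):         # the hard cap: never resets
        done, result = step(messages)
        if done:
            return result              # model produced a final answer
        messages.append(result)
        conversation_iterations += 1
        if conversation_iterations == summarize_at:
            # Compress everything learned so far into one summary message,
            # then restart with a clean window plus that knowledge.
            messages = [task, summarize(messages)]
            conversation_iterations = 0
    return summarize(messages)         # hit the cap: deliver what we have
```

&lt;p&gt;Summarization keeps the window small; the outer counter keeps the bill bounded. They're independent knobs, which is why you can raise one without touching the other.&lt;/p&gt;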

&lt;p&gt;Aurora also queries its own memory across sessions. If you've run five agent tasks this week, it can synthesize findings from all of them and bring relevant context into a new run. That's not the base model, which has no memory. That's the app layer, built on top of a stateless model, giving it continuity across sessions that the model itself can't have.&lt;/p&gt;
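&lt;p&gt;A minimal sketch of that app-layer memory, with keyword-overlap scoring standing in for real retrieval (&lt;code&gt;RunMemory&lt;/code&gt; and &lt;code&gt;build_context&lt;/code&gt; are illustrative names, not Aurora's implementation):&lt;/p&gt;

```python
# Illustrative sketch of app-layer memory on a stateless model: store each
# run's summary, retrieve the relevant ones, and inject them into the next
# run's prompt. Keyword overlap is a stand-in for real retrieval.
class RunMemory:
    def __init__(self):
        self.summaries = []

    def save(self, summary):
        self.summaries.append(summary)

    def relevant(self, task, limit=3):
        words = set(task.lower().split())
        scored = [(len(words.intersection(s.lower().split())), s)
                  for s in self.summaries]
        scored.sort(reverse=True)
        return [s for score, s in scored[:limit] if score > 0]

def build_context(memory, task):
    # The model itself is stateless; continuity lives entirely in this prompt.
    past = memory.relevant(task)
    header = "Relevant findings from past runs:\n" if past else ""
    return header + "\n".join(past) + "\n\nTask: " + task
```

&lt;p&gt;Swap the overlap scoring for embeddings and the list for a database and you have the shape of a production memory layer. The model never changes; only the prompt it receives does.&lt;/p&gt;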

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphdolle27oaagguu57rj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphdolle27oaagguu57rj.png" alt="Aurora memory query — synthesizing findings from past agent runs into a new session" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Module 3&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading about the loop is not the same as running it.
&lt;/h2&gt;

&lt;p&gt;The ReAct loop looks simple on paper. In practice, the interesting questions are the ones that only come up when you run it: Why did it pick that tool instead of a different one? Why did the reasoning change after iteration 5? What happens when a tool call fails?&lt;/p&gt;

&lt;p&gt;Module 3 of AI Agents from Scratch puts you in the loop directly. You send Aurora a real task in fully automated mode and watch each Thought → Action → Observation cycle play out in real time. Then you switch to semi-automated and approve or reject actions one at a time.&lt;/p&gt;

&lt;p&gt;You'll also see what happens when context starts to accumulate, when the summarization triggers, and what the compressed summary actually looks like. This isn't a simulation. It's the live production system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil1y075y3seupmr8dw2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil1y075y3seupmr8dw2c.png" alt="Aurora agent — detailed plan with subagent spawn visible in approval modal" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch/lesson/ai-m3-video-orchestration" rel="noopener noreferrer"&gt;Start Module 3 — free, no credit card&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/agent" rel="noopener noreferrer"&gt;Or open Aurora directly&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 3 of 5 in the &lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch" rel="noopener noreferrer"&gt;AI Agents from Scratch&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Try NexusTrade's AI trading agent free: &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;https://nexustrade.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>career</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Everyone thinks ChatGPT is an AI agent. It's not.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:21:39 +0000</pubDate>
      <link>https://forem.com/austin_starks/everyone-thinks-chatgpt-is-an-ai-agent-its-not-1n0l</link>
      <guid>https://forem.com/austin_starks/everyone-thinks-chatgpt-is-an-ai-agent-its-not-1n0l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note from the author:&lt;/strong&gt; You're reading a Dev.to adaptation. The original on &lt;a href="https://nexustrade.io/blog/chatgpt-is-not-an-ai-agent-controller-pattern-20260413" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Everyone thinks ChatGPT is an AI agent. It isn't.&lt;/p&gt;

&lt;p&gt;It's a chatbot with tools. And that difference is the reason most "AI agent" startups don't actually work.&lt;/p&gt;

&lt;p&gt;The distinction isn't semantic. It changes what you can build, what breaks, and why. If you're building an agent, evaluating one, or wondering why the product you're using doesn't do what it claims, this is the answer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Start Here&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A language model knows nothing. That's by design.
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/zDg4tJ8aiLY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;A raw language model is stateless. It has no memory of you. It doesn't know what happened in markets today. It can't look anything up. All it can do is take whatever text you hand it and predict what should come next.&lt;/p&gt;

&lt;p&gt;That sounds limiting. It is. But it's also the foundation everything else builds on. The OpenAI Playground is the closest thing to a language model in its purest form. No apps layered on top. No tools. Just a system prompt, a conversation, and a model responding to exactly what you give it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OpenAI Playground shows you the raw model.&lt;/strong&gt; No tools. No memory. No app layer. Ask it your name and it doesn't know. Give it your name in the system prompt and now it knows. Everything the model knows in a given conversation came from somewhere in the prompt: system message, user message, or tool results. Nothing else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lrlyarkq2fg6368d3ea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lrlyarkq2fg6368d3ea.png" alt="OpenAI Playground — the raw language model with no tools, no memory, no context"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT is an app built on top of that model.&lt;/strong&gt; It knows your name because it has memory. It can search the web because it has tools. Those things aren't the model. They're layers the app added. Strip them away and you're back to the Playground.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frphx131ygkisaq92mmky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frphx131ygkisaq92mmky.png" alt="ChatGPT — an app built on top of a language model, with memory and tools layered on"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT still operates as a back-and-forth conversation where you're the one directing every move. That's a chatbot. An agent is something that can direct itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An agent runs a loop.&lt;/strong&gt; It thinks, picks an action, executes it through a tool, observes the result, and repeats until the task is done or it can't continue. You don't direct each step. The agent does. That loop is what separates it from every chatbot you've ever used. Tools and system prompts are how you build the loop. The loop is what makes it an agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the difference between ChatGPT and an AI agent?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; ChatGPT responds once per message and waits for you to reply. An AI agent runs a loop: it thinks, takes an action through a tool, observes the result, and decides what to do next — all without waiting for you to direct each step.&lt;/p&gt;
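&lt;p&gt;The distinction fits in a few lines of Python, assuming generic &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt; callables (all names here are illustrative):&lt;/p&gt;

```python
# A minimal contrast. llm and tools are assumed callables, not a real API.
def chatbot(llm, message):
    return llm(message)                  # answers once, then waits for you

def agent(llm, tools, task, max_steps=10):
    history = [task]
    for _ in range(max_steps):
        thought, action, arg = llm(history)             # think
        if action == "final_answer":
            return arg                                  # done: exit the loop
        observation = tools[action](arg)                # act
        history.append((thought, action, observation))  # observe, repeat
    return "stopped at step limit"
```

&lt;p&gt;Same model in both functions. The only difference is the loop wrapped around it, which is the entire point of this article.&lt;/p&gt;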




&lt;p&gt;&lt;em&gt;System Prompts&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The instructions the user never sees.
&lt;/h2&gt;

&lt;p&gt;Before any user message reaches a language model in a production app, there's a system prompt. It runs first, every time. It tells the model who it is, what it can do, what format to respond in, and how to handle edge cases.&lt;/p&gt;

&lt;p&gt;A well-designed system prompt isn't a paragraph of vague instructions. It has structure: an identity section, explicit directives, data sources or context, examples of correct behavior, and output format rules. The model's responses are only as good as the system prompt shaping them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# INSTRUCTIONS&lt;/span&gt;
You are Aurora, an AI trading assistant for NexusTrade.
You help users build, backtest, and manage trading strategies.
Always respond in JSON. forceJSON: true.
Never recommend a specific stock without a supporting backtest.
If the request is ambiguous, ask one clarifying question before proceeding.

&lt;span class="gh"&gt;# EXAMPLES&lt;/span&gt;
User: "I want to back test a trading strategy"
Assistant: {"tool": "backtest", "portfolio_id": "...", "start": "2022-01-01", "end": "2024-01-01"}

User: "Screen for high momentum stocks"
Assistant: {"tool": "screener", "query": "SELECT ticker FROM stocks WHERE rsi_14 &amp;gt; 70 ORDER BY momentum DESC"}

&lt;span class="gh"&gt;# OUTPUT FORMAT&lt;/span&gt;
Always respond in syntactically valid JSON.
No markdown fences. No explanation unless explicitly asked.
Schema: {"tool": string, "parameters": object}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What makes that system prompt work? Each section has a specific job. &lt;strong&gt;Instructions&lt;/strong&gt; pin the model's identity and hard constraints. If it's not written down, the model will invent behavior. &lt;strong&gt;Examples&lt;/strong&gt; show the model what correct output looks like without having to explain it in prose — one good example beats three paragraphs of description. &lt;strong&gt;Output format&lt;/strong&gt; eliminates ambiguity about structure. Without it, the model might respond in JSON sometimes and plain text other times, and your parser breaks.&lt;/p&gt;

&lt;p&gt;The bad version of this prompt is four words: &lt;em&gt;"You are a trading assistant."&lt;/em&gt; The model will try to be helpful and will fail in unpredictable ways. No output contract means you'll get markdown one response and raw JSON the next. No examples means the model guesses what "backtest" should return. No constraints means it'll recommend NVDA when it shouldn't, apologize when it doesn't need to, and ask five clarifying questions instead of one. Every missing line is a failure mode you'll discover in production.&lt;/p&gt;

&lt;p&gt;Prompt engineering is designing the instructions that run silently before the user types anything. In production, that's the difference between an AI that does what you need and one that does something close but wrong in ways you can't predict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-shot vs. one-shot — same prompt, different parser:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zero-shot (no examples)&lt;/th&gt;
&lt;th&gt;One-shot (one example added)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output format is inconsistent — sometimes JSON, sometimes prose&lt;/td&gt;
&lt;td&gt;Output format is reliable — the model mirrors the example&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases produce unpredictable structures&lt;/td&gt;
&lt;td&gt;Edge cases degrade gracefully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parser breaks in production&lt;/td&gt;
&lt;td&gt;Parser handles it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/A2YfRBycins"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the course: In Module 2's first exercise, you build a real system prompt from scratch and run it against Gemini using a token grant we give you. You write the instructions, the examples, and the output format rules. Then you render it and see exactly what the model receives. Most people have never seen a production system prompt in full.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Tools&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI doesn't execute anything. Your code does.
&lt;/h2&gt;

&lt;p&gt;Here's the thing most people get wrong about AI agents: the model doesn't actually do anything. It generates text. Your system reads that text, figures out what to do with it, and executes the action. The result comes back. The model sees it and continues.&lt;/p&gt;

&lt;p&gt;That's a tool call. The model outputs a structured JSON object that describes what it wants to do. Your code parses the JSON and runs the actual function. Nothing happens until your system does something with the output.&lt;/p&gt;

&lt;p&gt;A concrete example: suppose the model outputs this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"backtest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"portfolio_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"start_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"end_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-01"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JSON itself does nothing. Your system reads it, calls the backtest API with those parameters, gets the results, and feeds them back into the conversation. Now the model can see what happened and decide what to do next.&lt;/p&gt;

&lt;p&gt;This is why "the AI is doing it" is a slightly misleading frame. The AI is deciding what to do. Your infrastructure is doing it. The distinction matters because it means every tool an agent has is something a human explicitly built and wired up. Agents don't gain new capabilities on their own.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/AngJy9_UP_4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; An AI agent outputs a tool call to "buy 10 shares of AAPL." What actually executes the trade?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Your code does. The agent generates a JSON object describing what it wants. Your system parses it, calls the brokerage API with those parameters, and returns the result. The model never touches the market directly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Production Reality&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How this scales: 23 sub-prompts and one classifier.
&lt;/h2&gt;

&lt;p&gt;Once you understand system prompts and tools, you can build an agent that does one thing well. The harder problem is building one that does many things well without the system prompt becoming impossible to maintain.&lt;/p&gt;

&lt;p&gt;The answer most production apps land on is the same: don't build one giant prompt. Build many focused ones and route between them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;War story — Aurora V1 (2023):&lt;/strong&gt; The first version ran on GPT-3. 2,048-token context window. One giant prompt, but the output window was so small it couldn't generate a full portfolio object in a single call. So I chained three separate prompts: portfolio → conditions → actions → an orchestration step to stitch the pieces together. JSON mode didn't exist yet. I'd instruct the model to respond in JSON, it would partially comply, I'd parse the output, watch it fail, then retry up to three times with a message explaining exactly where the JSON broke. Every prompt was a hardcoded string in the source code. Changing one instruction meant a code deploy. Aurora v1 did exactly one thing: create portfolios. That's it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zl3h6njr9siar0iek0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zl3h6njr9siar0iek0g.png" alt="Aurora v1 — the original GPT-3 powered chat interface for creating algorithmic trading strategies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The classifier exists because I built that version first.&lt;/p&gt;

&lt;p&gt;The controller is the decision layer that sits between the user and every sub-agent. In NexusTrade, every message you send to Aurora hits the classifier first. It reads your message and a list of 23 specialized sub-prompts, each with its own description. It picks the one that should handle your request and routes to it. That sub-prompt has a tight system prompt, a narrow tool list, and examples specific to its job. The main model only ever sees one task at a time.&lt;/p&gt;

&lt;p&gt;The classifier is &lt;code&gt;gemini-3.1-flash-lite-001&lt;/code&gt; at &lt;code&gt;temperature: 0&lt;/code&gt; with &lt;code&gt;forceJSON: true&lt;/code&gt;. Fast, cheap, deterministic. It runs on every message. The expensive models only run when a message reaches them.&lt;/p&gt;
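&lt;p&gt;The routing step itself is small. A sketch, with a keyword match standing in for the cheap classifier call — the real classifier is an LLM returning forced JSON, and the route names and descriptions here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the routing layer. In production the pick is made by a cheap,
// deterministic LLM call; a keyword match stands in for it here so the
// control flow is visible. Routes and descriptions are illustrative.
interface SubPrompt {
  name: string;
  description: string;  // what the classifier reads to decide
  systemPrompt: string; // the focused prompt the main model receives
}

const subPrompts: SubPrompt[] = [
  { name: "create_portfolio", description: "build or edit a trading strategy", systemPrompt: "You create portfolios..." },
  { name: "backtest", description: "evaluate a strategy on historical data", systemPrompt: "You run backtests..." },
  { name: "screener", description: "find stocks matching indicator criteria", systemPrompt: "You screen stocks..." },
];

function route(userMessage: string): SubPrompt {
  const lower = userMessage.toLowerCase();
  for (const candidate of subPrompts) {
    if (lower.includes(candidate.name.replace("_", " ")) || lower.includes(candidate.name)) {
      return candidate;
    }
  }
  return subPrompts[0]; // real systems define an explicit fallback route
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only the routing call runs on every message; the expensive model behind the chosen sub-prompt runs once, with a narrow tool list.&lt;/p&gt;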

&lt;p&gt;Four engineering reasons this wins over a single giant prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Focus.&lt;/strong&gt; Each sub-prompt sees only the tools and instructions relevant to its task. The model isn't confused by 200 rules that don't apply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debuggability.&lt;/strong&gt; When a route breaks, you know exactly which sub-prompt to fix. No hunting through a monolith.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental scaling.&lt;/strong&gt; Add a new capability by writing a new sub-prompt and a trigger description. Nothing else changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost control.&lt;/strong&gt; Only the matched sub-prompt runs against the expensive model. The classifier is cheap by design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the architecture almost every production AI app at scale converges on. ChatGPT's Custom GPTs are sub-prompts. Claude's Projects are sub-prompts. Cursor routes your request before invoking the right tool. You've been using this pattern without knowing what to call it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;One More Thing&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: the same concept with a standard interface.
&lt;/h2&gt;

&lt;p&gt;The AI industry has a naming problem. Function calling, tool use, skills, MCP servers. They all describe the same core concept: a list of things the agent is allowed to do, with defined inputs and outputs, so it can generate parameters and your system can execute the call.&lt;/p&gt;

&lt;p&gt;MCP (Model Context Protocol) is Anthropic's open standard for this. Think of it as USB for AI agents. Before USB, every device had its own connector. MCP creates one standard so any agent can connect to any tool that exposes an MCP server.&lt;/p&gt;

&lt;p&gt;NexusTrade runs an MCP server. Here's what that actually looks like in practice. You add one entry to your Claude Desktop config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nexustrade"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://nexustrade.io/api/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer &amp;lt;your-api-key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After that, you open Claude Desktop and ask: &lt;em&gt;"What's the current RSI of NVDA?"&lt;/em&gt; — Claude calls &lt;code&gt;screen_stocks&lt;/code&gt; on the NexusTrade MCP server, the server returns the live value, and Claude responds with the number and what it means in context. It's the same tool engine Aurora uses inside NexusTrade. No copy-paste. No API docs. One tool implementation, available from any MCP-compatible client.&lt;/p&gt;

&lt;p&gt;The name changes depending on the ecosystem. The pattern doesn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Module 2&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading this isn't enough.
&lt;/h2&gt;

&lt;p&gt;Reading about system prompts and writing one that works are different skills. Understanding the classifier pattern and knowing where it breaks are different things. The only way to close that gap is to build something and watch it fail.&lt;/p&gt;

&lt;p&gt;Module 2 has two exercises built around this. In the first, you write a real system prompt from scratch (instructions, examples, output format) and render it against a live Gemini model using tokens we give you. You see exactly what the model receives and how it responds. In the second, you run the real NexusTrade classifier. You read the sub-prompt descriptions. You type messages and watch them route. Then you try to find edge cases that break it.&lt;/p&gt;

&lt;p&gt;Both exercises use real infrastructure. Real models. Real NexusTrade prompts. Nothing is simulated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch/lesson/ai-m2-video-chatgpt" rel="noopener noreferrer"&gt;Start Module 2 — free, no credit card&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 2 of 5 in the &lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch" rel="noopener noreferrer"&gt;AI Agents from Scratch&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Try NexusTrade's AI trading agent free: &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;https://nexustrade.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Claude Code's source code just leaked. Today I'm going to teach you how it works.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:21:31 +0000</pubDate>
      <link>https://forem.com/austin_starks/claude-codes-source-code-just-leaked-today-im-going-to-teach-you-how-it-works-7nf</link>
      <guid>https://forem.com/austin_starks/claude-codes-source-code-just-leaked-today-im-going-to-teach-you-how-it-works-7nf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note from the author:&lt;/strong&gt; You're reading a Dev.to adaptation. The original on &lt;a href="https://nexustrade.io/blog/claude-code-source-leaked-ai-agent-deep-dive-20260413" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Why Now&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code's source code just leaked. Half a million lines. And most people reading it have no idea what they're looking at.
&lt;/h2&gt;

&lt;p&gt;On March 31, 2026, Anthropic accidentally published a source map in the Claude Code npm package that exposed the full TypeScript source. Over 512,000 lines. Gone public.&lt;/p&gt;

&lt;p&gt;The internet went crazy.&lt;/p&gt;

&lt;p&gt;Engineers are posting breakdowns. Developers are building clones. Everyone's asking the same question: how does this thing actually work?&lt;/p&gt;

&lt;p&gt;If you already understood how AI agents are built, you'd know exactly what you're looking at.&lt;/p&gt;

&lt;p&gt;That's the gap. And that's what this course closes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Building an AI agent isn't nearly as hard as you think."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I know this because I built one from scratch before most people knew the term. Ocebot was an autonomous claim adjudication agent I built at Oscar Health. It was running real healthcare workflows in production while the industry was still debating whether agents were real. Before that, I got a master's in software engineering from Carnegie Mellon. Since then I've interviewed at OpenAI and Meta for AI-native roles, and watched this exact architecture become the standard for what they test.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/IgXmZmUshD8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What It Means&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The leak didn't reveal magic. It revealed a loop.
&lt;/h2&gt;

&lt;p&gt;Everyone who's read a breakdown has seen the same thing: at the center of Claude Code is a loop. The model thinks. It picks a tool. The tool runs. The result comes back. Repeat until done.&lt;/p&gt;

&lt;p&gt;That loop has a name. It's called ReAct, from a 2022 paper out of Google Research. Cursor runs on it. Claude Code runs on it. Every production AI agent you've used runs some version of it.&lt;/p&gt;

&lt;p&gt;Here's the actual structure from &lt;code&gt;src/query.ts&lt;/code&gt;, the 1,729-line file at the heart of the leak. Most people miss the key detail: tools start executing &lt;em&gt;while the model is still streaming&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/query.ts · claude code v2.1.88 · simplified&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Trim context before every call (4-layer cascade)&lt;/span&gt;
  &lt;span class="c1"&gt;//    HISTORY_SNIP → microcompact → CONTEXT_COLLAPSE → autocompact&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;trimContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Stream the model response&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;streamComplete&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StreamingToolExecutor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// RWLock: reads parallel, writes exclusive&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Dispatch tools IMMEDIATELY as they appear in the stream&lt;/span&gt;
  &lt;span class="c1"&gt;//    Don't wait for the model to finish generating&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 4. Collect all tool results (some ran in parallel)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// No tools called — model is done&lt;/span&gt;

  &lt;span class="c1"&gt;// 5. Feed results back and go again&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contentBlocks&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toToolResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop that makes Cursor feel fast is the same pattern. Read tools (grep, file read, search) run in parallel. Write tools get an exclusive lock. The model doesn't wait for a tool to finish. It's already streaming the next thought while the tool runs.&lt;/p&gt;
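&lt;p&gt;The read-parallel, write-exclusive scheduling can be sketched in a few lines. This is an illustration of the idea, not the leaked RWLock implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Read-parallel, write-exclusive scheduling, sketched. Reads are dispatched
// immediately and awaited as a batch; a write drains every in-flight read
// and then runs alone. Illustrative only, not the leaked implementation.
interface ToolCall {
  name: string;
  kind: "read" | "write";
}

async function runBatch(calls: ToolCall[], execute: Function) {
  const results: unknown[] = [];
  let pendingReads: unknown[] = [];
  for (const call of calls) {
    if (call.kind === "read") {
      // Reads (grep, file read, search) start without waiting on each other.
      pendingReads.push(execute(call));
    } else {
      // A write takes an exclusive turn: finish every in-flight read first.
      results.push(...await Promise.all(pendingReads));
      pendingReads = [];
      results.push(await execute(call));
    }
  }
  results.push(...await Promise.all(pendingReads));
  return results;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reads never block each other; only a write forces a serialization point, which is why a burst of greps and file reads feels instant.&lt;/p&gt;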

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct: Synergizing Reasoning and Acting in Language Models (arxiv.org, 2022)&lt;/a&gt; — the paper behind the loop that powers Cursor, Claude Code, and every production AI agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chatbot vs. Agent: the line that matters&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ChatGPT / Copilot&lt;/th&gt;
&lt;th&gt;Aurora / Claude Code / Cursor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Responds to your prompt&lt;/td&gt;
&lt;td&gt;Runs a loop until the task is done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starts fresh every message&lt;/td&gt;
&lt;td&gt;Reads memory from past sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You direct every step&lt;/td&gt;
&lt;td&gt;Directs itself through subtasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can't use tools autonomously&lt;/td&gt;
&lt;td&gt;Calls tools, observes results, decides what's next&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output: text&lt;/td&gt;
&lt;td&gt;Output: actions in the real world&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The loop is the core. But the loop alone doesn't get you to production. The rest of this series covers what wraps around it: how controllers route decisions, how the orchestration loop manages autonomy without going off the rails, how memory lets an agent improve across sessions instead of starting cold every run, and how you measure whether any of it is actually working.&lt;/p&gt;

&lt;p&gt;Most people using Cursor or Claude Code are operating on intuition. They can't explain why it works when it works or why it fails the way it does. They can't evaluate whether the next tool is actually better or just better marketed. That's fine until you're the one expected to build something, own something, or interview for something. Then it matters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the course: "I recommend you take a look at the ReAct paper. You don't have to read the entire thing. Drop it in ChatGPT or Claude. Read the abstract. Understand it. Because this is the basis of all AI agents today."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Who This Is For&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three types of people need this course.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You're a SWE or PM at a company that just shipped an "AI roadmap."&lt;/strong&gt; Your team is using the word "agent." You're expected to own something in this space within the next two quarters and you'd like to understand what you're actually building before that conversation happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have an AI-native systems design interview coming up.&lt;/strong&gt; Meta, Google, Anthropic, xAI still test databases. Now they layer agent architecture on top: design a system that can reason, remember, and self-correct. This series covers exactly what they test. I know because I went through it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You use Cursor or Claude Code every day and you've started wondering what's actually happening.&lt;/strong&gt; Why does it fail the way it does? Why does Cursor feel smarter than Claude Code when both use Claude under the hood? What would you change if you could get inside it? That's what this series answers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How technical is it? Semi-technical. There are code snippets and diagrams. You are not expected to reproduce them. The goal is understanding how agents work, not memorizing syntax.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;The Full Picture&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Big tech is now interviewing for four specific things. This series covers all of them.
&lt;/h2&gt;

&lt;p&gt;The AI-Native SWE interview is a different animal. Linked lists are not the bottleneck anymore. The systems design component now includes agents that reason, remember, act, and self-correct. Here's exactly what they're testing, and what every article in this series teaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  01. Controllers &amp;amp; Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;What actually separates a language model from an agent. System prompts, tool schemas, JSON generation, function calling, and the controller layer that routes every decision. MCP servers are just tools with a standardized interface. You need to know why that matters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interview signal: "Walk me through how you'd architect the prompt layer for a multi-tool agent."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  02. The ReAct Loop &amp;amp; Orchestration
&lt;/h3&gt;

&lt;p&gt;The Reasoning + Acting loop is the engine behind every serious agent: Claude Code, Cursor, Devin. How it works, how it fails, and how to make it production-safe with approval modes, permission layers, and circuit breakers that stop runaway agents.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interview signal: "Describe the ReAct pattern. How would you prevent an agent from taking a destructive action?"&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  03. Memory Architecture
&lt;/h3&gt;

&lt;p&gt;In-context, external, episodic, semantic: four tiers of agent memory and when to use each. Why most agents feel dumb after the first few messages, how retrieval-augmented generation fixes it, and the architecture that lets an agent improve across runs instead of starting cold every time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interview signal: "Your agent's context window fills up mid-task. How do you handle it without losing state?"&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  04. Evaluation &amp;amp; Observability
&lt;/h3&gt;

&lt;p&gt;How do you know if your agent is actually getting better? Traces, LLM judges, weighted rubrics, and the feedback loop that closes the improvement cycle. The optimization trap is real: an agent can score perfectly on your metric and still be completely wrong in production.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interview signal: "How would you build an automated evaluation pipeline for an agent? What would you measure?"&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Pop Quiz&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick check before you go to Module 2.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the main difference between ChatGPT and a bare language model like the OpenAI Playground?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; ChatGPT wraps the model with tools (web search, code execution) that let it actually do things. The Playground is the model in its purest form. The underlying model can be the same. The difference is the tool layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; When an AI agent generates a JSON tool call, does the JSON itself perform the action?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; No. JSON is just text. It doesn't do anything by itself. Your system has to parse that JSON, call the actual function or API with those parameters, and return the result. The model generates instructions. The system executes them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Series&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Four more articles. Each one will change how you think about AI.
&lt;/h2&gt;

&lt;p&gt;This article is just the setup. The next four cover real architectures, real failures, and the exact decisions that separate agents that work from agents that break.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nexustrade.io/blog/chatgpt-is-not-an-ai-agent-controller-pattern-20260413" rel="noopener noreferrer"&gt;Everyone thinks ChatGPT is an AI agent. It's not.&lt;/a&gt;&lt;/strong&gt; — The exact architecture that turns a model into something that actually does things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nexustrade.io/blog/coinbase-fired-real-ai-agent-react-loop-20260413" rel="noopener noreferrer"&gt;Coinbase calls their chatbot an agent. I got fired for building a real one.&lt;/a&gt;&lt;/strong&gt; — The orchestration loop that separates the two.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nexustrade.io/blog/cursor-vs-claude-code-memory-architecture-20260413" rel="noopener noreferrer"&gt;Cursor beats Claude Code. Here's the memory architecture that proves it.&lt;/a&gt;&lt;/strong&gt; — All four memory tiers and why getting this wrong is why most agents feel dumb.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nexustrade.io/blog/ai-trading-bot-optimization-trap-evaluation-loop-20260413" rel="noopener noreferrer"&gt;Your AI trading bot will fail because it's optimizing the wrong thing.&lt;/a&gt;&lt;/strong&gt; — Traces, LLM judges, and the evaluation framework that actually catches it.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Part 1 of 5 in the &lt;a href="https://nexustrade.io/learn/ai-agents-from-scratch" rel="noopener noreferrer"&gt;AI Agents from Scratch&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Try NexusTrade's AI trading agent free: &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;https://nexustrade.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>career</category>
    </item>
    <item>
      <title>I was fired from my software engineering job at Coinbase for building an AI trading platform. Here’s what happened.</title>
      <dc:creator>Austin Starks</dc:creator>
      <pubDate>Wed, 08 Apr 2026 16:52:09 +0000</pubDate>
      <link>https://forem.com/austin_starks/i-was-fired-from-my-software-engineering-job-at-coinbase-for-building-an-ai-trading-platform-1h3g</link>
      <guid>https://forem.com/austin_starks/i-was-fired-from-my-software-engineering-job-at-coinbase-for-building-an-ai-trading-platform-1h3g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq9v88sx9jre833rh7ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq9v88sx9jre833rh7ed.png" alt="A computer monitor displaying a Gmail termination email from Coinbase with an employee badge hanging from the screen" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When I woke up to the email, I knew I should've prepared for the worst.
&lt;/h2&gt;

&lt;p&gt;I didn't expect this.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The investigation has concluded and it has found that you violated the Company's Outside Activities Policy. Due to this violation, Coinbase is terminating your employment, effective immediately."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So... matter-of-factly. Without empathy.&lt;/p&gt;

&lt;p&gt;Cruel.&lt;/p&gt;

&lt;p&gt;I wasn't the only one who was surprised. My entire team, including my manager, found out the day of. I was suspended, left in the dark, then promptly fired... even though I followed the rules.&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  Jumping for Joy at My Initial Offer
&lt;/h2&gt;

&lt;p&gt;When I got my initial offer from Coinbase, I couldn't stop talking about it. As someone who has been intrinsically passionate about investing for years, this was finally my chance to apply my passion with a big-boy job.&lt;/p&gt;

&lt;p&gt;Who wouldn't be excited?&lt;/p&gt;

&lt;p&gt;I got my offer contract and read it over and over again. But not just me. I also used ChatGPT and Claude to make sure I wasn't missing anything. There was one section that seemed particularly important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership of Inventions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ojmjb6dpa17rn8gw8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ojmjb6dpa17rn8gw8k.png" alt="Coinbase offer letter — Ownership of Inventions clause" width="800" height="579"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The clause that was supposed to protect me.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It seemed simple enough. Send an email to peopleops and they'll take care of the rest. I signed my offer letter, sent the email, and started my new adventure at Coinbase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft60hworhb8pfrwesln63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft60hworhb8pfrwesln63.png" alt="Gmail screenshot showing NexusTrade disclosure" width="800" height="655"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The email I sent to peopleops on September 2, 2025, before my start date. NexusTrade was disclosed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Until I was contacted by a department that I had &lt;em&gt;never&lt;/em&gt; worked with before... Compliance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Conflict of Interest
&lt;/h2&gt;

&lt;p&gt;Now, I want to be very clear about where I'm coming from. I was NOT hiding an outside business from Coinbase. If you know anything about me, you know I'm telling the truth.&lt;/p&gt;

&lt;p&gt;I talk about my app NexusTrade all of the time. It's on my GitHub profile. It's on my LinkedIn. It's quite literally the only thing I post about on Medium.&lt;/p&gt;

&lt;p&gt;It's NOT a secret. It's why I was hired.&lt;/p&gt;

&lt;p&gt;My interviewers asked me about it. My resume featured it prominently. Coinbase knew exactly what I was building when they extended the offer. I disclosed it in writing before my first day, exactly as the offer letter instructed.&lt;/p&gt;

&lt;p&gt;But unbeknownst to me, Coinbase had become interested in the intersection of AI and trading. In December 2025, they officially launched &lt;strong&gt;Coinbase Advisor&lt;/strong&gt;. I wasn't paying attention when they made the announcement; I was far too busy refactoring our question bank to let our team configure questions through Cursor or Claude Code instead of the old, error-prone manual process. So when I received a DM from Coinbase's Legal Team, I knew something was off.&lt;/p&gt;

&lt;p&gt;I kept my cool and cooperated.&lt;/p&gt;

&lt;p&gt;I'm not going to bore you with the details. The Coinbase Legal team had me go through the process. I tried to explain the situation. That my app is an algorithmic trading platform, that I'm a "one-man shop", and that there is virtually no overlap with my role working on institutional onboarding.&lt;/p&gt;

&lt;p&gt;They didn't care.&lt;/p&gt;

&lt;p&gt;I even became desperate, offering concessions like open-sourcing the project or removing digital assets.&lt;/p&gt;

&lt;p&gt;They accepted none.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It materially competes with Coinbase Advisor. It doesn't matter that you declared it. It doesn't matter that you've been working on it for five years. Shut it down, sell it, or resign, your choice."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah. What a choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Suspension
&lt;/h2&gt;

&lt;p&gt;They gave me 30 days to shut down the app, sell it, or resign.&lt;/p&gt;

&lt;p&gt;I had poured five years of my life into NexusTrade. Shutting it down wasn't an option. Selling it overnight wasn't realistic. So I tried to handle the situation like a professional. If they were effectively forcing me out over a pre-disclosed side project, I asked if we could negotiate a standard severance package so we could part ways amicably.&lt;/p&gt;

&lt;p&gt;Their response?&lt;/p&gt;

&lt;p&gt;Instead of negotiating, they immediately suspended me. On Wednesday, March 25th, I was placed on leave for sudden "conduct concerns."&lt;/p&gt;

&lt;p&gt;The accusation was baseless. I knew it because I had actually read the Code of Conduct. During the investigation, they asked me questions like whether I had worked on NexusTrade during work hours (no) and whether I had used Coinbase equipment for it (never). I answered everything honestly. Yes, I had continued working on NexusTrade since being employed at Coinbase, but always on my own time, on my own machine.&lt;/p&gt;

&lt;p&gt;It felt incredibly retaliatory: a punishment simply for asking for severance after they backed me into a corner. I raised retaliation concerns to HR on Monday, March 30th. I also raised the possibility of racial discrimination. I don't believe a white engineer with a similar side project would have been treated this way: locked out, shut down, not allowed to say goodbye. Coinbase has said that denials of outside activities are "rare." I wonder who's getting denied.&lt;/p&gt;

&lt;p&gt;Coinbase fired me one week later.&lt;/p&gt;

&lt;p&gt;They dropped the "conduct concerns" framing, switched back to "Outside Activities Violation," and terminated me effective immediately. They didn't even honor their own 30-day deadline. No severance. No transition. My manager didn't even know until the day of.&lt;/p&gt;

&lt;p&gt;Here's the timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before Day 1&lt;/strong&gt;: Disclosed NexusTrade in writing to peopleops, exactly as the offer letter instructed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;December 2025&lt;/strong&gt;: Coinbase launches Coinbase Advisor. Coinbase Legal contacts me about NexusTrade. I submit through their review process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early 2026&lt;/strong&gt;: Coinbase determines NexusTrade "materially competes" with Coinbase Advisor. Tells me to shut it down, sell it, or resign within 30 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 24, 2026&lt;/strong&gt;: I ask for a standard severance package since they're effectively forcing me out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 25, 2026&lt;/strong&gt;: Immediately suspended for "conduct concerns." One day after asking for severance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 30, 2026&lt;/strong&gt;: I raise retaliation and discrimination concerns to HR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 6, 2026&lt;/strong&gt;: Fired. "Conduct concerns" dropped. Termination reason: "Outside Activities Violation." No severance. Nobody on my team was told in advance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The App That They're So Afraid Of
&lt;/h2&gt;

&lt;p&gt;So what is this app that Coinbase felt so threatened by? Let me show you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvptktmj6hwgizw7kral.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvptktmj6hwgizw7kral.png" alt="NexusTrade landing page" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;NexusTrade: the app that got me fired.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;NexusTrade&lt;/a&gt; is an AI-powered, no-code algorithmic trading platform. I built it as a solo developer over the course of five years. Users can create automated trading strategies, backtest them against historical data, optimize parameters using genetic algorithms, and deploy them live through brokerage integrations with TradeStation, Public, Alpaca, and Tradier.&lt;/p&gt;

&lt;p&gt;It supports US-listed stocks, ETFs, cryptocurrencies, and listed options including multi-leg spreads. It has an AI agent called Aurora that can research stocks, build portfolios, and execute multi-step tasks through natural language. Users can share their strategies publicly, copy-trade other users, or monetize their strategies with a monthly subscription.&lt;/p&gt;
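&lt;p&gt;&lt;em&gt;To make "backtest it, then optimize its parameters with a genetic algorithm" concrete, here is a minimal toy sketch of that kind of loop: a moving-average-crossover backtest plus a genetic search over its two window lengths. Every name, the strategy, and the synthetic data are illustrative assumptions for this post, not NexusTrade's actual engine (which runs on Rust, per the table below).&lt;/em&gt;&lt;/p&gt;

```python
import random

def backtest(prices, fast, slow):
    """Toy backtest: hold a long position whenever the fast SMA is above
    the slow SMA; returns final equity starting from 1.0."""
    if fast >= slow:
        return 0.0  # invalid parameter pair scores worst
    equity, position = 1.0, 0
    for i in range(slow, len(prices)):
        if position:  # apply this bar's return if we were in the market
            equity *= prices[i] / prices[i - 1]
        fast_sma = sum(prices[i - fast:i]) / fast
        slow_sma = sum(prices[i - slow:i]) / slow
        position = 1 if fast_sma > slow_sma else 0
    return equity

def optimize(prices, generations=20, pop_size=20, seed=42):
    """Genetic search over (fast, slow) SMA window lengths."""
    rng = random.Random(seed)
    pop = [(rng.randint(2, 20), rng.randint(21, 100)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda p: backtest(prices, *p), reverse=True)
        elite = scored[: pop_size // 4]  # keep the fittest quarter
        children = []
        for _ in range(pop_size - len(elite)):
            a, b = rng.sample(elite, 2)  # crossover two elites, then mutate
            fast = max(2, a[0] + rng.randint(-2, 2))
            slow = max(fast + 1, b[1] + rng.randint(-5, 5))
            children.append((fast, slow))
        pop = elite + children
    return max(pop, key=lambda p: backtest(prices, *p))

# Synthetic upward-drifting price series, just for demonstration.
rng = random.Random(0)
prices = [100 + 0.1 * i + 3 * rng.random() for i in range(300)]
best = optimize(prices)
print("best (fast, slow) windows:", best)
```

&lt;p&gt;&lt;em&gt;A real platform would score candidates on risk-adjusted metrics rather than raw equity, but the shape of the loop — score a population, keep the elite, breed and mutate — is the same.&lt;/em&gt;&lt;/p&gt;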

&lt;p&gt;Now let's look at what Coinbase claimed it "materially competes" with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbskpnuxrehchtlbqwc5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbskpnuxrehchtlbqwc5r.png" alt="Coinbase Advisor landing page" width="800" height="430"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Coinbase Advisor: "The Financial Advisor experience, democratized."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Coinbase Advisor?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Coinbase Advisor&lt;/strong&gt; is an AI-powered financial advisory tool built into the Coinbase app. It was announced on December 17, 2025 during Coinbase's "System Update 2025" livestream. It's provided through Coinbase Advisors, LLC, a CFTC-registered Commodity Trading Advisor and NFA member.&lt;/p&gt;

&lt;p&gt;Coinbase is a multi-billion dollar publicly traded company. The tagline for Advisor: &lt;em&gt;"The Financial Advisor experience, democratized."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's what it actually does: users type natural language requests like "Build me a portfolio" or "What's the latest market news?" The AI asks clarifying questions about risk tolerance and investment goals, evaluates the user's existing holdings, and generates a personalized asset allocation. Users can review, edit, and adjust the recommendations before executing. It is non-discretionary, meaning the AI never trades without the user's explicit permission.&lt;/p&gt;

&lt;p&gt;As of early 2026, Coinbase Advisor is still in beta and available primarily to Coinbase One subscribers. Its investment frameworks were developed by a team of portfolio managers with over 75 years of combined experience from traditional asset management.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Side-by-Side
&lt;/h3&gt;

&lt;p&gt;Here is a fair, feature-by-feature comparison between the two products:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;NexusTrade&lt;/th&gt;
&lt;th&gt;Coinbase Advisor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Function&lt;/td&gt;
&lt;td&gt;Algorithmic trading platform&lt;/td&gt;
&lt;td&gt;AI financial advisor / chatbot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US Stocks &amp;amp; ETFs&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.coinbase.com/advisor" rel="noopener noreferrer"&gt;Planned&lt;/a&gt; (stock trading launched separately)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Options Trading&lt;/td&gt;
&lt;td&gt;Yes (multi-leg spreads, Greeks)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cryptocurrency&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Algorithmic / Rule-Based Trading&lt;/td&gt;
&lt;td&gt;Yes (no-code strategy builder)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backtesting&lt;/td&gt;
&lt;td&gt;Yes (OHLC and intraday, Rust engine)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portfolio Optimization&lt;/td&gt;
&lt;td&gt;Yes (genetic algorithms)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Assistant&lt;/td&gt;
&lt;td&gt;Yes (Aurora agent)&lt;/td&gt;
&lt;td&gt;Yes (core product)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personalized Recommendations&lt;/td&gt;
&lt;td&gt;Via AI&lt;/td&gt;
&lt;td&gt;Yes (curated by portfolio managers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External Brokerage Integration&lt;/td&gt;
&lt;td&gt;Yes (4 brokerages)&lt;/td&gt;
&lt;td&gt;No (Coinbase ecosystem only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy Trading / Social&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strategy Monetization&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP / API Access&lt;/td&gt;
&lt;td&gt;Yes (32 tools)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DCA / Automated Buying&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yield / Staking&lt;/td&gt;
&lt;td&gt;Partial (~3% via Alpaca)&lt;/td&gt;
&lt;td&gt;Yes (USDC lending)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory Status&lt;/td&gt;
&lt;td&gt;Independent platform&lt;/td&gt;
&lt;td&gt;CFTC-registered CTA, NFA member&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team Size&lt;/td&gt;
&lt;td&gt;1 person&lt;/td&gt;
&lt;td&gt;Multi-billion dollar public company&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;Live, public&lt;/td&gt;
&lt;td&gt;Beta, waitlist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where They Actually Overlap
&lt;/h3&gt;

&lt;p&gt;Let me be honest about the overlap. Both products have an AI assistant. Both support cryptocurrency. Both allow dollar-cost averaging. These are real similarities.&lt;/p&gt;

&lt;p&gt;But "both have AI" and "both support crypto" is an extraordinarily broad definition of competition. By that logic, any software engineer who has ever built a side project involving AI or cryptocurrency is a competitive threat to Coinbase. That's not a reasonable standard.&lt;/p&gt;

&lt;p&gt;The core products are fundamentally different. NexusTrade is an algorithmic trading platform. You build rule-based strategies, backtest them against years of historical data, optimize them with genetic algorithms, and deploy them across four external brokerages. Coinbase Advisor is a conversational AI that recommends crypto allocations within the Coinbase ecosystem. One is an engine. The other is a chatbot.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The math:&lt;/strong&gt; Out of 18 feature categories, the two products share &lt;strong&gt;4&lt;/strong&gt; (crypto support, AI assistant, DCA, yield/staking). NexusTrade has &lt;strong&gt;10 features&lt;/strong&gt; that Coinbase Advisor does not. Coinbase Advisor has &lt;strong&gt;1 feature&lt;/strong&gt; (CTA registration) that NexusTrade does not. The overlap is &lt;strong&gt;22%&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
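&lt;p&gt;&lt;em&gt;The arithmetic in that blockquote is easy to sanity-check; the category list below mirrors the comparison table, with the categorizations being the author's.&lt;/em&gt;&lt;/p&gt;

```python
# Overlap arithmetic from the comparison table: 4 shared categories out of 18.
shared = ["Cryptocurrency", "AI Assistant", "DCA / Automated Buying", "Yield / Staking"]
total_categories = 18

overlap_pct = round(100 * len(shared) / total_categories)
print(f"{len(shared)}/{total_categories} shared = {overlap_pct}% overlap")
```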

&lt;p&gt;Coinbase decided that 22% overlap was enough to call it "material competition" and fire someone who disclosed the project before their first day.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;I did everything right. I read my offer letter. I used AI to review it. I disclosed my project in writing before my first day. I cooperated with Legal. I offered concessions. I asked for a professional resolution.&lt;/p&gt;

&lt;p&gt;None of it mattered.&lt;/p&gt;

&lt;p&gt;The lesson here is uncomfortable but important: disclosure does not protect you. Having the rules on your side does not protect you. Cooperation does not protect you. When a company decides you're a problem, the process exists to justify the outcome, not to find the truth.&lt;/p&gt;

&lt;p&gt;I'm not bitter. I'm building.&lt;/p&gt;

&lt;p&gt;NexusTrade is still live. I'm still a one-man shop. And now I don't have to ask anyone's permission to work on the thing I've spent five years of my life creating.&lt;/p&gt;

&lt;p&gt;If you want to check out the app that got me fired, you can find it at &lt;a href="https://nexustrade.io" rel="noopener noreferrer"&gt;NexusTrade.io&lt;/a&gt;. If you've ever been in a similar situation, I'd genuinely like to hear your story.&lt;/p&gt;

&lt;p&gt;And to my former team at Coinbase: I'm sorry I didn't get to say goodbye. You deserved better than that. So did I.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article reflects my personal experience and perspective. The timeline and events described are based on my own records and recollections. I have not disclosed any confidential Coinbase information or intellectual property. All opinions expressed are my own.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>coinbase</category>
      <category>discuss</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
