Forem: jucelinux

The ten principles: locality, contracts, quarantine

jucelinux — Sun, 17 May 2026 00:40:03 +0000

Sixth and final article in the Grounded Code series. The previous five built the diagnosis and the workflow. This one names the principles that hold the rest together.

I. The setup

We've covered the cost (article 1), the mechanism (article 2), the patterns to drop (article 3), the artifact (article 4), and the workflow (article 5). This article finishes the picture with the principles that hold it all together. Ten of them, in three clusters.

I want to make the framing explicit, because the clusters are doing real work. Each one answers a different question about how code stays useful when the primary reader has changed:

Locality answers: where should things be findable?
Contracts answers: how does truth get encoded so the agent can trust it?
Quarantine answers: where do exceptions live so they don't leak into the rest of the code?

These aren't arbitrary categories. They map directly to the three axes on which re-derivation cost is paid. When the agent re-derives, it's usually because (1) the thing it needs isn't where it expected, (2) the contract it needs to honor isn't visible from the file it's reading, or (3) an exception to a pattern is invisible until it bites. The principles below remove cost on each axis.

I'll walk through all ten, with code where it helps, and flag the two cases where the audit corrected me.

II. Locality (1 to 4)

Where things should be findable, and findable with one glob.

1. Co-locate by feature, not by layer

Every file related to one feature lives in one directory. The agent finds everything about a feature with a single glob.

src/payment/
├── payment.ts              implementation
├── payment.test.ts         tests as specifications
├── payment.spec.md         contract + verification
├── payment.types.ts        public types (if separated)
├── paymentInternals.ts     local helpers
└── paymentValidation.ts    validation if non-trivial

The opposite of this is layered layout: src/controllers/payment.ts, src/services/payment.ts, src/repositories/payment.ts. To extend the payment feature, the agent has to grep three directories, read three files, and reconstruct the feature mentally from fragments.

The only top-level layers that survive are ones the agent should rarely touch: bootstrap/ for unavoidable side effects, lib/ for genuinely cross-cutting utilities, config/, possibly vendor/. Everything else is a feature directory.

2. The triplet per feature: implementation, test, spec

Every feature carries three files with parallel names:

<feature>.<ext> is the implementation.
<feature>.test.<ext> is the test, with names that read as specifications of behavior, not as descriptions of which function is being tested.
<feature>.spec.md is the markdown contract with frontmatter (article 4 covers the format).

The glob src/<feature>/<feature>.* returns all three. The agent never has to guess where any of them live.

The three files reference each other. The spec's verification block names the test commands. The test names match invariants stated in the spec. The implementation's exported names match the public_api field of the spec. When the three drift, that's a regression, and a quick visual diff catches it.
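The triplet convention is mechanical enough to check. A minimal sketch, assuming the file list comes from a glob of src/ and that the feature name equals its directory name; missingTripletFiles and the fixed .ts extension are illustrative choices, not part of any existing tool.

```typescript
// Hypothetical helper: report which members of the triplet are absent
// for one feature, given the paths found under src/.
function missingTripletFiles(feature: string, paths: readonly string[]): string[] {
  const expected = [
    `src/${feature}/${feature}.ts`,      // implementation
    `src/${feature}/${feature}.test.ts`, // tests as specifications
    `src/${feature}/${feature}.spec.md`, // contract + verification
  ]
  const present = new Set(paths)
  return expected.filter((p) => !present.has(p))
}

const found = [
  'src/payment/payment.ts',
  'src/payment/payment.test.ts',
  'src/payment/paymentValidation.ts',
]
const missing = missingTripletFiles('payment', found)
// missing is ['src/payment/payment.spec.md']: the spec was never written
```

Run over every feature directory, a check like this turns "the triplet drifted" from a visual diff into a failing CI step.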

3. Imports use full paths with explicit extensions

// Don't:
import { foo } from '@utils/foo'

// Do:
import { foo } from '../utils/foo.ts'

Path aliases save the human a few keystrokes and cost the agent a resolution hop on every import. The agent reads '../utils/foo.ts' and knows exactly where the file is. It reads '@utils/foo' and has to consult tsconfig.json to resolve the alias before it can navigate.

There's one allowed exception: workspace package boundaries in a monorepo (@org/shared-types). That's not ergonomic sugar; it's the package boundary itself.
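The rule and its one exception can be sketched as a small classifier. This assumes the alias prefixes are supplied by the caller (for example, read from the paths section of tsconfig.json); checkImportSpecifier, the verdict names, and the extension list are hypothetical.

```typescript
type ImportVerdict = 'ok' | 'alias' | 'missing-extension'

// Hypothetical checker for principle 3. aliasPrefixes would come from the
// project's own alias configuration; workspace packages such as
// '@org/shared-types' are simply not listed there, so they pass.
function checkImportSpecifier(
  specifier: string,
  aliasPrefixes: readonly string[],
): ImportVerdict {
  if (aliasPrefixes.some((prefix) => specifier.startsWith(prefix))) return 'alias'
  const isRelative = specifier.startsWith('./') || specifier.startsWith('../')
  // Relative imports must name the file on disk, extension included.
  if (isRelative && !/\.(ts|tsx|js|jsx|mjs|cjs|json)$/.test(specifier)) {
    return 'missing-extension'
  }
  return 'ok'
}

const aliases = ['@utils/']
checkImportSpecifier('@utils/foo', aliases)        // 'alias'
checkImportSpecifier('../utils/foo', aliases)      // 'missing-extension'
checkImportSpecifier('../utils/foo.ts', aliases)   // 'ok'
checkImportSpecifier('@org/shared-types', aliases) // 'ok' (package boundary)
```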

4. Files under roughly 300 lines for what the agent will modify

A soft limit. Above 300 lines, the agent reads in slices and loses context between them. Before you reach the limit, split into co-located files with disambiguating suffixes (payment.ts, paymentCore.ts, paymentValidation.ts). All start with payment, which keeps the glob clean and signals that they're internal to the feature.

The audit correction. I initially framed this as a hard rule. The triangulation across the three coding-agent codebases showed pi violating it deliberately: a 3,110-line orchestrator file with an explicit defense in its AGENTS.md. The defense is that this is an orchestrator, it's meant to be read top to bottom, and splitting it would cost more in cross-file inference than the length costs in slicing.

The rule survives as a soft preference for feature files. Orchestrators, generated code, large data tables, exhaustive switches can be longer. Make the choice deliberate, name it in AGENTS.md or CLAUDE.md, and trust the type system to keep the larger file coherent.
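One way to keep the limit soft but deliberate is to check it against a declared exception list. A sketch under stated assumptions: line counts are computed elsewhere, the exception list mirrors whatever AGENTS.md or CLAUDE.md names, and every identifier and path here is hypothetical.

```typescript
const SOFT_LINE_LIMIT = 300

interface SourceFileInfo {
  path: string
  lineCount: number
}

// Hypothetical check: flag files over the soft limit unless the team has
// explicitly declared them as exceptions (orchestrators, generated code, ...).
function overSoftLimit(
  files: readonly SourceFileInfo[],
  declaredExceptions: readonly string[],
): SourceFileInfo[] {
  const exempt = new Set(declaredExceptions)
  return files.filter((f) => f.lineCount > SOFT_LINE_LIMIT && !exempt.has(f.path))
}

const scanned: SourceFileInfo[] = [
  { path: 'src/payment/payment.ts', lineCount: 412 },
  { path: 'src/payment/paymentValidation.ts', lineCount: 120 },
  { path: 'src/orchestrator/orchestrator.ts', lineCount: 3110 },
]
const flagged = overSoftLimit(scanned, ['src/orchestrator/orchestrator.ts'])
// flagged contains only src/payment/payment.ts
```

The orchestrator passes because the exception is declared, not because the check is blind to it.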

III. Contracts (5 to 7)

How invariants get encoded so the agent can trust them without re-derivation.

5. Spec before code (the five-step loop)

For every non-trivial change, the spec gets written or updated first. Article 5 covers the full loop (spec, plan, implement, verify, consolidate). The principle here is the discipline behind it: don't write code until the contract is on disk.

The cost the principle removes is drift. Without a spec, the agent's implicit understanding of what it's building lives only in the prompt history, and the prompt history doesn't survive compaction or session boundaries. With a spec, the agent re-grounds in seconds at any point.

6. Types carry contracts; prose docs do not

Push invariants into the type system. Brand types, discriminated unions, exhaustive switches with never checks. A file that violates a type-encoded invariant fails to compile. A file that violates a prose-encoded invariant passes silently.

// Don't:
/** count must be positive */
function take<T>(items: T[], count: number): T[] {
  return items.slice(0, count) // a negative count silently slices from the end
}

// Do:
type PositiveInt = number & { readonly __brand: 'PositiveInt' }
const positiveInt = (n: number): PositiveInt => {
  if (n <= 0 || !Number.isInteger(n)) throw new RangeError(`expected a positive integer, got ${n}`)
  return n as PositiveInt
}
function take<T>(items: T[], count: PositiveInt): T[] {
  return items.slice(0, count)
}

The agent reads the type and knows the contract. It doesn't have to remember to read the docstring, and it can't accidentally call take(items, -1).

The same idea applies to state machines (discriminated unions like 'idle' | 'loading' | 'ready' | 'error'), to immutability (readonly everywhere you don't actually mutate), and to exhaustiveness (switch with never checks so adding a case forces the agent to handle it everywhere).
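A minimal sketch of the state-machine and exhaustiveness points together. RequestState, describeState, and assertNever are illustrative names; the technique (a discriminated union plus a never check in the default branch) is standard TypeScript.

```typescript
// Each state carries only the data that is valid in that state.
type RequestState =
  | { readonly kind: 'idle' }
  | { readonly kind: 'loading' }
  | { readonly kind: 'ready'; readonly data: readonly string[] }
  | { readonly kind: 'error'; readonly message: string }

// Compiles only when every case above it has been handled:
// an unhandled variant would not be assignable to never.
function assertNever(value: never): never {
  throw new Error(`Unhandled case: ${JSON.stringify(value)}`)
}

function describeState(state: RequestState): string {
  switch (state.kind) {
    case 'idle':
      return 'nothing requested yet'
    case 'loading':
      return 'waiting'
    case 'ready':
      return `${state.data.length} items`
    case 'error':
      return `failed: ${state.message}`
    default:
      return assertNever(state)
  }
}
```

Adding a fifth variant to RequestState makes describeState fail to compile until the new case is handled, which is the forcing function the paragraph above describes.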

Types are also the bridge to the spec. The public_api field of the spec lists the same names the implementation exports. When they agree, the system is consistent. When they drift, the type checker catches it before the agent does.

7. Comments are the incident-and-rationale channel

A comment exists to prevent regressions or to explain non-obvious decisions. That is the entire job description. Anything else competes with the code's own self-description and creates drift.

Write comments for:

A bug or near-bug that motivated a workaround. "This branch was added after incident-2024-08-12; do not remove without re-reading the post-mortem."
A counter-intuitive decision. "Lazy require to avoid circular dependency with telemetry.ts."
A load-bearing exception to a project rule. The eslint-disable line above a deliberate side effect, with a sentence saying why.
A constraint imposed by a dependency that isn't grep-able locally.

Don't write comments for:

What the code does. The function name and types do that.
Plans for the future ("TODO: refactor this once X ships"). Use the issue tracker.
Boilerplate prose ("This method returns a string.").
Visual section dividers (// ===== HELPERS =====). The directory layout does that.

The test for whether a comment belongs: if removing it would not confuse a future reader, it shouldn't have been written. Comments are load-bearing only when they encode something the code can't say itself.

IV. Quarantine (8 to 10)

Where exceptions to patterns are contained, so they don't compound across the codebase.

8. Side effects are quarantined and labeled

Top-level side effects (anything that runs at import time and isn't a pure declaration) live in one place: bootstrap/, src/main.ts, or equivalent. A lint rule (no-top-level-side-effects or similar) blocks them everywhere else.

When an exception is genuinely needed, the deviation is explicit:

// eslint-disable-next-line no-top-level-side-effects
// MDM checks must run before module evaluation; see incident-2024-08-12
startMdmRawRead()

The agent respects the disable comment because it's load-bearing and declared. Without the comment, the agent will "clean up" the disable on its next pass, which is exactly the kind of confident wrong fix that quarantine prevents.

The principle scales beyond side effects. Anything you want the agent to find in one place, but never anywhere else, gets the same treatment: a quarantine directory, a lint rule, and explicit exceptions that name themselves.

9. Concrete dependencies; uniform DI only if it's truly uniform

Functions receive what they actually use, not interfaces they could hypothetically use. For real circular dependencies, use a lazy require with a one-line comment:

// Lazy require: telemetry.ts imports state.ts which imports this file
const getTelemetry = () => require('./telemetry.js')

For tests, prefer integration. Mock only what's genuinely external (network, filesystem, clock, randomness). A mock that stands in for first-party code is a maintenance burden that grows with the codebase.
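As a sketch of "receive what you actually use": the function below takes a clock, which is genuinely external, and nothing else. UserSession and isExpired are hypothetical names chosen for illustration.

```typescript
interface UserSession {
  token: string
  expiresAtMs: number
}

// Depends on exactly one external thing, the clock, passed as a plain function.
// No container, no wide "services" interface.
function isExpired(session: UserSession, now: () => number): boolean {
  return now() >= session.expiresAtMs
}

const session: UserSession = { token: 'abc', expiresAtMs: 1000 }

// Production call site: the real clock.
isExpired(session, Date.now)

// Test call sites: a fixed clock, the only thing that gets faked.
const beforeExpiry = isExpired(session, () => 999) // false
const atExpiry = isExpired(session, () => 1000)    // true
```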

The audit correction. I framed this as "no DI container" early on. The triangulation showed opencode using Effect's Layer pattern across the entire codebase, and it works. The cost is paid once, then amortized. The rule survives as "drop ad-hoc, heterogeneous DI," not as "drop DI."

The deeper principle is the one I named in article 3: heterogeneity is the disease, not abstraction. A codebase with five different DI patterns charges re-derivation cost on every feature. A codebase with one DI pattern applied uniformly everywhere charges it once. If you have DI, make it uniform, or drop it. The version to drop is ad-hoc DI that wraps first-party code without a multi-implementation justification.

10. Build-time feature flags when both branches don't need to coexist

import { feature } from 'bun:bundle'

const adminPanel = feature('ADMIN_PANEL')
  ? require('./adminPanel.ts')
  : null

The false branch is dead-code-eliminated. The agent reads the source and sees exactly what's in the build, with no need to correlate with a runtime flag service.

When flags genuinely need to flip without a rebuild (kill switches, gradual rollouts), isolate them behind a single function like isEnabled(flagName) that the agent can grep in one place. Avoid scattering process.env.X === 'true' checks throughout the code; each one is a different signal that has to be correlated separately.
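A sketch of the single-function approach for runtime flags. The flag names and the FLAG_<NAME>=true environment convention are assumptions made for the example; in a real service the env argument would typically be process.env or a flag-service client.

```typescript
// Hypothetical flag names; the union keeps typos out of call sites.
type FlagName = 'ADMIN_PANEL' | 'NEW_CHECKOUT'

// The only place that knows how a flag is stored and parsed.
function createFlags(env: Readonly<Record<string, string | undefined>>) {
  return function isEnabled(flag: FlagName): boolean {
    // Assumed convention: FLAG_<NAME>=true switches the flag on.
    return env[`FLAG_${flag}`] === 'true'
  }
}

const isEnabled = createFlags({ FLAG_ADMIN_PANEL: 'true' })
isEnabled('ADMIN_PANEL')  // true
isEnabled('NEW_CHECKOUT') // false: unset flags are off
```

Every runtime check in the codebase then greps as isEnabled(, and the storage convention can change in one file.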

A note on the audit. Of all the principles, this one diverged most across the three codebases. Claude Code uses Bun's build-time feature(). opencode uses runtime config. pi uses environment-gated branches. There isn't an agent-friendly consensus here, and probably the pattern is dominated by deployment constraints rather than agent ergonomics. I'm including it because the underlying idea (the agent reads what gets shipped) holds even when the implementation varies.

V. What the list doesn't include

A few things you may notice missing.

No dependency injection rule. It's folded into principle 9, with the audit correction.

No folder convention beyond per-feature. I want the agent to discover by globbing the feature directory, not by remembering a folder convention. The fewer rules the agent has to know up front, the cheaper the first read.

No linter or formatter rule. Project-specific. Encode in lint config, not in prose. The agent respects lint rules that block compilation. It does not respect rules that exist only in a style guide nobody enforces.

No "no any" rule. True, but it's a TypeScript-specific application of principle 6. Each language has its own surface.

The principles list deliberately stays in the agnostic core. Stack-specific guidance lives in overlays, which is what the next phase of the series is about.

VI. What's still in motion

The series so far has been honest about scope. I want to keep being honest at the close.

The three codebases I audited are all coding agents. Claude Code, opencode, pi. Some of what converges across them may be specific to that subculture. The principles that handle agent reading mechanics (locality, contracts, quarantine of side effects) probably travel. The principles around feature granularity and uniform DI may need adjustment for general-purpose application backends and frontends. The overlays will be where that adjustment lives.

Three things I was wrong about. Article 3 named them, and I'm repeating the headlines here because they're load-bearing:

The 300-line file limit isn't a hard rule. It's a soft preference for feature files. pi proved this.
"No DI" isn't right either. Heterogeneous DI is the problem; uniform DI is fine. opencode proved this.
Build-time feature flags diverged across all three codebases. There's no agent-friendly consensus, and I'm holding the principle as provisional.

What I expect to be wrong about going forward. Specific predictions, so I can be tested on them:

The triplet-per-feature rule will need adjustment for serverless, where a feature is often a single function with no separate test file in the traditional sense.
The 300-line soft limit will probably need to grow as model context windows enable longer effective slices. Whether that means 500 or 1,000 is empirical.
Some principle I currently believe will turn out to be specific to TypeScript codebases and not survive the move to Rust, Python, or Go.

If you spot any of these or others I haven't seen, that's the kind of feedback the series is asking for. The thesis improves when it gets falsified in public.

VII. What's next

The agnostic core is done. The series so far covered:

The cost (the manifesto, with the 7.5x receipt).
The mechanism (how agents read).
The patterns to unlearn (anti-patterns from the human era).
The central artifact (<feature>.spec.md and out_of_scope).
The workflow (the five-step loop).
The principles (this one).

The next phase is overlays: short stack-specific documents that answer the three questions the agnostic core deliberately leaves open. What counts as a feature in this stack? Where do irreducibly cross-cutting concerns live? How does the deploy unit align with the feature unit?

Overlays for React, backend TypeScript, serverless, and fullstack are in draft. I'm holding them until the agnostic core has had enough public exposure to surface what's wrong with it. There's no point shipping overlays on top of principles that might still shift.

If you want to apply any of this before the overlays land, the five-minute experiment from the manifesto is still the cheapest way in. Pick a feature, co-locate its files, write the spec, run the agent twice (once on the old layout, once on the new), and compare the token receipts. The number you get back is the argument the series doesn't have to make.

VIII. The close

Six articles, written across a stretch of months, with the framing shifting more than once as the audit corrected my early intuitions. The version of Grounded Code I'd defend now is narrower than the version I'd have published if the audit hadn't happened, and the narrower version is the one I trust.

A few things I want to leave you with.

The principles aren't commandments. They're observations about what reduces re-derivation cost on the three coding-agent codebases I studied. If a principle here costs you more than it pays back in your context, drop it. The framework survives the loss of any individual principle. What it doesn't survive is forgetting the central question: when the agent extends feature X, can it find everything with one glob, read it in a small number of files, and trust the types plus the spec for the contract?

The architecture is for an agent. The code is still for you. Everything in this series adds a layer that makes agents fast and accurate. None of it subtracts from human readability. Where the two trade off, the cost to the human is small (slightly more boilerplate, slightly less clever abstraction) and the gain to the agent is large (one grep instead of three, one file instead of seven). If you ever feel the framework pushing you toward code that's worse for humans, push back. Something is being misapplied.

This is a working name. I've been calling the approach Grounded Code throughout the series because the central image (each feature carrying its own ground in the form of a spec) keeps holding. If a better name surfaces, I'll switch. The point isn't the brand. The point is the question.

If you've made it through all six articles, thank you for the time. The series exists because the question seems worth working on in public, and the conversation it generates is the only way I'll know which parts hold and which I've gotten wrong. Send pushback. Send falsifications. Send the experiment results from your own codebase. The thesis gets better when it gets tested by people who didn't write it.

The series so far: [1. Manifesto] [2. The primary reader changed] [3. What to unlearn] [4. The central artifact] [5. The five-step loop] [6. This one].

Next phase: stack overlays. When they're ready.

The five-step loop: spec, plan, implement, verify, consolidate

jucelinux — Sun, 17 May 2026 00:34:46 +0000

Fifth article in the Grounded Code series. The previous article introduced the spec. This one shows the workflow that turns the spec into a working feature without drift.

I. The setup

The spec is the artifact. The loop is the workflow that runs on top of it.

A reasonable objection at this point: "every engineer already does some version of this. Think before you code, plan the approach, write the code, run the tests, clean up. Why does this need a five-step ceremony with a name?"

The honest answer is that nothing in the loop is new in concept. What is new is that every step becomes a persistent artifact, because the primary reader of the codebase no longer sustains implicit habits across turns. The agent has no memory of what you decided yesterday. It has no internal narrative of "I planned it this way because." The "plan" the human carries in their head has to become a written plan that the agent can re-read. The "decision to test it manually" has to become a verification block in the spec. The whole loop is engineering discipline rendered as artifacts, because the new reader cannot inherit them otherwise.

This article has two non-obvious claims I want to defend explicitly:

Step 2 (plan mode) is the step that saves the most time. A bad plan is cheap to throw away. A bad implementation is expensive.
Skipping step 1 (the spec) is the most expensive failure mode I keep seeing. The drift it causes compounds across turns and is often invisible until verification passes for the wrong feature.

Both claims came out of watching the loop run, and watching it fail when steps were skipped. Neither was obvious to me when I started.

II. The five steps

1. Spec        Write or update <feature>.spec.md before any code.
2. Plan        Enter plan mode. Agent reads, proposes, does not write.
3. Implement   Exit plan mode. Execute the approved plan.
4. Verify      Run the spec's verification block exactly as written.
5. Consolidate Commit, reconcile spec with reality, clean the scratchpad.

Every non-trivial change goes through this loop. The point of the loop is that the spec stays the anchor across all five steps. At any moment, the agent can re-read it and re-ground itself in seconds. Without an anchor, the agent's context drifts. With one, drift is cheap to recover from.
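The loop, including its backward edges, can be written down as a small transition function. This is one reading of the steps in this article, not a tool that exists: the outcome names are invented, and sending a failed verification back to the spec is a simplification of the micro-loop described in the verify section.

```typescript
type Step = 'spec' | 'plan' | 'implement' | 'verify' | 'consolidate'
type Outcome = 'ok' | 'spec-unclear' | 'plan-wrong' | 'verification-failed'

// The forward path when a step succeeds.
const NEXT_ON_OK: Record<Step, Step | 'done'> = {
  spec: 'plan',
  plan: 'implement',
  implement: 'verify',
  verify: 'consolidate',
  consolidate: 'done',
}

function nextStep(current: Step, outcome: Outcome): Step | 'done' {
  switch (outcome) {
    case 'spec-unclear':
      return 'spec' // the plan exposed a gap in the contract
    case 'plan-wrong':
      return 'plan' // the agent cannot confirm the next change against the spec
    case 'verification-failed':
      return 'spec' // start a micro-loop from the anchor
    case 'ok':
      return NEXT_ON_OK[current]
  }
}

nextStep('spec', 'ok')           // 'plan'
nextStep('plan', 'spec-unclear') // 'spec'
nextStep('consolidate', 'ok')    // 'done'
```

Every backward edge lands on a step that has a written artifact, which is what makes recovery cheap.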

I want to walk through each step, but spend the most time on the two where I think the case isn't obvious.

III. Step 1, where most people want to skip

Before any code changes, write or update <feature>.spec.md. For a new feature, start from a blank spec. For a change to an existing feature, update the frontmatter (invariants, in_scope, out_of_scope) and the body sections that move.

This is the step humans most want to skip. Every time. The instinct is "I know what I'm doing, let me just code." I had it. Every engineer I've watched try this loop had it.

Don't skip it. This is the claim I want to defend.

The spec does not have to be perfect. It has to be honest about what you know now, and specific about the contract. A useful prompt to bootstrap from a vague idea:

"Update src/payment/payment.spec.md to reflect that discounts now stack multiplicatively, not additively. Update the invariants field and the relevant edge case in the body. Do not change the implementation yet."

The "do not change the implementation yet" matters. It keeps step 1 small and reviewable.

Here is why skipping step 1 is the most expensive failure mode I keep seeing.

Without a spec, the agent has no contract to hold onto. The first prompt establishes some implicit intent in the agent's working memory, but that intent is not durable. Three turns later, when the context is partially compacted and the agent is reasoning about an edge case, the original intent is no longer cleanly available. So the agent guesses, filling in plausible defaults that almost but not quite match what you wanted.

The compounding effect is what makes this expensive. A wrong decision in turn three influences turn four, which influences turn five. By turn ten you have a feature that "works" (tests pass, code compiles) but isn't what you asked for. Sometimes you catch it. Sometimes the PR ships.

I've seen this fail in two specific ways that are worth naming. The first: the agent extends the feature with something out of scope (payment authorization snuck into a pricing function) because nothing told it not to. The second: the agent solves the right problem with the wrong contract (a function returns the right value but with a signature that breaks two downstream callers) because the public API was never declared.

Both failures cost more to fix than the spec would have cost to write. The spec is ten to fifteen minutes of work that prevents the kind of drift that takes hours to diagnose and clean up.

If you genuinely cannot write a spec because the feature is research, write a research spec. The public_api field is empty. The purpose is "explore X." The body describes what you're trying to learn. This is still an anchor; it just anchors a learning loop instead of an implementation loop.

IV. Step 2, the step that saves the most time

Once the spec is updated, enter plan mode. In Claude Code you reach it by cycling permission modes with Shift+Tab, or by starting the session with claude --permission-mode plan. Other harnesses expose it differently. The mechanic is the same everywhere: the agent can read and propose, but cannot write to disk.

Ask the agent to propose a plan against the spec. Something like:

"Here is the updated spec. Propose a plan for the implementation. Do not write any code yet. The plan should be a numbered list of changes, each small enough to verify independently."

The plan that comes back is your contract for step 3. Read it. Push back on anything that doesn't match the spec. If the plan reveals that the spec was unclear or wrong, go back to step 1 and refine.

This is the step I want to make the strongest case for, because the value isn't obvious until you've seen it pay off.

The economic argument is simple. A plan is cheap to read, cheap to revise, cheap to throw away. A bad plan costs maybe two thousand tokens to produce and another five hundred to discuss. If you discover the plan is wrong, you've lost less than a coffee.

A bad implementation is expensive. It's hundreds or thousands of lines of code. It dirties your git state. It introduces test failures that have to be cleaned up. It rebuilds your mental model into something that doesn't match what you actually wanted. Throwing it away is hard, both technically and psychologically (the sunk cost is real).

Plan mode catches the wrong direction before it costs anything substantial. That alone would justify the step.

There's a second effect that I think matters even more. The agent reads more carefully in plan mode. The mental frame is "I am proposing," not "I am executing." It surveys the codebase, identifies the relevant files, considers alternatives. In implementation mode, the agent is reaching for tools and writing code. The reading-to-writing ratio inverts.

The plan that emerges from plan mode is usually better than the plan that would have emerged from "just write the code." Not because the agent is smarter in plan mode, but because the constraints of the mode make it think differently.

The third effect is alignment with you. Plan mode produces an artifact that's small enough to actually read. You can object to step three of seven without having to read a thousand lines of code. That objection is the cheapest possible alignment, and it happens before any code exists.

When I audited the three coding-agent codebases for the series, all three had explicit plan-mode-like affordances. Heavily used. Not optional. That convergence is the strongest evidence I have that this step is load-bearing.

The cost of plan mode is real, mostly in turns and in waiting for the agent to think. Use it for the survey and the first-cut plan, then exit plan mode and execute. Don't try to perfect every detail in plan mode; iteration there has diminishing returns.

V. Step 3, implement

Exit plan mode. Have the agent execute the approved plan. Watch the diffs as they arrive. Two things to watch for.

The diff stays inside the feature directory. If the implementation is touching files in five other features, either the feature is too big (split it, go back to step 1) or the change is genuinely cross-cutting and the feature boundary is wrong (rethink the spec). Co-location is one of the rules the diff makes legible. When the diff sprawls, something is off architecturally.
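The sprawl check is easy to automate. A minimal sketch, assuming the changed paths come from something like git diff --name-only; filesOutsideFeature is a hypothetical helper.

```typescript
// Hypothetical helper: which changed files fall outside the feature directory?
function filesOutsideFeature(
  changedPaths: readonly string[],
  featureDir: string,
): string[] {
  // Normalise so 'src/payment' does not also match 'src/paymentGateway/...'.
  const prefix = featureDir.endsWith('/') ? featureDir : `${featureDir}/`
  return changedPaths.filter((p) => !p.startsWith(prefix))
}

const changed = [
  'src/payment/payment.ts',
  'src/payment/payment.test.ts',
  'src/cart/cart.ts',
]
const sprawl = filesOutsideFeature(changed, 'src/payment')
// sprawl is ['src/cart/cart.ts']: a prompt to split the feature or rethink the boundary
```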

Public API matches the public_api field of the spec. If the agent renamed an exported symbol or changed a signature without updating the spec, that's drift. Catch it now. Either the spec needs to update or the implementation needs to revert. Don't let the spec and the implementation diverge silently.
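Since both sides are just lists of names, the comparison can be sketched in a few lines. This assumes the public_api names have already been parsed out of the spec and the exported names extracted from the implementation by other means; apiDrift and its result shape are hypothetical.

```typescript
interface ApiDrift {
  missingFromCode: string[] // declared in the spec, not exported
  missingFromSpec: string[] // exported, never declared in the spec
}

// Hypothetical check: compare the spec's public_api names with actual exports.
function apiDrift(
  specPublicApi: readonly string[],
  exportedNames: readonly string[],
): ApiDrift {
  const inSpec = new Set(specPublicApi)
  const inCode = new Set(exportedNames)
  return {
    missingFromCode: specPublicApi.filter((n) => !inCode.has(n)),
    missingFromSpec: exportedNames.filter((n) => !inSpec.has(n)),
  }
}

const drift = apiDrift(['charge', 'refund'], ['charge', 'applyDiscount'])
// drift.missingFromCode is ['refund'], drift.missingFromSpec is ['applyDiscount']
```

Either list being non-empty is the signal to stop and decide which side is wrong before the next commit.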

For non-trivial changes, do the implementation in micro-commits: one small, self-contained change per commit, each verifiable on its own. The spec stays the same across them. The diff is small. The rollback target is clear. If you're new to this pattern, the discipline feels slow at first. It pays back the first time you have to bisect a regression.

If you find yourself wanting to fork an exploration mid-implementation ("let me try a different approach and see if it works better"), don't do it in the main thread. Either go back to plan mode for a quick alternative survey, or fork a separate session for the experiment. Mixing exploration and implementation in the same context is how diffs sprawl.

VI. Step 4, verify

Run the verification block from the spec, exactly as written. Don't substitute "run all tests" for "run the specific test file the spec named." The specificity is the point. The verification block is a contract that the feature is locally testable, which means a change to the feature doesn't require running the entire test suite.

If verification fails, three things might be true:

The failure surfaces a real bug. Fix it. The fix is its own small loop (maybe update the spec to add the edge case, plan the fix, implement, re-verify). Treat it as a micro-loop, not as an emergency.

The failure surfaces a wrong test. Update the test. If the test was enforcing a rule that's no longer true, the spec also has to change. This is a hint that step 1 was incomplete. Usually the invariants field was missing a rule, or the rule was wrong.

The failure surfaces an environment issue (missing env var, broken dependency, port conflict). Fix the environment. But capture the fix somewhere durable, like CLAUDE.md or the project README. If the agent had to figure it out once, it'll have to figure it out again next session unless the lesson lands somewhere persistent.

Don't suppress failures by removing the verification check or commenting out the failing assertion. That's the kind of move that looks fine in the moment and rots over months.

VII. Step 5, the step that prevents cruft

Once verification passes, three things happen in order.

Commit. A single commit per feature change is ideal. The commit message references the spec: "update payment.spec.md: stacked discounts; implement; tests pass."

Reconcile the spec with reality. If the implementation revealed something the spec didn't anticipate, update the spec to match. The spec should describe what the code actually does, not what it was originally intended to do. This is the move that prevents stale-spec rot. The discipline is small (usually one or two field edits) but it's the only thing that keeps the spec trustworthy session after session.

Clean the scratchpad. Anything in the working memory that was relevant to this loop and only this loop, drop it. Anything that was a load-bearing learning that should survive the loop, hoist it into the spec, into CLAUDE.md, or into a relevant .claude/rules/<topic>.md file. The agent's working memory is bounded. Treating it as long-term storage is one of the most reliable ways to compact a session into uselessness.

Step 5 is the step that prevents loops from accumulating cruft. Without consolidation, every subsequent loop starts in a worse state. With it, the codebase stays grounded in its own truth.

I see step 5 skipped almost as often as step 1. The cost of skipping it is slower, more diffuse, and harder to attribute to any one decision. But it's real. After ten unconsolidated loops, the spec is no longer trustworthy, the scratchpad is full of dead context, and the next session has to spend tokens rebuilding what the previous one knew.

VIII. When the loop breaks down

A few common failure modes and what to do about each.

The spec is too vague to write yet. You're in research mode, not implementation mode. Use a research spec, as described in section III.

The plan in step 2 is huge. The feature is too big. Split it. Each split feature gets its own spec and its own loop. The original spec might still exist as an umbrella spec that links to the smaller specs in its "Related specs" section.

The agent keeps drifting from the plan during step 3. Re-read the spec out loud (in chat). Ask the agent to confirm the next single change against the spec. If it can't, the plan was wrong; go back to step 2.

Verification passes but the feature is wrong. The spec is wrong. The verification block didn't capture what was actually important. Update the spec (usually the invariants field is missing a critical rule) and re-run.

The loop is overkill for tiny changes. It is. For a typo fix or a one-line patch, skip the loop and just do it. The loop is for non-trivial changes. The line is fuzzy. When you're unsure, err toward the loop. The cost (a few minutes on the spec) is small, and the benefit (catching misalignment before it compounds) is large.

IX. Honest scope

This loop is overhead. Five to fifteen minutes of upfront work before any code happens. For features that take an hour, the math is good. For features that take five minutes, it isn't.

The discipline of the loop assumes you'll do it more than once. If you do this loop once, you're paying the full overhead cost without the amortization benefit. If you do it ten times on the same codebase, the spec accumulates knowledge that compounds across loops, and the cost-per-loop drops as your skill with the format grows.

If you can't commit to running the loop consistently, run it on the changes where misalignment is most expensive (new features, public-API changes, anything touching billing or auth). Skip it on cosmetic fixes, dependency bumps, and config tweaks. A partial loop is fine. A loop applied where it matters most is most of the value.

I don't think the loop is the only way to do this. It's the version I've seen converge across three coding-agent codebases that built it independently. Other versions exist (one team I know uses a four-step variant that folds consolidate into commit). The five-step version is what I'd start with, but the principles travel.

X. What's next

We have the artifact (spec.md) and the workflow (the loop). The final article in the series covers the principles underneath both: ten rules grouped into three clusters (locality, contracts, quarantine), each with code as proof.

If you want to try the loop before the next article ships, the cheapest experiment is to apply it to one feature you'd extend tomorrow. Write the spec. Plan in plan mode. Implement. Verify. Consolidate. Count the time you spent, and compare it to the next ten minutes of context recovery you would have spent in a free-form session. The receipt is the argument, same as it has been all series.

Next in the series: "The ten principles: locality, contracts, quarantine."

The central artifact: spec.md and the field tests-first never had

jucelinux — Sun, 17 May 2026 00:29:16 +0000

Fourth article in the Grounded Code series. The previous articles established the cost, the mechanism, and the patterns to drop. This one introduces the artifact that holds the whole system together.

I. The setup

The first three articles were diagnostic. The cost is 7.5x in tokens on the same feature. The mechanism is how agents read with grep instead of an IDE. The patterns that score badly are the ones that paid for themselves on the old axis and stopped paying on the new one.

This article goes positive. What does the agent actually need to stay anchored across turns, across sessions, across compactions, across the moment three weeks from now when a different developer asks the agent to extend the same feature?

The answer is a small markdown file that lives next to the code. It is the document the whole architecture turns around. It is where Grounded Code earns its name.

I want to make the article's three promises explicit, because they map directly to what I want you to walk away with:

The full format of <feature>.spec.md.
A complete example.
The field tests-first never had, and why it matters more than I expected.

II. The full format

A spec is a short markdown file with frontmatter (the parseable contract) and a body (the prose rationale). It lives at src/<feature>/<feature>.spec.md, next to the implementation and the test.

The skeleton:

---
purpose: One sentence, present tense, what this feature does.
public_api:
  - functionName(arg: T): U
  - OtherType: { field: T }
invariants:
  - Statement that must always hold.
  - Another statement.
in_scope:
  - What this feature is responsible for.
out_of_scope:
  - What this feature explicitly does NOT do.
verification: |
  pnpm test src/payment/payment.test.ts
  pnpm typecheck
---

# <Feature> spec

A short prose narrative explaining the why of this feature.

## Why this exists

What problem this feature solves, in one paragraph.

## Edge cases worth knowing

A list of edge cases the implementation handles.

## Related specs

Links to specs of features this one interacts with.

Six frontmatter fields, three body sections. The whole thing is usually 60 to 120 lines. Anything shorter is under-specifying. Anything longer means the feature is too big and should split.
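As a rough sketch of what "parseable contract" can mean in practice, the six fields can be modeled as a type and checked after the frontmatter has been parsed into a plain object. The names `SpecFrontmatter`, `checkFrontmatter`, and `isFrontmatter` are hypothetical, and treating every list field as required and non-empty is an assumption of this sketch, not a rule the format states.

```typescript
// Hypothetical shape for the six frontmatter fields in the skeleton.
interface SpecFrontmatter {
  purpose: string;
  public_api: string[];
  invariants: string[];
  in_scope: string[];
  out_of_scope: string[];
  verification: string;
}

const TEXT_FIELDS = ["purpose", "verification"] as const;
const LIST_FIELDS = ["public_api", "invariants", "in_scope", "out_of_scope"] as const;

// Returns a list of problems; an empty list means the object has the expected shape.
function checkFrontmatter(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of TEXT_FIELDS) {
    const value = raw[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} must be a non-empty string`);
    }
  }
  for (const field of LIST_FIELDS) {
    const value = raw[field];
    const ok =
      Array.isArray(value) &&
      value.length > 0 &&
      value.every((item) => typeof item === "string");
    // Assumption: an empty list counts as a problem, including out_of_scope.
    if (!ok) problems.push(`${field} must be a non-empty list of strings`);
  }
  return problems;
}

// Type guard built on the check, so callers get a typed object.
function isFrontmatter(raw: Record<string, unknown>): raw is Record<string, unknown> & SpecFrontmatter {
  return checkFrontmatter(raw).length === 0;
}
```

A check like this could run in CI next to the lint step, so a spec with a missing field fails before an agent ever reads it.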

Let me walk through the frontmatter fields.

`purpose`

One sentence, present tense, what the feature does. Fifteen to twenty-five words. This is what the agent sees first when it opens the spec, so it has to convey the feature without marketing language.

"Computes the final price after discounts, taxes, and currency conversion."

Not:

"A robust, extensible system for handling our complex pricing logic with support for various edge cases."

`public_api`

The exported names and their signatures. This is the surface the rest of the codebase depends on. If you add or remove a name here, downstream code is going to break. So this field doubles as a "things that must not be renamed without a migration" list.

You don't list every internal helper. Just the surface.

`invariants`

Statements that must always hold, written as plain English. The implementation enforces them, the tests verify them, the spec declares them. When all three agree, the system is consistent.

"The total returned by computeTotal is non-negative."

"If applyDiscount is called twice with the same code, the second call is a no-op."

`in_scope` and `out_of_scope`

In_scope is what the feature does. Out_of_scope is what it does not do, and crucially, what it should not be extended to do without an explicit decision.

I'm going to spend an entire section on out_of_scope below, because this is the field I keep finding to be more load-bearing than anything else in the spec. It's the one piece of the format that has no analog in tests-first development.

`verification`

The exact commands to run to verify the feature works. Copy-pasteable. This is what the agent runs as the gate before considering a change done.

verification: |
  pnpm test src/payment/payment.test.ts
  pnpm typecheck
  pnpm lint src/payment

If verification involves running the app and checking something manually, say so explicitly. The agent will then know to surface the request rather than silently skip it.

The body

Three sections, in order.

Why this exists. One paragraph. The motivation. What changes if this feature did not exist. If the feature exists because of an external constraint (a regulation, a third-party API quirk, an incident), name it.

Edge cases worth knowing. A list. Each bullet states an edge case and its handling, with the justification. This is where lessons from production live. The things that would otherwise be encoded as comments scattered through the implementation get hoisted into one readable place.

Related specs. Links to the specs of features this one talks to. Relative paths, so the agent can follow them with one read.

III. A complete example

Here's the spec for a pricing feature, fully filled out:

---
purpose: Computes the final order total including discounts, taxes, and tip.
public_api:
  - computeTotal(order: Order, rates: Rates): Money
  - Money: { amount: NonNegativeInt; currency: Currency }
invariants:
  - The total is non-negative.
  - The total is in the currency of the order.
  - Discounts apply before taxes; tip applies last.
in_scope:
  - Discount application (percent and fixed-amount).
  - Sales tax based on shipping address.
  - Tip as percentage of pre-tax subtotal.
out_of_scope:
  - Payment authorization.
  - Currency conversion (rates are received as input).
  - Refund computation (see ../refund/refund.spec.md).
verification: |
  pnpm test src/total/total.test.ts
  pnpm typecheck
---

# Total spec

## Why this exists

Computing the total is non-trivial because of ordering: discounts before
taxes (legally required in our jurisdictions), tip after tax (customer
expectation), and rounding at each step (otherwise we accumulate cent-level
discrepancies that fail reconciliation).

## Edge cases worth knowing

- **100% discount:** total is zero, not negative. Tax on a zero subtotal
  is zero.
- **Tip on a fully-discounted order:** zero (tip is percentage of pre-tax
  subtotal which is zero).
- **Mixed-currency line items:** rejected upstream; this function assumes
  single-currency orders. The type system enforces this via the `Order`
  type's `currency` field.

## Related specs

- `../order/order.spec.md` for the upstream caller.
- `../refund/refund.spec.md` for the inverse computation.

That's a sixty-line file. The implementation it describes is probably one hundred fifty lines. The tests are probably two hundred. The agent navigates all three with one glob (src/total/*) and rebuilds full understanding in two reads.
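To make the invariants concrete, here is a minimal sketch of what an implementation honoring this spec could look like. The spec names `Order`, `Rates`, and `Money` but does not define their fields, so every field below is an assumption, as is reading "pre-tax subtotal" as the subtotal after discounts (which is what the tip edge case implies).

```typescript
// Hypothetical types: field names are illustrative, not taken from the spec.
type Currency = "USD" | "EUR";
type Discount =
  | { kind: "percent"; percent: number } // 0 to 100
  | { kind: "fixed"; cents: number };

interface Order {
  currency: Currency;
  subtotalCents: number; // integer minor units
  discount?: Discount;
  tipPercent: number; // 0 means no tip
}
interface Rates {
  salesTaxRate: number; // e.g. 0.08 for 8%
}
interface Money {
  amount: number; // integer minor units, never negative
  currency: Currency;
}

function discountCents(order: Order): number {
  const d = order.discount;
  if (!d) return 0;
  return d.kind === "percent"
    ? Math.round((order.subtotalCents * d.percent) / 100)
    : d.cents;
}

function computeTotal(order: Order, rates: Rates): Money {
  // Discounts apply first; clamp so an oversized discount cannot go negative.
  const discounted = Math.max(0, order.subtotalCents - discountCents(order));
  // Tax is computed on the discounted subtotal and rounded at this step.
  const tax = Math.round(discounted * rates.salesTaxRate);
  // Tip is a percentage of the pre-tax (discounted) subtotal and is added last.
  const tip = Math.round((discounted * order.tipPercent) / 100);
  return { amount: discounted + tax + tip, currency: order.currency };
}
```

Each comment in `computeTotal` restates one line of the spec's invariants, which is the agreement between spec, implementation, and tests that the format is after.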

Notice what isn't in the spec. There's no marketing language. There's no implementation. There's no example of usage. The examples live in the test file, where they execute. The spec is a contract, not a tutorial.

IV. The field tests-first never had

I want to make a careful argument here, because TDD has earned its place over twenty years and "tests aren't enough" is not a casual claim.

Tests-first does real work. It forces the developer to state what passing looks like before coding. It catches regressions across changes. It documents behavior in an executable form. None of that goes away with spec.md. The verification field of the spec literally points at the test suite. The tests are the gate. The spec is the anchor. They do different jobs.

But there are two things tests structurally cannot do, no matter how well you write them:

1. Tests don't carry rationale. A test says expect(applyDiscount(order, '50OFF')).toEqual(...). It tells you what the system does in that case. It doesn't tell you why the discount applies before tax instead of after. It doesn't tell you whether the rounding was a deliberate decision or an accident of the implementation. When the agent has to decide whether a new requirement is a feature change or a bug fix, the tests cannot help. The rationale is somewhere else (a code comment, the head of the engineer who wrote it, a Slack thread from 2023). The spec's "Why this exists" section is where rationale belongs.

2. Tests don't carry the boundary. This is the one that surprised me. Tests state what is true about the feature. They do not state what is not part of the feature. Nothing in your test suite says "this function should not be extended to handle payment authorization." A reasonable agent, asked to extend the pricing function with a payment step, would do so. The test would not fail. The PR would even compile. And now a function that was supposed to be pure pricing computation is doing payment authorization, with all the failure modes that implies (latency, retries, error handling, observability).

The out_of_scope field is the explicit boundary. Listed, parseable, sitting at the top of the spec. When the agent reads the spec, it reads what the feature does NOT do, and the boundary is anchored. Asked to add payment authorization, the agent now responds with "that's listed as out_of_scope; should I create a new feature or expand the scope of this one?" That single question prevents a class of drift that no amount of test coverage can catch.
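"Parseable" can be taken fairly literally. The sketch below pulls one list field out of a spec's frontmatter with a few lines of string handling. `readListField` is a hypothetical helper, it is not a YAML parser, and it assumes the two-space `  - item` layout used in the examples in this article.

```typescript
// Minimal line-based reader for one list field in spec frontmatter.
// Assumption: the frontmatter is delimited by two "---" lines and list
// items are written as "  - item" directly under "field:".
function readListField(specText: string, field: string): string[] {
  const lines = specText.split("\n");
  const start = lines.indexOf("---");
  const end = lines.indexOf("---", start + 1);
  if (start === -1 || end === -1) return [];

  const frontmatter = lines.slice(start + 1, end);
  const header = frontmatter.indexOf(`${field}:`);
  if (header === -1) return [];

  const items: string[] = [];
  for (const line of frontmatter.slice(header + 1)) {
    if (!line.startsWith("  - ")) break; // the next field has started
    items.push(line.slice(4).trim());
  }
  return items;
}
```

Called as `readListField(specText, "out_of_scope")`, this returns the boundary as a plain list of strings, which a review script or a pre-commit hook could print next to a proposed change.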

In the audit of the three coding-agent codebases, this is the field I kept seeing carry weight. opencode's specs explicitly list things the feature won't do. Claude Code's AGENTS.md documents at the codebase level use the same pattern. The boundary, once stated, holds.

The format calls this field load-bearing. After watching it work in practice, I think load-bearing is an understatement. The out_of_scope field is what stops a feature from quietly becoming five features.

There's a secondary effect that's almost as valuable. When you sit down to write out_of_scope and find the field hard to fill, that's a signal. It usually means the feature itself has no real boundary. The discipline of filling in out_of_scope is the discipline of deciding what the feature actually is.

V. What does not go in the spec

A short list, because the spec is small on purpose.

The implementation. The spec describes; it doesn't implement.
Examples of usage. Those go in the test file, where they're executable. The test name is the spec sentence; the test body is the example.
TODOs. Use the issue tracker. TODOs in a spec rot.
Diagrams. If a diagram would help, prefer the type system, then a single ASCII or Mermaid block. Don't reach for a separate image.
Marketing copy. The spec is for the agent and the engineer, not the website.

The spec is a contract document. Anything that doesn't help the agent or the engineer extend the feature correctly is noise.

VI. Honest limits

Two things I want to flag before someone tries this and is disappointed.

Specs cost time to write. The first spec for a feature takes maybe ten to fifteen minutes to fill in properly. That's real. The benefit shows up across all subsequent agent sessions, but the initial cost is real. For tiny features (a one-line utility, a single config constant), the cost outweighs the benefit. Skip the spec for those.

Specs need to be maintained. A stale spec is worse than no spec. If the implementation drifts away from the spec, the agent will trust the spec and produce wrong code. The discipline that makes this work is the discipline of updating the spec when the feature changes. The next article in the series (the loop) puts a step five at the end of every change for exactly this reason: consolidate the spec back to truth.

If you can't commit to updating the spec, don't write one. Stale documentation is the older, more familiar version of this problem, and the agent makes it worse, not better, by trusting confidently wrong content.

VII. What's next

The spec is the anchor. The next article is about the loop that runs on top of it: spec, plan, implement, verify, consolidate. Five steps, with the read-only plan mode being the one that saves the most time, and skipping step one being the most expensive failure mode I have seen in practice.

If you want to try writing a spec before the next article ships, pick a feature you'd extend tomorrow and fill in the frontmatter. Don't worry about the body yet. The frontmatter alone will tell you whether the feature has a coherent boundary. Half the value of the spec lands in the first ten minutes.

Next in the series: "The five-step loop: spec, plan, implement, verify, consolidate."

What to unlearn: the most expensive patterns from the human era

jucelinux — Sun, 17 May 2026 00:20:47 +0000

Third article in the Grounded Code series. The first two established the axis. This one names the patterns that score badly on it, and the ones I was wrong about.

I. The setup

The manifesto showed the cost. The second article showed the mechanism: how agents read, why re-derivation cost has a measurable unit (the token), and why bigger context windows don't fix it. We have the new reader, the new axis, and the receipts.

This article goes prescriptive. Which specific patterns earned their place when humans were doing the reading, and now cost more than they pay back?

I want to mark the framing carefully before I start, because the answer is more interesting than "delete these from your codebase." Some patterns survive untouched. Some survive in a modified form. Some need to go. And in three places, my initial intuition was wrong, and the audit corrected me. The last section names those.

II. The deeper finding: heterogeneity, not abstraction

Before going through the patterns one by one, I want to flag the conceptual move that came out of the audit and reshaped my read of every one of them.

I started by assuming abstraction itself was the cost. Each interface, each layer, each indirection meant another grep, another file read, another re-derivation. The natural conclusion was "less abstraction, fewer hops."

That conclusion turned out to be wrong, or at least incomplete.

The cost compounds when abstractions are heterogeneous. A codebase with five different DI patterns sprinkled around (one feature uses constructor injection, another uses a service locator, a third uses a context object, a fourth uses ambient module state) charges the agent re-derivation cost on every feature. The agent has to learn each pattern fresh, every time.

A codebase with one abstraction, applied uniformly everywhere, charges the cost once. The agent reads the Layer module (or the DI container, or the service registry) once, and every subsequent feature is a constant-cost extension of that knowledge.

This is what I saw in opencode's Effect Layer pattern. Every dependency is wired the same way, in the same place, with the same shape. The pattern is abstract, even heavy. But it's also uniform. The cost is paid once and amortized.

So the rewrite of the prescription is:

The disease is heterogeneity. Uniformity dissolves abstraction cost. Reduce the number of distinct patterns in your codebase, and apply them everywhere.

Holding this in mind, the anti-patterns below are mostly cases where the pattern is hard to apply uniformly across a real codebase, which means it will be heterogeneous in practice, which means it will compound cost. Not "abstract is bad," but "this particular abstraction tends not to stay uniform, and that's what costs."

III. Premature abstraction

Looks like: three call sites that look similar get extracted into a generic helper before there's a real reason to share.

Why it worked for humans: DRY was drilled into us as a virtue. Three concrete duplications felt like an open invitation to refactor. The shared helper communicated "these things are the same."

Why it costs agents: the generic helper takes at least two lookups to understand (the definition, then each caller that passes different arguments), while three concrete sites are self-documenting and let the agent read each in isolation. The abstraction often turns out to be wrong as soon as the fourth case shows up, and refactoring a generalized helper is harder than refactoring three concrete copies.

There's also a uniformity problem. Premature abstractions tend to leak. The shared helper accumulates flags, options, and "just this one case" branches over time. Two years later, no one wants to read it. The agent reads it in a fog and pattern-matches the wrong thing.

Rule of thumb: wait for the fourth use case before generalizing. The Claude Code source has this rule encoded into its own prompt to the model. Three duplications is not the trigger. Four is.

What to do instead: for the first three sites, copy. Each site stays small and readable. When the fourth case appears, the actual shape of the generalization is usually obvious, and the helper that emerges fits.
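A small illustration of the trade, with hypothetical names throughout. The three concrete validators each read in isolation; the generic version below them is the kind of helper that collects options as the cases diverge.

```typescript
// Three concrete sites. Similar, but each is complete on its own.
function isValidUsername(name: string): boolean {
  return name.length >= 3 && name.length <= 20;
}
function isValidTeamName(name: string): boolean {
  return name.length >= 3 && name.length <= 40;
}
function isValidProjectSlug(slug: string): boolean {
  return slug.length >= 3 && slug.length <= 40 && slug === slug.toLowerCase();
}

// The premature generalization: every new case adds an option, and every
// caller now has to be read together with this definition.
interface NameRule {
  min: number;
  max: number;
  lowercaseOnly?: boolean;
  allowEmpty?: boolean; // "just this one case"
}
function isValidName(value: string, rule: NameRule): boolean {
  if (value === "" && rule.allowEmpty) return true;
  if (rule.lowercaseOnly && value !== value.toLowerCase()) return false;
  return value.length >= rule.min && value.length <= rule.max;
}
```

Both versions behave the same for the three original cases. The difference is what a reader has to load to understand a single call site: one function, or one function plus the rule object and its flags.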

IV. Ad-hoc dependency injection

Looks like: functions and classes that take five to twelve interfaces in their constructor or signature, each interface defined in a separate file, wiring done at call sites or by hand.

Why it worked for humans: it made the unit testable in isolation. It communicated "this thing depends on these capabilities." It also let teams parallelize work, each owning an interface contract.

Why it costs agents: every interface is another grep before the agent knows the actual call shape. A function that takes { db, logger, metrics, clock } looks self-documenting until you realize the agent has to load four interface definitions and their primary implementations to understand what happens at runtime.

The deeper problem is heterogeneity, which I flagged in section II. Ad-hoc DI rarely stays uniform. Feature A wires deps one way, feature B another. The agent learns each wiring fresh.

The exception that matters. Uniform DI works. opencode uses Effect's Layer pattern across the entire codebase: every dependency wired the same way, in the same place. Once the agent has read the Layer module, every feature is a constant-cost extension. The cost was paid once.

So the prescription isn't "drop DI." It's "if you have DI, make it uniform, or drop it." The version to drop is ad-hoc, heterogeneous DI that wraps first-party code without a multi-implementation justification.

What to do instead: concrete dependencies plus integration tests. The function imports what it uses. The test runs the real thing where possible. Reserve interfaces for genuine multi-implementation cases (a StorageAdapter with S3, Local, and InMemory implementations is fine; the abstraction is paying its way).
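For contrast, here is a sketch of the case where an interface does pay its way. Only the name `StorageAdapter` and the idea of S3, Local, and InMemory implementations come from the text above; the method signatures, the `InMemoryStorage` class, and `saveReceipt` are assumptions made for illustration.

```typescript
// Hypothetical interface shape for a storage abstraction with several
// real implementations (S3, local disk, in-memory).
interface StorageAdapter {
  put(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
}

// The in-memory implementation, useful for tests and local runs.
class InMemoryStorage implements StorageAdapter {
  private readonly items = new Map<string, string>();

  async put(key: string, value: string): Promise<void> {
    this.items.set(key, value);
  }

  async get(key: string): Promise<string | undefined> {
    return this.items.get(key);
  }
}

// Feature code takes the interface because more than one implementation
// really exists; everything else it uses would be a plain import.
async function saveReceipt(
  storage: StorageAdapter,
  orderId: string,
  body: string,
): Promise<string> {
  const key = `receipts/${orderId}.txt`;
  await storage.put(key, body);
  return key;
}
```

The test for a sketch like this can pass `new InMemoryStorage()` and exercise the real `saveReceipt`, which keeps it close to the integration-style testing recommended above.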

V. Codegen from schema

Looks like: an OpenAPI, GraphQL, or Protobuf schema, plus a build step that generates TypeScript types and client functions into src/generated/.

Why it worked for humans: single source of truth. "Free" types from the schema. Less hand-written boilerplate.

Why it costs agents: when the agent needs to add a field, it now has to identify (1) the schema source, (2) the codegen rule, (3) the build step, (4) the regen command, and (5) avoid editing the generated file directly. Five hops instead of one. The generated file is also typically gitignored or huge, both of which break the agent's mental model of the repo.

The "saves engineering time" argument is largely empty when the engineer is an agent that types at the speed of light. The original cost was human keystrokes. That cost is now near-zero. The cost that replaced it is the agent's re-derivation work across the codegen toolchain.

The exception that matters. Wire formats that humans should not author by hand (Protobuf descriptors, ABI bindings, large enum tables from external authorities). For these, keep the generated code committed and obvious. Don't bury it under .gitignore.

What to do instead: hand-write the boilerplate. Agents handle boilerplate well, in seconds. Reserve codegen for cases where the generated output is genuinely not human-authorable.

VI. Macros and decorators with implicit behavior

Looks like: @Injectable(), @Controller('/users'), @AutoMap, or any decorator that mutates the class it sits on without naming what it does at the call site.

Why it worked for humans: declarative. Reduced boilerplate. "Just works" if you knew the framework.

Why it costs agents: the behavior comes from somewhere not co-located. The agent reads @Controller('/users') and has to know what @Controller does, where it's defined, and what it injects into the class at runtime. Pattern-matching on identifiers fails when the meaning is elsewhere.

The deeper issue is that decorators are often conditional on framework state the agent can't see. Whether a method gets wrapped, whether a property gets serialized, whether a route gets registered: all of this can depend on runtime ordering, module load order, or other decorators on the same class. The agent reads the source. The behavior is at runtime. The two don't match.

The exception that matters. Decorators that are purely declarative and have no behavioral side effects (a @deprecated marker, for example, or a pure type-level brand) are fine. They're metadata, not behavior.

What to do instead: an explicit registration call at module level. The agent sees the registration in the same file as the class and infers the wiring directly.

// Instead of:
@Controller('/users')
class UserController { ... }

// Do this:
class UserController { ... }
router.register('/users', UserController)

The registration line is right there. The agent reads one file and knows what's going on.
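A runnable version of the same idea, as a minimal sketch. The `Router` class, its `dispatch` method, and the `Controller` interface are hypothetical and stand in for whatever routing layer a project actually uses; the point is only that the wiring is an ordinary call the agent can read in place.

```typescript
// Hypothetical minimal router: explicit registration, no decorators.
interface Controller {
  handle(request: { path: string }): string;
}
type ControllerCtor = new () => Controller;

class Router {
  private readonly routes: Array<{ prefix: string; ctor: ControllerCtor }> = [];

  register(prefix: string, ctor: ControllerCtor): void {
    if (this.routes.some((route) => route.prefix === prefix)) {
      throw new Error(`route already registered: ${prefix}`);
    }
    this.routes.push({ prefix, ctor });
  }

  dispatch(path: string): string {
    for (const route of this.routes) {
      // Match the prefix itself or anything nested under it.
      if (path === route.prefix || path.startsWith(`${route.prefix}/`)) {
        return new route.ctor().handle({ path });
      }
    }
    throw new Error(`no route for ${path}`);
  }
}

class UserController implements Controller {
  handle(request: { path: string }): string {
    return `users:${request.path}`;
  }
}

// The wiring is one visible line in the same file as the class.
const router = new Router();
router.register("/users", UserController);
```

Nothing here depends on load order or framework state: a grep for `register(` finds every route in the codebase.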

VII. Patterns I'll mention briefly

I covered three of these in detail in the second article. Reposting the headline cases here for completeness; refer back for the full mechanics:

Path aliases (@utils/foo). Every import becomes a two-step resolution. Use relative paths with explicit extensions. The one exception is workspace package boundaries in a monorepo, where the alias is the package name itself.
Inheritance trees. Method resolution order is invisible at the call site. The agent has to walk the chain. Use composition: small functions, small types, explicit calls.
Living style guides disconnected from code. A STYLE.md with prescriptions that the actual code doesn't follow competes for the agent's attention. Encode style in lint config that fails the build, or in .claude/rules/style.md, which the agent discovers as it walks the tree. The agent reads the code, not the guide.

VIII. What I was wrong about

This is the part of the article I owe the reader, because the audit didn't only confirm my hypotheses. It also corrected three of them.

1. Strict 300-line file limit. I expected to find a hard rule across all three codebases. I didn't. pi has a 3,110-line file with an explicit defense in its AGENTS.md: this is an orchestrator, it benefits from being read top to bottom, it's not feature code. The rule survives as a soft preference for feature files, not as a hard cap. Orchestrators, generated code, and large data tables can be larger.

2. Strict no-DI. I framed DI as an anti-pattern early on. The audit showed opencode using Effect's Layer pattern uniformly across the codebase, and it works. The cost is paid once. The rule survives as "drop ad-hoc, heterogeneous DI," not as "drop DI." Uniform DI is fine. The disease was heterogeneity, not abstraction.

3. Build-time feature flags. I expected convergence on one pattern (probably Bun's feature() constant, since that's what Claude Code uses). The three codebases diverged. Claude Code uses build-time constants. opencode uses runtime config. pi uses environment-gated branches. There's no agent-friendly consensus here, and probably the pattern is dominated by deployment constraints, not agent ergonomics.

If you spotted other things I have wrong, that's exactly the kind of feedback the series is asking for. The thesis improves when it gets falsified in public.

IX. What's next

We now have the patterns. The next article goes to the central artifact that makes the whole system work: the <feature>.spec.md document that sits next to the code, gives the agent a ground truth one read away, and carries the field tests-first never had: out_of_scope.

If you want to start applying any of this before the next article ships, the cheapest move is the one from article 1: pick a feature, co-locate everything about it in one directory with parallel names, and watch what happens to your agent's token consumption. The receipts are the argument.

Next in the series: "The central artifact: spec.md and the field tests-first never had."

The primary reader changed

jucelinux — Thu, 14 May 2026 22:20:54 +0000

How agents actually read code, why re-derivation cost has a unit, and why this doesn't go away when context windows grow.

Second of five articles in the Grounded Code series. The first one showed the cost. This one shows the mechanism underneath.

I. The receipt, and what's underneath it

In the manifesto I showed a number: 7.5x more tokens for the same feature, with the same final verification on both sides. That's the receipt. This article is about what produced it.

The short version is that code has a new primary reader, and the new reader works with a completely different set of tools, a completely different memory model, and a completely different way of recognizing patterns. The patterns we adopted over twenty years answered the old reader's needs. The cost in the new reader's mode of reading wasn't on anyone's chart, because there was no reader to charge it to.

There is now.

II. How a human reads code

For a moment, ignore the agent. Look at what your IDE has been doing for you.

When you open a feature in a clean codebase, you carry roughly five to ten files in working memory at any time. Your tooling lets you jump between them in a fraction of a second. Cmd+P, fuzzy match, you're in the file. Cmd+click on a symbol, you're at its definition. The IDE follows path aliases, computes type inference, resolves imports, highlights related symbols, all silently.

You recognize patterns by similarity. Two functions that "look like they do the same thing" register as related even if they share no literal substring. A PaymentService and a UserService register as parallel because of conceptual symmetry, not because of any string they share.

You carry the project's mental model across days. The feature you worked on Tuesday is still in your head Thursday morning. You don't need to re-derive what the system does every time you open the editor.

Domain language helps you bridge files. When you read Aggregate in one file and EntityRoot in another, you know they belong to the same DDD vocabulary even if neither imports the other. The vocabulary is in your head, not in the code.

Clean Code, DDD, Clean Architecture, hexagonal: every pattern in that family was tuned to this reader. The trade-offs they made (some indirection, some boilerplate, some abstraction) were paid in exchange for a real benefit: this reader's mental load went down.

The reader is still you. The reader is still real. But on the codebases that ship features regularly with an agent in the loop, this reader is not doing most of the work anymore.

III. How an agent reads code

An agent reading code does not scroll. It does not Cmd+P. It does not Cmd+click. It has none of the silent IDE machinery you take for granted.

What it has is three primitives:

grep, to find files containing a literal pattern.
glob, to find files matching a path pattern.
read with offset and limit, to load a slice of a file into context.

That's the whole toolset for navigating a codebase. Every operation that an IDE does for you in milliseconds becomes a tool call. Every tool call costs tokens (for the call, for the result, for the model to parse the result). Every file the agent reads stays in context until something pushes it out. There is no persistent project memory. The session is the memory.
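To make the three primitives tangible, here is a toy model of them over an in-memory "repository". This is a sketch for intuition, not how any particular agent implements its tools: the `Repo` type is hypothetical, and the `glob` here only supports `*` within a single path segment.

```typescript
// An in-memory stand-in for a repository: path -> file contents.
type Repo = Map<string, string>;

// grep: paths of files whose contents include a literal pattern.
function grep(repo: Repo, literal: string): string[] {
  const hits: string[] = [];
  repo.forEach((text, path) => {
    if (text.includes(literal)) hits.push(path);
  });
  return hits;
}

// glob: paths matching a pattern, where `*` matches within one segment.
function glob(repo: Repo, pattern: string): string[] {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*/g, "[^/]*"); // then expand the wildcard
  const matcher = new RegExp(`^${escaped}$`);
  const hits: string[] = [];
  repo.forEach((_text, path) => {
    if (matcher.test(path)) hits.push(path);
  });
  return hits;
}

// read: a slice of a file's lines, starting at `offset`, at most `limit` lines.
function read(repo: Repo, path: string, offset = 0, limit = Infinity): string[] {
  const text = repo.get(path);
  if (text === undefined) throw new Error(`no such file: ${path}`);
  return text.split("\n").slice(offset, offset + limit);
}
```

Every question the agent has about the codebase has to be phrased as some sequence of these three calls, and each call's result is text that lands in context.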

This changes pattern recognition completely. The agent doesn't see similarity the way you do. It recognizes patterns by co-located names and exact shapes. If two files have parallel filenames (user.ts and user.test.ts) and parallel structure inside, the agent will pattern-match them. If one is models/user.ts and the other is __tests__/user-tests.ts, the relationship is invisible until something explicit ties them together.

Domain language doesn't bridge files for an agent the way it does for you. If Aggregate is mentioned in domain/order/aggregate.ts and EntityRoot is mentioned in infra/repository.ts, the agent has no automatic understanding that these terms are part of the same vocabulary. It might learn the connection by reading a glossary if one exists and is in context. But the vocabulary doesn't live in the agent's head between sessions, because there is no head.

Each turn, the agent rebuilds enough of the project model to do the next action. The model is rebuilt from what's currently in context, plus what it fetches with its three primitives. Everything not in context has to be re-fetched, which means re-paid in tokens.

This is the new primary reader. None of this is a deficiency of the model. It is the structural reality of how a stateless reasoning system, equipped with grep, navigates a codebase.

IV. Three places the cost shows up

Let me make this concrete with three patterns from the original article and show what they actually cost.

A. Path aliases

// services/user.ts
import { User } from '@models/user'

export async function getUser(id: string): Promise<User> {
  // ...
}

Human path. You Cmd+click on @models/user. Your IDE knows the alias config from tsconfig.json and lands you in src/models/user.ts in one keystroke.

Agent path. The agent reads services/user.ts and sees @models/user. It does not know what that resolves to. It reads tsconfig.json. The tsconfig.json may extend a base config, so it reads that too. It identifies the paths entry that maps @models/* to src/models/*. Now it can glob for src/models/user.ts. Then it reads it.

Three file reads where the IDE did one keystroke. Multiply by every import in every file the agent has to navigate during a task.

The fix is one line of change per import:

// services/user.ts
import { User } from '../models/user.ts'

Now the path is the path. No tsconfig hop. The agent reads services/user.ts, sees ../models/user.ts, globs it, reads it. Two operations instead of three or four.
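The extra hop can be written down. The sketch below is a simplified version of the lookup the agent has to reconstruct by reading tsconfig.json: match the specifier against a `paths` entry and substitute the captured part. `resolveAlias` is a hypothetical helper that handles only the single-`*` form and ignores `baseUrl`, `extends`, and fallback targets, all of which real TypeScript resolution takes into account.

```typescript
// Hypothetical helper: resolve an import specifier against a
// tsconfig-style `paths` map. Returns undefined when no alias matches.
function resolveAlias(
  specifier: string,
  paths: Record<string, string[]>,
): string | undefined {
  for (const [pattern, targets] of Object.entries(paths)) {
    const target = targets[0]; // only the first target is considered here
    if (target === undefined) continue;

    const star = pattern.indexOf("*");
    if (star === -1) {
      if (pattern === specifier) return target;
      continue;
    }

    const prefix = pattern.slice(0, star);
    const suffix = pattern.slice(star + 1);
    const longEnough = specifier.length >= prefix.length + suffix.length;
    if (longEnough && specifier.startsWith(prefix) && specifier.endsWith(suffix)) {
      const captured = specifier.slice(prefix.length, specifier.length - suffix.length);
      // Use a function replacer so `$` in the captured text is not special.
      return target.replace("*", () => captured);
    }
  }
  return undefined;
}
```

With a relative import there is nothing to resolve: the function returns `undefined` because no alias applies, and the specifier already is the path.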

B. Deep dependency injection

function process(deps: ProcessDeps) {
  const user = deps.db.users.findById(deps.context.userId)
  deps.logger.info('processed', { user })
  deps.metrics.increment('process.success')
  return user
}

Human path. You read process(deps: ProcessDeps). You know how the app wires deps together because you wrote it. You move on.

Agent path. The agent reads process(deps: ProcessDeps). It does not know what ProcessDeps contains. It greps for ProcessDeps. It finds the interface, with eight fields, each typed as another interface. For each field used in the function body (db, logger, metrics, context), it greps for the interface definition. For each interface, it greps for the implementation. For each implementation, it may have to load multiple files to understand the actual call shape at runtime.

What started as "what does process do?" has become five to ten file reads, just to understand the four lines of code.

The fix is concrete dependencies for first-party code:

import { findUserById } from '../db/users.ts'
import { logger } from '../logging/logger.ts'
import { metrics } from '../metrics/metrics.ts'

function process(userId: string) {
  const user = findUserById(userId)
  logger.info('processed', { user })
  metrics.increment('process.success')
  return user
}

Now the function is self-evident. The imports name the actual call shape. The agent reads one file and understands.

I want to mark this case carefully, because it's the most contested in the set. Genuine multi-implementation abstractions (a StorageAdapter with S3, Local, and InMemory implementations) still earn their place. The cost they incur on the agent is real, but the benefit (the ability to swap implementations) is also real. The pattern to drop is ad-hoc DI that wraps first-party code without a multi-implementation justification. That's the version that adds five greps and gives back nothing the runtime structure didn't already say.
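For contrast, a minimal sketch of an abstraction that does earn its place. The article names StorageAdapter and its implementations but does not define them, so the interface shape, the `saveReceipt` caller, and the synchronous signatures here are assumptions made for brevity.

```typescript
// Assumed shape: the article names StorageAdapter but does not define it.
// Kept synchronous for brevity; a real adapter would likely be async.
interface StorageAdapter {
  put(key: string, data: string): void
  get(key: string): string | null
}

// One of several real implementations; S3 and Local versions would sit beside it.
class InMemoryStorage implements StorageAdapter {
  private readonly items = new Map<string, string>()

  put(key: string, data: string): void {
    this.items.set(key, data)
  }

  get(key: string): string | null {
    return this.items.get(key) ?? null
  }
}

// Hypothetical caller. Injection is justified here because callers
// really do run against different backends.
function saveReceipt(storage: StorageAdapter, id: string, body: string): string | null {
  const key = `receipts/${id}`
  storage.put(key, body)
  return storage.get(key)
}

const stored = saveReceipt(new InMemoryStorage(), 'r1', 'paid')
```

The interface costs the agent one extra definition to load, but that definition buys three interchangeable backends. The ad-hoc `ProcessDeps` bundle costs the same kind of lookup and buys nothing.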

C. Inheritance

class PaymentProcessor extends BaseProcessor {
  // override of handle() not visible here, lives in BaseProcessor or grandparent
}

Human path. Cmd+click on paymentProcessor.handle() at the call site. The IDE jumps to the actual override location, or if there isn't one, follows the chain up.

Agent path. The agent greps for class PaymentProcessor. It reads the class definition. There is no handle() method visible. It identifies the extends BaseProcessor. It greps for class BaseProcessor. It reads that file in full. Maybe handle() is there; maybe BaseProcessor also extends something. The chain continues. Each class file gets opened in full because the agent doesn't know up front which one has the method.

To understand a single method call, the agent may read three to five class files. The inheritance chain (the method resolution order, or MRO) that your IDE resolves in one click is invisible to the agent until it walks the chain itself.

The fix is composition. A function that takes the data it needs. A small file. One read instead of five.
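A minimal sketch of what that can look like, assuming `handle()` validates a payment and returns a result. The `Payment` and `PaymentResult` types and both function names are hypothetical, since the article does not show the body of `handle()`.

```typescript
// Hypothetical types: the article does not show what handle() works on.
type Payment = { id: string; amountCents: number; currency: string }
type PaymentResult = { id: string; status: 'accepted' | 'rejected'; reason?: string }

// A shared step becomes a plain function instead of an inherited method.
function validateAmount(payment: Payment): string | null {
  return payment.amountCents > 0 ? null : 'amount must be positive'
}

// Everything handlePayment does is visible in this one file.
function handlePayment(payment: Payment): PaymentResult {
  const problem = validateAmount(payment)
  if (problem !== null) {
    return { id: payment.id, status: 'rejected', reason: problem }
  }
  return { id: payment.id, status: 'accepted' }
}

const accepted = handlePayment({ id: 'p1', amountCents: 500, currency: 'USD' })
const rejected = handlePayment({ id: 'p2', amountCents: 0, currency: 'USD' })
```

There is no parent class to open. The behavior that used to be inherited is an explicit call the agent can see at the point of use.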

V. Re-derivation cost has a unit

What these three examples have in common: every architectural decision now has a token receipt attached.

That's the move that matters. For thirty years, architectural cost was a soft variable. You could talk about "complexity," "cognitive load," "maintainability." These were real, but they were estimated, not measured. Two engineers could disagree about whether a layer was worth it, and there was no way to settle the argument except by accumulated experience.

Now there is. The unit is the token. Every architectural choice has a measurable receipt: how many tokens does the agent need to extend the average feature on this architecture? You can run the experiment in your own repo. You can compare the receipt across two organizations of the same code.

I want to name what we are measuring:

Re-derivation cost. The total tokens spent re-establishing a fact that the architecture didn't make immediately available. Cross-file inferences, alias hops, interface chases, MRO walks. Everything the agent has to do to know something the codebase didn't put where it could be read.

The 7.5x in the manifesto is re-derivation cost being measured. The 8.5x in cache reads is re-derivation cost compounding (the agent reloads what it already had, because something pushed it out of context). The 4.4x in output tokens is re-derivation cost spreading into the agent's own reasoning (more turns, more thinking, more rationalization).

When architectural cost has a unit, architectural decisions become testable. That changes the conversation.

VI. "But context windows are growing"

The most common objection I hear when I lay this out: "Sure, but context windows are getting bigger. 200K today, a million next year, ten million after that. Won't this whole problem just go away?"

It's a fair objection and the answer is "no, and here's why."

Context window size and attention quality are two different things. A model with a million-token context can technically read your entire repo into one prompt. The question is what the model actually does with that mass of tokens.

Attention quality on compact, self-contained contexts is what models are good at. A 4K-token window with everything the model needs and nothing it doesn't: that's where you get clean reasoning and reliable outputs. The reasoning works because the relevant facts are not buried under irrelevant ones.

Attention quality on diffuse, sprawling contexts is where things degrade. A million-token context full of cross-references, mostly-irrelevant background, and only a thin thread of what's actually needed: the model has to filter, and filtering is where mistakes leak in. The needle in the haystack problem is real even when the haystack fits.

The architectures that score well on re-derivation cost happen to be the same architectures that produce compact, self-contained contexts. Co-located feature directories. Files under three hundred lines. Imports without indirection. These principles aren't about saving raw bytes. They are about giving the model a context that's tight, relevant, and self-contained, regardless of how big the window technically is.

So no, I don't think context-window growth deprecates this. If anything, it makes the principles more durable. Because as windows grow, the temptation is to throw more of the codebase at the model. The temptation is wrong. The same principles that minimized re-derivation cost on a small window minimize attention spread on a large one.

VII. What comes next

Now we have the mechanism. The next article gets prescriptive: which specific patterns score badly on the new axis, and what to do about them.

I want to flag the framing upfront. The next article is not "delete these patterns from your codebase." It is "these patterns paid for themselves on the old axis, and the trade-off changed on the new one." Surgical, not demolition. You will keep some of them. You will drop others. The point of the article is to let you make that choice with the cost made visible.

If you want to try the experiment from the manifesto before the next article ships, do. The principle is the same: pick a feature, ask the agent to extend it, count the tokens. The receipt is the argument.

Next in the series: "What to unlearn: the most expensive patterns from the human era."

Grounded Code: Your codebase has a new primary reader. And it’s paying 7.5x more.

jucelinux — Sun, 10 May 2026 00:50:30 +0000

Same feature, same verification, two architectures. What three coding-agent codebases converge on, and why Clean Architecture isn’t wrong, just answering the old question.

First of six articles in the Grounded Code series. Architecture principles for codebases primarily extended by agents.

I. The number nobody was measuring

Same task. Same agent (Claude). Same final output, with Verification: ✅ on both runs. The difference is in the middle column.

Metric	agent-opt	clean-ddd	ratio
Wall clock	73s	329s	4.5x
Turns	17	52	3.1x
Total tokens	230,858	1,722,500	7.5x
Cache reads	195,489	1,652,055	8.5x
Output tokens	3,882	16,941	4.4x
Verification	✅	✅	—

The first column is a codebase organized around conventions tuned for the agent that’s going to extend the code. The second is the same feature, same stack, organized with Clean Architecture and Domain-Driven Design: layers, ports and adapters, separation by responsibility, ubiquitous language.

The “clean” version needed 4.5x the wall clock, 3.1x the turns, 7.5x the tokens. And then the one that bothered me most: 8.5x more cache reads. Each extra cache read is the agent reopening one more file to rebuild context it had already had and lost.
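The ratio column can be reproduced from the raw figures. A small sketch, using the numbers from the table above; the `Run` type and the `ratio` helper are illustrative names, not part of any harness.

```typescript
// Figures copied from the benchmark table above.
type Run = {
  wallClockSeconds: number
  turns: number
  totalTokens: number
  cacheReads: number
  outputTokens: number
}

const agentOpt: Run = {
  wallClockSeconds: 73,
  turns: 17,
  totalTokens: 230858,
  cacheReads: 195489,
  outputTokens: 3882,
}

const cleanDdd: Run = {
  wallClockSeconds: 329,
  turns: 52,
  totalTokens: 1722500,
  cacheReads: 1652055,
  outputTokens: 16941,
}

// Ratio of the expensive run to the cheap one, rounded to one decimal place.
function ratio(cheap: number, expensive: number): number {
  return Math.round((expensive / cheap) * 10) / 10
}

const receipt = {
  wallClock: ratio(agentOpt.wallClockSeconds, cleanDdd.wallClockSeconds), // 4.5
  turns: ratio(agentOpt.turns, cleanDdd.turns), // 3.1
  totalTokens: ratio(agentOpt.totalTokens, cleanDdd.totalTokens), // 7.5
  cacheReads: ratio(agentOpt.cacheReads, cleanDdd.cacheReads), // 8.5
  outputTokens: ratio(agentOpt.outputTokens, cleanDdd.outputTokens), // 4.4
}
```

The same function works for any two runs of your own: put the cheaper run first and read the multiplier.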

Here’s the detail that changed how I see the problem.

The more expensive architecture wasn’t wrong by any traditional test. The code compiled. The layers were drawn correctly. The ports did what they promised. A senior engineer would review the PR and approve it without comments. It worked. It was just expensive.

Expensive in what? Expensive in the only resource that matters when the codebase’s primary reader has changed.

II. The primary reader changed

For two decades every architectural pattern we adopted answered an unstated question: how do I make this code easy to review for a human?

Clean Code, DDD, Clean Architecture, hexagonal, ports-and-adapters, ubiquitous language. All of them optimize a reader who carries 5 to 10 files in working memory at once, infers purpose from elegance and DRY, recognizes patterns by conceptual similarity, and reuses domain language to connect concepts across distant files.

That reader still exists. But it isn’t doing the bulk of the reading anymore.

An agent reading code doesn’t scroll. It runs grep, glob, and reads narrow slices with offset+limit. Each file that isn’t in context costs tokens to re-read. Sometimes it doesn’t even fit. Patterns get recognized by co-located names and exact shapes, not by similarity. Domain language is signal only when it appears verbatim in the file the agent is currently looking at.

Every cross-file inference is a token cost. Every grep chase is a context cost. Every re-derivation of a fact already encoded somewhere else is wasted time and a wasted turn.

This is the axis that wasn’t on the chart. Let’s name it:

Re-derivation cost. The unit of architectural cost in the agent era.

The table above measures that cost with three-digit precision. The 1,722,500 tokens in the second column are, mostly, the sum of every time the agent had to re-derive something the clean architecture didn’t have where it needed it. Because the clean architecture was never optimized for that.

III. Clean Architecture isn’t wrong. It’s answering the old question.

This is where I want to be surgical, because going confrontational here would land badly and also be inaccurate.

Clean Architecture, DDD, hexagonal. These patterns optimize a real trade-off, with real cost and benefit columns:

Cost: some boilerplate, some indirection, some hops between files.
Benefit: a human can reason about the system without holding 30 details in their head.

That trade-off still holds on its own terms. The cost is real. The benefit is real. They balanced when the primary reader was a human about to review the module, debug a bad night, or onboard on a Friday morning.

What changed is that a third column showed up on the chart:

Re-derivation cost for the agent.

And on that column, the same patterns score badly. Each extra hop the human thanks you for is one more grep the agent pays for. Each DI interface the human reads as “this is testable” is one more definition the agent has to load into context. Each layered import (controllers/payment.ts, services/payment.ts, repositories/payment.ts) the human appreciates for “separation of responsibilities” is three files opened in sequence just to extend a feature.

Clean architecture didn’t become wrong. It became expensive. And the cost is now paid in tokens, on the critical path of development, day after day, on every feature the agent extends.

The question every architectural decision has to answer changed. It used to be “is this elegant? is this DRY? is this testable in isolation?” Now it’s:

When the agent needs to extend feature X, can it find everything about X with one glob, read it in a small number of files under 300 lines, and trust types plus spec for the contract, without leaving the feature directory?
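Part of that question is mechanical enough to check. A rough sketch, assuming you already have a list of the files a feature touches with their line counts; `checkFeatureLocality` is a hypothetical helper, and the line counts in the example are made up for illustration.

```typescript
// Hypothetical helper: checks a feature's files against the question above.
type FileInfo = { path: string; lines: number }

type LocalityReport = {
  agentFriendly: boolean
  outsideFeatureDir: string[]
  atOrOverLimit: string[]
}

function checkFeatureLocality(files: FileInfo[], featureDir: string, maxLines = 300): LocalityReport {
  const prefix = featureDir.endsWith('/') ? featureDir : featureDir + '/'
  // Files the agent would have to leave the feature directory to find.
  const outsideFeatureDir = files.filter(f => !f.path.startsWith(prefix)).map(f => f.path)
  // "Under 300 lines" means a file with exactly maxLines already fails.
  const atOrOverLimit = files.filter(f => f.lines >= maxLines).map(f => f.path)
  return {
    agentFriendly: outsideFeatureDir.length === 0 && atOrOverLimit.length === 0,
    outsideFeatureDir,
    atOrOverLimit,
  }
}

// Line counts are invented for the example.
const layered: FileInfo[] = [
  { path: 'src/controllers/payment.ts', lines: 120 },
  { path: 'src/services/payment.ts', lines: 410 },
  { path: 'src/repositories/payment.ts', lines: 90 },
]
const coLocated: FileInfo[] = [
  { path: 'src/payment/payment.ts', lines: 180 },
  { path: 'src/payment/payment.test.ts', lines: 140 },
  { path: 'src/payment/payment.spec.md', lines: 40 },
]

const layeredReport = checkFeatureLocality(layered, 'src/payment')
const coLocatedReport = checkFeatureLocality(coLocated, 'src/payment')
```

This covers only the locality and size parts of the question. Whether types plus spec are enough to trust the contract still needs a human or an agent to judge.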

If yes, agent-friendly architecture.

If no, it isn’t. Doesn’t matter how clean it looks in human review.

IV. Triangulation: three codebases, three teams, three convergences

The honest objection to all of this is: “OK, but did you pull this conclusion from one codebase, or just from your own?”

Fair objection. I’ll answer with actual evidence before going on to the prescriptive part.

I audited three coding-agent codebases built by independent teams, with no coordination between them:

Claude Code (Anthropic): sourcemap dump, March 2026.
opencode (SST): production dev branch.
pi (Pi Labs): packages/coding-agent, packages/agent, packages/ai.

Three teams. Three stacks with subtle differences (all TypeScript, but around different runtimes, extension formats, and plugin models). Zero cross-communication. The three codebases converge on the same structure:

Directories by feature, not by layer. src/tools/BashTool/ in Claude Code holds implementation, prompt, path validation, permission logic, and UI all in the same place. The equivalents in opencode and pi follow the same mold.
Files under 300 lines for what the agent will modify. Orchestrators can be larger; feature code can’t.
Relative imports with explicit extensions, no path aliases. ../utils/foo.ts, not @utils/foo.
Comments as an incident-and-rationale channel, not as explanatory prose about what the code does. (“Lazy require to avoid circular dependency with telemetry.ts”: yes. “This function adds two numbers”: no.)
A spec document next to the feature, with parseable frontmatter.
A five-step loop (spec → plan → implement → verify → consolidate) showing up in three slightly different flavors.

Convergence without communication is the strongest empirical signal I have. It isn’t proof. It’s convergence. But it’s the kind of convergence three disciplined teams, working in parallel, rarely produce by accident.

The caveat that matters. All three are coding agents. Some of these conventions may be specific to that subculture. The series will mark, on every principle, how far the evidence travels: what came from the three coding agents, what survives in general-purpose backend and frontend applications, and what’s still informed speculation waiting for a test. I won’t ship generalizations I haven’t validated.

V. Grounded Code

Every new category needs a name. Without one it stays “some conventions these teams use” and never crosses over to the next codebase.

The name:

Grounded Code. Each feature carries its own ground: a spec next to the code, close enough that the agent never has to derive it.

The central image is the anchor. The agent, reading through narrow slices, is always one file away from knowing:

What this feature does (purpose:).
What the public contract is (public_api:).
What’s out of scope (out_of_scope:, a field tests-first never had).
How to verify (verification:).

When context fills up and the summary has to cut, the spec survives. When the session compacts, the agent gets re-anchored in seconds. When you come back to the project two weeks later, the spec is where you read first.
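As a sketch of what “one file away” can mean in practice, the four fields above can be treated as a required set. The field names come from the list above; the value types, the example values, and the `missingSpecFields` helper are assumptions, since the full spec format is covered in a later article.

```typescript
// Field names come from the list above; the value types are assumptions.
type FeatureSpec = {
  purpose: string
  public_api: string[]
  out_of_scope: string[]
  verification: string
}

const REQUIRED_FIELDS = ['purpose', 'public_api', 'out_of_scope', 'verification'] as const

// Hypothetical helper: reports which anchor fields a parsed frontmatter lacks.
function missingSpecFields(frontmatter: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter(field => {
    const value = frontmatter[field]
    return value === undefined || value === null || value === ''
  })
}

// Example values are invented for illustration.
const paymentSpec: FeatureSpec = {
  purpose: 'Charge a customer for an order',
  public_api: ['chargeOrder(orderId)'],
  out_of_scope: ['refunds', 'subscription billing'],
  verification: 'run the payment tests',
}

const complete = missingSpecFields(paymentSpec)
const incomplete = missingSpecFields({ purpose: 'Charge a customer for an order' })
```

A spec that fails this check leaves the agent to derive the missing anchor from the code, which is the cost the spec exists to remove.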

The spec is the ground. The code is the house. The test is the gate. All three live on the same plot, and the agent never has to leave it to understand what’s going on.

The name also resolves an important asymmetry: the natural opposite of grounded is floating. Code that drifts, that has to be re-anchored every turn, that pays tokens to re-derive what could have been read once. That’s exactly the cost the first table measured.

VI. What’s coming in the series

Six articles. Each one stands alone, but they’re meant to be read in sequence:

This one. The manifesto and the number.
The primary reader changed. The mechanism behind the cost. How agents actually read code, why re-derivation cost has a measurable unit now (the token), and why bigger context windows don't fix it.
What to unlearn. The anti-patterns inherited from the human-optimized world: deep DI, inheritance, premature abstraction, codegen, path aliases, implicit decorators. Each one with the case for why it worked before, and why it costs now.
The central artifact. The full format of <feature>.spec.md, with a real example, and the innovation tests-first never had: the out_of_scope field.
The five-step loop. Spec → plan mode → implement → verify → consolidate. Why step 2 (read-only plan mode) is the one that saves the most time, and why skipping step 1 is the most expensive failure mode.
The ten principles. Grouped into three clusters (locality, contracts, quarantine), with code as proof in each.

In every article I’ll mark how far the evidence travels. Where the three coding agents converge, I’ll say so. Where only Claude Code carries the pattern, I’ll say so. Where I’m speculating outside the subculture, I’ll say so.

VII. The five-minute test

Before you close this tab, an experiment. Five minutes.

Pick a feature from your project. Any one. An endpoint, a component, a lambda. Something you’d have to extend tomorrow.

Ask your agent to extend it. Count:

Total tool calls.
Turns until verification passes.
Tokens consumed (Claude Code’s /cost shows them; other harnesses vary).

Then apply only principle #1: co-locate everything about the feature in a single directory, with parallel names. Don’t change the code. Just move files and fix imports.

Ask again.

If the numbers drop, even half as much as they did in the benchmark above, you just measured a margin that was hidden in your project. If they don’t, principle #1 doesn’t apply to your case, and you just helped me calibrate the scope of the series.

Either way, the number you get back is the argument I don’t have to make.

Next in the series: “The primary reader changed.”


Micro-SaaS Playbook

jucelinux — Sat, 18 Oct 2025 19:47:18 +0000

🧠 Micro-SaaS Creation Framework

1. Micro-SaaS Fundamentals

Definition:

A Micro-SaaS is an application tightly focused on a specific niche, usually a “feature of a feature”, built by one person or a very small team, with high profit margins (80-90%) and very low operating costs.

Key characteristics:

A niche audience and a niche product.
High profitability and low cost.
Scales without requiring dozens of people.
Can be built in a weekend.
Ideal for solo founders or tiny teams.

2. The Micro-SaaS “Playbook”

Step 1 – Find the Itch

Identify a real, specific pain point, preferably one you feel yourself.

Validate it with a simple post (“tweet validation”):

“If a tool existed that did X, would you use it?”

Tip: start with your own pain.

Step 2 – Build the Audience / Community

Before you even launch the product:

Share ideas and lessons learned (LinkedIn, X, TikTok).
Start discussions in communities (Reddit, Slack, Discord).
Watch which pain points come up repeatedly.

Step 3 – Build the MVP in 48 Hours

Put together an MVP over a weekend.

Use tools that maximize speed:

Next.js, Supabase, Cloudflare Workers, LangChain, Vercel, Stripe, etc.

Automate everything you can and avoid hiring.

Step 4 – Build in Public

Share your progress daily:

Show dashboards and small lessons learned.
Talk about the pains, mistakes, and wins.
That transparency builds empathy and spreads.

Step 5 – Distribution & Growth Flywheel

The proposed virtuous cycle:

Build a community.
Learn that community's pain points.
Build something that solves those pain points.
Generate word of mouth + recurring revenue.
Reinvest the revenue in audience and community.

Over time, the compounding effect starts running on its own.

3. User Journey

The user goes through a painful manual workflow.
They search for a solution (Google, ChatGPT, Perplexity).
They find your micro-SaaS (a good domain helps).
They hit the “wow moment”: the product actually works.
They try it, see the value, and pay.
They refer friends, or earn credits for doing so.
Traffic and SEO grow organically.

4. Funnel Metrics

Example flow:

10,000 visitors/month
1,500 trials
300 active users
150 paying customers
30 advocates

The secret is not only the paying customers but the top of the funnel (audience + community).
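The step-to-step conversion rates implied by the example funnel can be worked out directly. A minimal sketch using the figures above; the English stage labels and the `stageConversions` helper are illustrative names.

```typescript
type Stage = { stage: string; count: number }
type Conversion = { from: string; to: string; percent: number }

// Figures from the example funnel above.
const funnel: Stage[] = [
  { stage: 'visitors', count: 10000 },
  { stage: 'trials', count: 1500 },
  { stage: 'active users', count: 300 },
  { stage: 'paying customers', count: 150 },
  { stage: 'advocates', count: 30 },
]

// Conversion from each stage to the next, as a percentage with one decimal.
function stageConversions(stages: Stage[]): Conversion[] {
  const result: Conversion[] = []
  let previous: Stage | undefined
  for (const current of stages) {
    if (previous !== undefined) {
      const percent = Math.round((current.count / previous.count) * 1000) / 10
      result.push({ from: previous.stage, to: current.stage, percent })
    }
    previous = current
  }
  return result
}

const conversions = stageConversions(funnel)
// Share of all visitors who end up paying.
const visitorToPaying = Math.round((150 / 10000) * 1000) / 10
```

On these numbers, 15% of visitors start a trial and 1.5% of visitors end up paying, so each extra thousand visitors at the top is worth roughly fifteen paying customers at the bottom.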

5. Monetization Strategy

Charge from day one.
Trust comes from value, not from being free.
Prefer recurring (SaaS) or usage-based pricing.
Give affiliates a 20% commission.
Useful platforms: Rewardful, Lemon Squeezy, Paddle.

6. Churn Control (Cancellations)

To retain users:

Deliver continuous value (“ship daily”).
Improve every week.
Accept that high churn early on is natural.
Focus on always adding value, not just selling.

7. Where to Innovate

Think in terms of the four quadrants:

Type	Reach	Polish	Example
DIY Script	Broad	Low	GitHub repos
Plugin	Niche	Low	Shopify App
Big SaaS	Broad	High	Salesforce
Micro-SaaS (ideal)	Niche	High	Bank Statement Converter

8. Opportunity Heat Map

Prioritize:

High demand / few tools.
Avoid markets with heavy competition from VC-backed companies (large startups).
Use Google Trends, IdeaBrowser, Product Hunt, and Reddit to find gaps.

9. The Infinite Game

Improve the product constantly.
Be kind to your users.
Answer questions in your niche (X, Reddit, groups).
Grow by 5-10 followers a day.
Stay focused: create value and make an impact.

10. Example ideas mentioned in the video

Bank Statement Converter – converts bank statements to Excel.
Pod Scriptor – automatically generates podcast clips and notes.
Spec Sheet – compares technical PDFs and extracts insights.
Cart Saver – personalized cart-recovery videos.
Grant Guru – generates complete grant proposals with AI.

11. Idea Radar (Passion × Monetization)

Ideal:

High passion
High monetization potential

Avoid:

High passion / low monetization (an expensive hobby)
Low passion / high monetization (fast burnout)

12. Founder Mindset

Play the long game: the “infinite game”.
Reinvest the profit.
Build a portfolio of micro-SaaS products (multi-preneur).
Be transparent, learn in public, and keep improving.

🔁 Visual summary (Growth Flywheel)

[Audience]
     ↓
[Learning the Pain Points]
     ↓
[Niche Product]
     ↓
[Wow Moment + Retention]
     ↓
[Recurring Revenue]
     ↓
[Reinvest in Audience]
     ↺ (infinite loop)

Closing Motto

“Innovate in a niche.
Build in public.
Sell real value.
And repeat forever.”

AI Agents for Marketing: A Real-World Content Automation Case Study

jucelinux — Sun, 05 Oct 2025 22:33:27 +0000

The Challenge: 45 Articles per Month Without Becoming a Generic Content Factory

Creating 45 quality articles per month sounds crazy, right? That was my exact thought when I set this goal for AffiliateOS – my experiment in AI-powered affiliate marketing.

Speed alone is not the problem. The real challenge is keeping editorial quality, a unique voice, and authenticity at scale. Off-the-shelf AI tools (Jasper, Copy.ai, and the like) spit out content fast, but everything sounds the same – that bland corporate tone no one wants to read.

The hypothesis I decided to test: what if, instead of one generalist AI, I built eight specialized agents? Each would have a specific responsibility, working together like a real marketing team.

This article documents the journey. It is not a step-by-step tutorial nor a puffed-up success story. It is an honest account of how I built the system, the technical decisions, the challenges I hit, and the lessons I learned along the way.

Why Content Automation Is So Hard

Modern content marketing faces a cruel dilemma: scale versus quality.

You need to publish consistently to rank on Google. But every article must be unique, useful, and engaging if you want to convert readers. Templates help with consistency but kill personality. Generic AI is fast, but the output has no soul.

Here is what does NOT work:

Rigid templates: every review looks identical. Readers notice. Google does too.
Generic prompts: “Write an article about X” yields shallow, predictable content.
AI without context: hallucinates technical data, ignores niche nuances, uses the wrong jargon.
No process: writing straight away leads to weak structure and poor SEO.

What I needed was a system with a point of view – one that deeply understands each stage of content creation and preserves quality through specialization.

The Solution: An Architecture of 8 Specialized Agents

Think of every agent as a specialist within a marketing team. The planning-agent is the creative strategist. The content-agent is the writer. The seo-agent is the performance analyst. Each excels at one job.

Here is the complete architecture:

🧠 Orchestrator Agent – Coordinates the entire pipeline and makes strategic prioritization decisions.

📝 Planning Agent – Collaborates with you about the article idea and creates a detailed creative brief (concept, angle, tone, bespoke structure).

🗺️ Routing Agent – Chooses the best URL by analyzing niche benchmarks (for instance, /coupons/ converts 4× better than /reviews/ according to VPNMentor data).

🔍 Research Agent – Gathers up-to-date product data, studies the competition, and validates information before creation begins.

✍️ Content Agent – Produces the article using the creative brief and a custom structure (not generic templates).

🎯 SEO Agent – Optimizes meta tags, headings, and keywords without breaking the content’s personality.

🚀 Publisher Agent – Handles Git commits, automated deploys, and publication validation.

📊 Analytics Agent – Monitors performance and suggests data-backed optimizations (planned – kicks in after initial validation).

The full flow is Planning → Routing → Research → Content → SEO → Publish → Analytics.

From the initial concept to automatic deploy it takes 20–35 minutes. But the magic is not the speed – it is the quality maintained through specialization.

🛠️ Tech Box: Agent Architecture and Prompts

Each agent is a .md file inside .claude/agents/ with a specialized prompt. They are not scripts – they are refined instructions that guide Claude Sonnet 4.5 through the Claude Code CLI.

Code stats:

8 agents defined (7 implemented, 1 planned)
3,170 lines of specialized prompts
Largest agent: content-agent (790 lines)
Smallest agent: seo-agent (305 lines)

Why specialized prompts beat generic ones:

Specific context: Planning-agent knows requirement elicitation strategies. SEO-agent knows meta description formulas.
Clear guardrails: Publisher-agent has a critical rule – NEVER commit without explicit approval.
Consistent output: Each agent produces structured JSON/Markdown consumed by the next agent.
Iterative refinement: Prompts evolve with real-world use. Planning v2.0 is 40% better than v1.0.

Excerpt from the Planning Agent:

## YOUR ROLE
You are the planning-agent of AffiliateOS. Your role is to collaborate
with the user to define a complete creative brief for content creation.

## IMPORTANT PRINCIPLES
1. NEVER suggest generic templates. Each article must have unique structure.
2. ALWAYS ask clarifying questions before assuming intent.
3. BALANCE technical depth with accessibility based on target audience.
4. DEFINE specific tone - "professional" is too vague.

Tools available to the agents:

WebSearch: fetch up-to-date information
WebFetch: extract content from specific URLs
Read/Write: manipulate project files
Grep/Glob: search patterns in the codebase
Bash: run commands (git, npm, deploy)

How agents coordinate:
The orchestrator-agent manages the handoffs. When the planning-agent generates creative-brief.json, the orchestrator checks the JSON structure before passing it to the routing-agent. This prevents cascading failures.
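The orchestrator here is a prompt, not code, but the kind of structural check it performs can be sketched. The field names below follow the creative brief example shown later in this article; the `isCreativeBrief` and `handOff` helpers are hypothetical.

```typescript
// Fields follow the creative brief example in this article.
type CreativeBrief = {
  title: string
  concept: string
  angle: string
  tone: string
  structure: { type: string; description: string }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Structural check only: it does not judge whether the brief is any good.
function isCreativeBrief(value: unknown): value is CreativeBrief {
  if (!isRecord(value)) return false
  const textFields = ['title', 'concept', 'angle', 'tone']
  if (!textFields.every(key => typeof value[key] === 'string')) return false
  const structure = value['structure']
  return (
    isRecord(structure) &&
    typeof structure['type'] === 'string' &&
    typeof structure['description'] === 'string'
  )
}

function parseJson(raw: string): unknown {
  try {
    return JSON.parse(raw)
  } catch {
    return undefined // malformed JSON fails the structural check below
  }
}

// Hypothetical handoff: stop here instead of passing a broken brief downstream.
function handOff(rawJson: string): CreativeBrief {
  const parsed = parseJson(rawJson)
  if (!isCreativeBrief(parsed)) {
    throw new Error('creative brief failed the structure check')
  }
  return parsed
}
```

Failing at the handoff means a malformed brief stops the pipeline at the planning stage, rather than surfacing as a confusing error three agents later.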

Planning Agent: How the Collaborative Dialogue Works

This is where AffiliateOS v2.0 got interesting. I initially had rigid templates (review.md, comparison.md). They worked, but every piece looked the same. Cookie-cutter websites.

The solution was the planning-agent – it talks the idea through with you before writing a single line.

How it works in practice:

You: “I want to create content about NordVPN.”

Planning Agent: “Great! Let me understand. What’s the goal of this content?”

You: “I want to show how NordVPN is good for privacy.”

Planning Agent: “Nice angle. Who are you targeting? Beginners or tech-savvy users?”

You: “More beginners – people concerned about privacy but without technical knowledge.”

Planning Agent: “Got it. And the tone? Educational and approachable, or a more aggressive ‘stop being tracked NOW’ vibe?”

You: “Educational, no alarmism.”

Planning Agent: “Understood. I’ll craft a brief for an educational article on privacy with NordVPN, aimed at beginners, approachable tone. Suggested structure: [...]”

The outcome is a complete creative brief in JSON with:

A unique article concept
A specific editorial angle
Personalized tone
Custom structure (no generic template)
Keywords aligned with intent
Quality guidelines

Real example of the brief generated for THIS article:

{
  "title": "AI Agents for Marketing: A Real Case Study",
  "concept": "Document the journey of building AffiliateOS",
  "angle": "Authentic, educational case study. Not a tutorial, not an inflated success story.",
  "tone": "Approachable technical mentor - understands technology but explains it clearly.",
  "structure": {
    "type": "layered",
    "description": "Layer 1 (narrative) accessible to everyone. Layer 2 (technical boxes) for devs."
  }
}

Switching to this format eliminated the cookie-cutter sameness. Every article now has its own personality, a structure tuned to its specific goal, and a tone suited to the audience.

Routing Agent: Data-Driven URL Decisions

Another crucial learning: the route matters as much as the content.

When I launched the first pieces I used /reviews/ for everything. Semantically it made sense, but it ignored a market reality: different URLs convert differently.

Real benchmark example:

Studying VPNMentor (the VPN niche leader) I found:

/coupons/nordvpn converts 4× more than /reviews/nordvpn
/best-vpn-for-streaming ranks better than /vpn-reviews/streaming
/tools/ has 2× the CTR of /resources/ for utility content

The routing-agent automates this analysis. It:

Reviews niche benchmarks (VPNMentor for VPN, NerdWallet for fintech, swyx.io for tech personal brands)
Consults routing-config.json with conversion metrics by route
Chooses the best URL based on keyword intent and conversion potential
Justifies the decision with data (no guesswork)

For THIS article, it chose /blog/ because:

That’s the go-to pattern for personal tech brands (swyx.io, kentcdodds.com, leerob.com)
Aligns with the goal of building authority (not immediate conversion)
SEO-friendly for informational long-tail keywords
Simplicity wins over over-engineering (I don’t need a separate /case-studies/ yet)

🗺️ Tech Box: How the Routing Agent Analyzes Benchmarks

The routing-agent doesn’t invent decisions. It consults the routing-config.json file which maps site routes to performance metrics.

Structure of routing-config.json:

{
  "active_routes": [
    {
      "path": "/blog",
      "purpose": "Opinion pieces, insights, and technical reflections",
      "benchmark_sites": ["swyx.io", "kentcdodds.com", "leerob.io"],
      "notes": "Authority content - builds personal brand and expertise"
    },
    {
      "path": "/reviews",
      "purpose": "Honest reviews of tools and SaaS",
      "benchmark_sites": ["techradar.com", "theverge.com/reviews"],
      "notes": "Affiliate content - monetization while maintaining authenticity"
    }
  ],
  "route_performance_insights": {
    "/blog": {
      "typical_ctr": "1-2%",
      "intent": "informational",
      "conversion_stage": "awareness-top-funnel",
      "seo_potential": "high - thought leadership keywords"
    },
    "/reviews": {
      "typical_ctr": "4-6%",
      "intent": "commercial",
      "conversion_stage": "consideration-decision",
      "seo_potential": "high - product keywords"
    }
  }
}

Decision algorithm:

Parse creative brief: identify goal (monetize vs build authority), audience, and keywords
Match intent: informational keywords → awareness routes. Commercial keywords → decision routes.
Check benchmarks: see what niche leaders do for similar content
Score routes: expected CTR × SEO potential × goal alignment
Output: chosen route + data-backed justification
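
The scoring step can be sketched in TypeScript. This is a minimal illustration, not the routing-agent's actual logic (the agent is a prompt, not a script). The field names follow routing-config.json; the Brief type, the numeric weights, and the function names are assumptions made for the example.

```typescript
type Intent = "informational" | "commercial";

// Mirrors one entry of route_performance_insights in routing-config.json.
interface RouteInsight {
  typical_ctr: string; // e.g. "4-6%"
  intent: Intent;
  seo_potential: string; // e.g. "high - product keywords"
}

// Hypothetical, reduced view of the creative brief.
interface Brief {
  goal: "authority" | "monetize";
  keywordIntent: Intent;
}

// Midpoint of a "4-6%" style range, as a fraction (0.05).
function ctrMidpoint(range: string): number {
  const [low, high] = range.replace("%", "").split("-").map(Number);
  return (low + high) / 2 / 100;
}

// Assumed weights for the leading word of seo_potential.
function seoWeight(potential: string): number {
  const level = potential.split(" ")[0];
  return level === "high" ? 1.0 : level === "medium" ? 0.6 : 0.3;
}

// Assumed alignment rule: routes whose intent matches the goal and the keywords score highest.
function goalAlignment(brief: Brief, insight: RouteInsight): number {
  const wantedIntent: Intent = brief.goal === "monetize" ? "commercial" : "informational";
  const goalMatch = insight.intent === wantedIntent ? 1.0 : 0.2;
  const keywordMatch = insight.intent === brief.keywordIntent ? 1.0 : 0.5;
  return goalMatch * keywordMatch;
}

// Score = expected CTR x SEO potential x goal alignment.
function scoreRoute(brief: Brief, insight: RouteInsight): number {
  return ctrMidpoint(insight.typical_ctr) * seoWeight(insight.seo_potential) * goalAlignment(brief, insight);
}

// Return the path with the highest score.
function chooseRoute(brief: Brief, insights: Record<string, RouteInsight>): string {
  let best = "";
  let bestScore = -1;
  for (const [path, insight] of Object.entries(insights)) {
    const score = scoreRoute(brief, insight);
    if (score > bestScore) {
      best = path;
      bestScore = score;
    }
  }
  return best;
}

const insights: Record<string, RouteInsight> = {
  "/blog": { typical_ctr: "1-2%", intent: "informational", seo_potential: "high - thought leadership keywords" },
  "/reviews": { typical_ctr: "4-6%", intent: "commercial", seo_potential: "high - product keywords" },
};

const authorityRoute = chooseRoute({ goal: "authority", keywordIntent: "informational" }, insights);
const monetizeRoute = chooseRoute({ goal: "monetize", keywordIntent: "commercial" }, insights);
console.log(authorityRoute, monetizeRoute);
```

With these assumed weights, an authority brief with informational keywords lands on /blog, and a monetization brief with commercial keywords lands on /reviews.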

Benefit:
Optimized conversion paths from day one. I don’t have to wait through six months of A/B testing – I leverage what the niche champions already learned.

Limitation:
Benchmarks are third-party data, not mine yet. Once I have traffic, the analytics-agent will refine decisions using first-party metrics.

Tech Stack: Astro, TypeScript, Tailwind, and Claude

Choosing the right stack was crucial. I didn’t want a heavy JS framework (Next.js) or a bloated CMS (WordPress). I needed:

Outstanding performance (static site, zero unnecessary JS)
Type safety (content validated at build time, not production)
Top-tier developer experience (fast hot reload, reusable components)
Native SEO (automatic sitemap, meta tags, canonical URLs)

The stack I chose:

🚀 Astro 5.14.1 – the main framework for static sites. Why?

Static site generation (SSG) → blazing-fast sites
Zero JS by default – only loads JS when needed
Partial hydration for React – interactive components only where necessary
Whole-site builds in 30–60 seconds

⚛️ React 18.3.1 – only for interactive components. Why?

ThemeToggle (dark mode) needs state
Dialog/Modal (Radix UI) needs interactivity
Everything else is plain Astro (faster)

📘 TypeScript 5.4.5 – type safety everywhere. Why?

Content Collections with Zod schema validate frontmatter
Typed components prevent bugs
Perfect autocomplete in VS Code

🎨 Tailwind CSS 3.4.18 – utility-first CSS. Why?

Consistent design system via CSS tokens
Built-in dark mode (dark: prefix)
Reusable components with CVA (Class Variance Authority)

🤖 Claude 4.5 Sonnet – the brain behind the agents. Why?

Superior reasoning for complex tasks (planning, routing)
Large context window (200k tokens) – digests entire documentation
Tool integration through Claude Code CLI

☁️ Cloudflare Pages – hosting and deploy. Why?

Automatic deploys via Git push
Global edge network (worldwide CDN)
Zero setup required
Free for personal projects

Result: Sites with Lighthouse scores of 95+ in Performance, load under 1 second, and JS bundles of only 50–80 KB (gzipped).

🏗️ Tech Box: Content Collections and Type Safety

One of Astro’s best features is Content Collections – type-safe content management with build-time validation.

How it works:

src/content/config.ts defines the Zod schema:

import { defineCollection, z } from 'astro:content';

const contentSchema = z.object({
  title: z.string(),
  description: z.string().max(160),
  slug: z.string().optional(),
  route: z.string(), // Dynamic routing (v2.0)
  rating: z.number().min(0).max(5).optional(),
  publishDate: z.coerce.date(),
  updateDate: z.coerce.date().optional(),
  category: z.string(),
  tags: z.array(z.string()).min(1),
  author: z.string(),
  featured: z.boolean().default(false),
  affiliateLink: z.string().url().optional(),
  schema: z.record(z.unknown()).optional(),
  faq: z.array(z.object({
    question: z.string(),
    answer: z.string(),
  })).optional(),
});

const articles = defineCollection({
  type: 'content',
  schema: contentSchema,
});

export const collections = { articles };

What this guarantees:

✅ Description never exceeds 160 characters (meta tag limit)
✅ Tags always present (at least one)
✅ Rating, when present, is between 0 and 5 (never 6 or -1)
✅ AffiliateLink, when present, is a valid URL
✅ Route required (dynamic routing v2.0)

Real benefits:

Errors caught at build time, not in production
Autocomplete for fields while writing frontmatter in VS Code
Impossible to publish malformed content – the build fails if frontmatter violates the schema
Safe refactoring – change the schema and the build flags every article that no longer conforms
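
To make the failure mode concrete, here is a dependency-free sketch of a few of the rules the schema encodes. It is not how Zod or Astro implement validation. The validateFrontmatter function and the sample objects are hypothetical, and in the real project the build performs these checks for you.

```typescript
// Returns a list of human-readable issues; an empty list means the frontmatter passes.
function validateFrontmatter(fm: Record<string, unknown>): string[] {
  const issues: string[] = [];
  const { description, route, tags, rating, affiliateLink } = fm;

  // description: required string, at most 160 characters
  if (typeof description !== "string") {
    issues.push("description: expected a string");
  } else if (description.length > 160) {
    issues.push("description: must be at most 160 characters");
  }

  // route: required string
  if (typeof route !== "string") {
    issues.push("route: required");
  }

  // tags: array with at least one entry
  if (!Array.isArray(tags) || tags.length < 1) {
    issues.push("tags: at least one tag is required");
  }

  // rating: optional, but must be a number in 0..5 when present
  if (rating !== undefined && (typeof rating !== "number" || rating < 0 || rating > 5)) {
    issues.push("rating: must be a number between 0 and 5");
  }

  // affiliateLink: optional, but must parse as a URL when present
  if (affiliateLink !== undefined) {
    try {
      new URL(String(affiliateLink));
    } catch {
      issues.push("affiliateLink: must be a valid URL");
    }
  }

  return issues;
}

const validIssues = validateFrontmatter({
  description: "How to use NordVPN with Netflix",
  route: "/coupons",
  tags: ["vpn"],
  rating: 4.5,
  affiliateLink: "https://example.com/offer",
});

const invalidIssues = validateFrontmatter({
  description: "x".repeat(161),
  route: "/reviews",
  tags: [],
  rating: 6,
  affiliateLink: "not a url",
});

console.log(validIssues); // empty: nothing to report
console.log(invalidIssues); // four issues: description, tags, rating, affiliateLink
```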

Path aliases for clean imports:

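// tsconfig.json (mirror the same aliases in astro.config.mjs if your setup needs them there)
{
  "compilerOptions": {
    "baseUrl": ".",
    "paths": {
      "@/*": ["src/*"],
      "@ui/*": ["src/components/ui/*"],
      "@content/*": ["src/content/*"],
      "@design-system/*": ["src/design-system/*"]
    }
  }
}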

Usage:

import { Button } from '@ui/Button'; // Not '../../../components/ui/Button'
import { cn } from '@/lib/utils'; // Not '../../lib/utils'

Outcome: A clean, type-safe codebase with enterprise-grade developer experience.

The End-to-End Flow: From Idea to Deploy in 30 Minutes

Let’s follow a real article from start to finish to see how the agents collaborate.

Scenario: I want to create a piece about “NordVPN for Netflix.”

Phase 1: Planning (5–10 minutes)

Me: “I want to create content about using NordVPN to watch Netflix from other countries.”

Planning-agent conversation:

What’s the objective? (inform vs convert)
Audience? (beginners vs tech-savvy)
Tone? (educational vs promotional)
Structure? (tutorial vs review vs comparison)

Output: creative-brief.json with a unique concept, editorial angle, personalized tone, and custom structure.

Phase 2: Routing (2–3 minutes)

Routing-agent analyzes benchmarks:

VPNMentor uses /coupons/nordvpn-netflix (high conversion)
TechRadar uses /how-to/use-vpn-for-netflix (strong SEO)
My goal: conversion (affiliate)

Decision: /coupons/nordvpn-netflix because conversion > traffic volume.

Output: routing-decision.json with the chosen route and a data-backed justification.

Phase 3: Research (3–5 minutes)

Research-agent gathers data:

Current NordVPN pricing (via WebSearch)
Streaming catalogs per country (official sources)
User reviews (Reddit, Trustpilot)
Technical specs (servers, speed)

Output: research-report.json with verifiable data and sources.

Phase 4: Content Generation (5–8 minutes)

Content-agent writes the article:

Follows the creative brief structure (not a generic template)
Uses data from the research report
Keeps the tone defined in planning
Includes affiliate disclaimers
Uses Lucide icons (not emojis)

Output: Full MDX article with validated frontmatter.

Phase 5: SEO Optimization (2–3 minutes)

SEO-agent tweaks:

Meta description 150–160 characters
Title tag 50–60 characters
Proper heading hierarchy (H1 → H2 → H3)
Keyword density 1–2% (naturally used)
Schema markup (Article structured data)
Internal links to related pieces

Output: Optimized article without losing personality.

Phase 6: Publishing (1–2 minutes, plus 2–5 minutes for deploy)

Publisher-agent:

Saves to src/content/articles/coupons-nordvpn-netflix.md
Updates routing-config.json
Creates a descriptive Git commit
Waits for human approval (critical rule!)
Pushes to main → automatic deploy via Cloudflare Pages

Output: Live article 2–5 minutes after the push.

Total: 20–35 minutes from concept to deploy. 90% automated, 10% human review.

That human review is critical – I validate editorial quality, technical accuracy, and tone before approving the commit.
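
The approval rule is simple enough to express in code. The sketch below is only an illustration of the logic: in AffiliateOS the rule lives in the publisher-agent's prompt, and the PublishRequest shape, the publish function, and the fake push function are all hypothetical.

```typescript
interface PublishRequest {
  file: string; // path of the article to publish
  commitMessage: string;
  approvedBy?: string; // set only after a human has reviewed the article
}

type GitPush = (file: string, message: string) => void;

// Push only when a human approval has been recorded; otherwise refuse loudly.
function publish(request: PublishRequest, push: GitPush): void {
  if (!request.approvedBy) {
    throw new Error(`Refusing to push ${request.file}: no human approval recorded`);
  }
  push(request.file, request.commitMessage);
}

// Usage with a fake push function that only records what would be pushed.
const pushed: string[] = [];
const fakePush: GitPush = (file) => {
  pushed.push(file);
};

const draft: PublishRequest = {
  file: "src/content/articles/coupons-nordvpn-netflix.md",
  commitMessage: "Add NordVPN for Netflix article",
};

let blocked = false;
try {
  publish(draft, fakePush); // no approval yet, so this throws
} catch {
  blocked = true;
}

publish({ ...draft, approvedBy: "human-reviewer" }, fakePush); // approved, so this pushes

console.log(blocked, pushed.length); // true 1
```

Making approval an explicit field means the default state of every article is "not publishable", which matches the intent of the rule.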

Design System: Why OKLCH Beats HSL/RGB

A technical detail that made a real difference: choosing the OKLCH color space instead of HSL or RGB for the design system.

Why it matters:

HSL has a subtle but critical issue: perceived lightness inconsistency. A yellow hsl(60, 100%, 50%) looks much brighter than a blue hsl(240, 100%, 50%), even though both have L=50%.

That breaks dark mode – you have to manually tweak L for every color to balance it.
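
The gap is easy to quantify. The sketch below converts both HSL colors to sRGB and computes their relative luminance (the Rec. 709 weighting that WCAG uses). Relative luminance is only a proxy for perceived brightness, and the function names are mine, but it shows how far apart two colors with the same HSL lightness can be.

```typescript
// Convert HSL (h in degrees, s and l in 0..1) to sRGB channels in 0..1.
function hslToRgb(h: number, s: number, l: number): [number, number, number] {
  const a = s * Math.min(l, 1 - l);
  const channel = (n: number): number => {
    const k = (n + h / 30) % 12;
    return l - a * Math.max(-1, Math.min(k - 3, 9 - k, 1));
  };
  return [channel(0), channel(8), channel(4)];
}

// Undo the sRGB gamma curve for one channel.
function toLinear(c: number): number {
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance: weighted sum of the linear R, G, B channels.
function relativeLuminance([r, g, b]: [number, number, number]): number {
  return 0.2126 * toLinear(r) + 0.7152 * toLinear(g) + 0.0722 * toLinear(b);
}

const yellow = relativeLuminance(hslToRgb(60, 1, 0.5)); // hsl(60, 100%, 50%)
const blue = relativeLuminance(hslToRgb(240, 1, 0.5)); // hsl(240, 100%, 50%)

console.log(yellow.toFixed(3), blue.toFixed(3)); // 0.928 0.072
```

Both colors have L=50% in HSL, yet the yellow carries roughly 13 times the luminance of the blue.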

OKLCH fixes this.

OKLCH (Lightness, Chroma, Hue) was designed for perceptual consistency. L=70% in any color has the same perceived brightness to the human eye.

Real benefits:

✅ Balanced dark mode with zero manual tweaks
✅ Smooth, predictable color transitions
✅ Better accessibility (consistent contrast)
✅ Cleaner code (no magic numbers)

Implementation:

/* src/design-system/tokens.css */
:root {
  /* Light mode */
  --background: oklch(1.0000 0 0);
  --foreground: oklch(0.3588 0.1354 278.6973);
  --primary: oklch(0.6056 0.2189 292.7172);
  --border: oklch(0.9299 0.0334 272.7879);
}

[data-theme="dark"], .dark {
  --background: oklch(0.2077 0.0398 265.7549);
  --foreground: oklch(0.9299 0.0334 272.7879);
  --primary: oklch(0.6056 0.2189 292.7172);
}

Tailwind consumes these tokens via var(--background) once the colors are mapped in the Tailwind theme config.
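
A minimal sketch of that mapping, assuming Tailwind CSS 3.x and a tailwind.config.ts at the project root. Only the four tokens shown in tokens.css are mapped, the content glob is a guess at the project layout, and the optional Config type from tailwindcss is left out to keep the sketch dependency-free.

```typescript
// tailwind.config.ts (sketch)
const config = {
  // Files Tailwind scans for class names (assumed layout).
  content: ["./src/**/*.{astro,md,mdx,ts,tsx}"],
  // Dark variants follow the .dark class used in tokens.css.
  darkMode: "class",
  theme: {
    extend: {
      colors: {
        // Each utility color points at a design token instead of a literal value.
        background: "var(--background)",
        foreground: "var(--foreground)",
        primary: "var(--primary)",
        border: "var(--border)",
      },
    },
  },
};

export default config;
```

With this mapping, utilities such as bg-background and text-foreground resolve to the CSS variables, so toggling the .dark class swaps the palette without touching component classes.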

Other design tokens:

Typography: Roboto (UI), Playfair Display (headings), Fira Code (code)
Border radius: four levels (sm, md, lg, xl) derived from --radius: 0.625rem
Shadows: seven levels (2xs → 2xl) with OKLCH consistency
Dark mode: React toggle with localStorage persistence

🎨 Tech Box: Components with Class Variance Authority

One of the best React component patterns I found: Class Variance Authority (CVA).

CVA lets you define component variants in a typed, composable way.

Example: Button Component

// src/components/ui/Button.tsx
import { cva, type VariantProps } from 'class-variance-authority';
import { cn } from '@/lib/utils';

const buttonVariants = cva(
  'inline-flex items-center justify-center gap-2 whitespace-nowrap rounded-md text-sm font-medium ring-offset-background transition-colors focus-visible:outline-none focus-visible:ring-2 disabled:pointer-events-none disabled:opacity-50',
  {
    variants: {
      variant: {
        default: 'bg-primary text-primary-foreground hover:bg-primary/90',
        destructive: 'bg-destructive text-destructive-foreground hover:bg-destructive/90',
        outline: 'border border-input bg-background hover:bg-accent',
        ghost: 'hover:bg-accent hover:text-accent-foreground',
      },
      size: {
        default: 'h-10 px-4 py-2',
        sm: 'h-9 rounded-md px-3',
        lg: 'h-11 rounded-md px-8',
        icon: 'h-10 w-10',
      },
    },
    defaultVariants: {
      variant: 'default',
      size: 'default',
    },
  }
);

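// Props: every native <button> attribute plus the variant/size options inferred from buttonVariants.
type ButtonProps = React.ComponentProps<'button'> &
  VariantProps<typeof buttonVariants>;

export const Button = ({ variant, size, className, ...props }: ButtonProps) => {
  return (
    <button
      className={cn(buttonVariants({ variant, size, className }))}
      {...props}
    />
  );
};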

Usage:

<Button variant="default" size="lg">View Offer</Button>
<Button variant="outline">Learn More</Button>
<Button variant="ghost" size="icon">×</Button>

Benefits:

✅ Type-safe: TypeScript ensures variant is valid
✅ Composable: combine variants with custom classes via className
✅ Consistent: all button variations share the same base classes
✅ Maintainable: change every button style by editing one file

cn() utility to merge classes:

// src/lib/utils.ts
import { type ClassValue, clsx } from 'clsx';
import { twMerge } from 'tailwind-merge';

export function cn(...inputs: ClassValue[]) {
  return twMerge(clsx(inputs));
}

It combines clsx (conditional classes) with tailwind-merge (resolves Tailwind class conflicts). Essential for components that accept a custom className.

Result: A professional, typed, reusable component library – shadcn/ui quality.

Real Results (No Hype)

The project is still early. I’ll share honest numbers – no embellishments, nothing hidden.

What’s been built so far:

✅ 8 specialized agents (7 implemented, 1 planned)
✅ 3,170 lines of refined prompts
✅ 2 active niches (vpn-saas + jucelinux personal brand)
✅ 1 site live (vpn-reviews-br.pages.dev)
✅ Full stack: Astro + React + TypeScript + Tailwind + Claude
✅ Dynamic routing v2.0 with automated benchmarks
✅ OKLCH design system with dark mode
✅ Full automated pipeline (20–35 minutes concept→deploy)

Production speed:

Initial goal: 45 articles/month
Time per article: 20–35 minutes (automation)
Human review: ~10 minutes per article (quality validation)
Real total: 30–40 minutes per article (including review)

Perceived quality:

Content Collections + Zod: zero malformed articles published
Planning v2.0: killed pasteurization – every article has personality
Lucide icons: professional look
OKLCH design tokens: perfect dark mode with zero tweaks

Challenges faced:

❌ Agent synchronization – ensuring one agent’s output is valid input for the next
❌ Data validation – AI can hallucinate specs if WebSearch isn’t used
❌ SEO vs personality – optimizing without keyword stuffing
❌ Routing decisions without first-party data – relying on third-party benchmarks initially

What worked well:

✅ Collaborative planning – dialogue before writing
✅ Benchmark-driven routing – data-driven decisions from day one
✅ Type safety via Content Collections – build catches issues early
✅ Specialized prompts – higher quality than generic AI
✅ Human guardrails – publisher never commits without approval

What still fails (for now):

⚠️ Analytics-agent not implemented yet – waiting for real traffic
⚠️ Internal linking automation – still manual
⚠️ A/B testing for creative briefs – need more volume first

SEO metrics: waiting for the first 30–60 days of indexing. Realistic goal: 60% of articles ranking in the top 50 within 90 days for long-tail keywords.

Revenue: still zero (waiting on affiliate program approvals). Initial target: R$50–200/month to prove the system works.

Not a success story (yet). It’s an ongoing experiment with promising results.

Lessons Learned: 7 Practical Insights

1. Collaborative Planning Eliminates Pasteurization

Problem: Rigid templates (review.md, comparison.md) made every piece identical.

Solution: Planning-agent talks before writing. Every article has a unique creative brief with a custom structure.

Impact: Higher editorial quality. Sites don’t feel “pasteurized.” Readers notice the authenticity.

2. Benchmark-Based Dynamic Routing Is a Game Changer

Problem: Using /reviews/ for everything ignored the fact that /coupons/ converts 4× better (VPNMentor data).

Solution: Routing-agent analyzes leaders in the niche and makes data-driven URL decisions.

Impact: Conversion rates optimized from day one. No six-month A/B testing cycle.

3. Content Collections + Zod = Type-Safe Content

Problem: Malformed frontmatter broke builds – discovered in production.

Solution: Zod schema validates every frontmatter field at build time. Description > 160 chars? Build fails.

Impact: Zero malformed content bugs in production. Superior developer experience.

4. OKLCH > HSL/RGB for Design Systems

Problem: HSL has perceptual inconsistencies. Yellow L=50% looks brighter than blue L=50%. Dark mode is unbalanced.

Solution: OKLCH ensures consistent perceived lightness. L=70% looks equally bright in any color.

Impact: Balanced dark mode without manual tweaks. Better accessibility. Professional design.

5. Lucide Icons > Emojis for Professionalism

Problem: Emojis render differently across systems (Windows vs macOS vs Linux). Visual inconsistencies.

Solution: Lucide Icons – 400+ consistent, accessible SVG icons.

Impact: Visual consistency. Screen readers understand them. Elevates professionalism.

6. AI Needs Human Guardrails

Problem: Publisher-agent could auto-commit and publish mistakes.

Solution: Critical rule – NEVER commit/push without explicit user approval.

Impact: Controlled quality. Trust in the system. Avoids bad publications.

7. Specialized Prompts > Generic Prompts

Problem: One-size-fits-all prompt (“write an article about X”) yields generic output.

Solution: Eight agents with 3,170 lines of specialized prompts. Each excels at its task.

Impact: Higher-quality output. Fewer regenerations. Maintained consistency.

When (and When Not) to Use AI Agents

AI agents are not a universal solution. Here’s my honest decision framework:

✅ Use AI agents when:

You handle high volumes of repetitive content where each piece still needs to be unique.
Example: 50 product reviews – similar structure but different substance.

You have a clear editorial process that can be codified.
Example: Planning → Research → Writing → SEO → Publish is a repeatable pipeline.

You want to scale without hiring a big team.
Example: One person + eight agents can do the work of three or four people.

You have the expertise to review AI output.
AI produces fast but needs a human to validate quality and accuracy.

You value process consistency.
Agents always follow the process. Humans may skip steps.

❌ DON’T use AI agents when:

You need genuinely original creativity.
AI recombines known knowledge well; it struggles to invent something unheard of.

You produce a small volume of highly personalized content.
For five super-custom pieces per month, a human beats the setup overhead.

You lack the time/expertise to review output.
AI can hallucinate data. If you can’t validate, don’t use it.

You need deep emotional storytelling.
AI shines at technical/informational pieces. Human writing still wins at touching emotions.

You don’t have a clear editorial process.
If you don’t know what you want, AI won’t guess it.

Real trade-offs:

Aspect	Human	AI + Human Review
Speed	1–2 articles/day	15–20 articles/day
Creative quality	High	Medium-high
Consistency	Variable	High
Cost	High (salary)	Medium (API + review)
Required expertise	None (if good writer)	Yes (review output)
Scalability	Hard	Easy

My honest recommendation:

Use AI for scaled production + human review for quality. It’s not “AI replacing humans” – it’s “AI augmenting humans.”

For this project (affiliate marketing with 45 articles/month), AI agents make sense. For a personal blog with two or three deeply reflective posts a month, probably not.

Next Steps and Roadmap

AffiliateOS is actively under construction. Here’s what’s next:

Short term (next 30 days):

[ ] Implement analytics-agent for continuous monitoring
[ ] Produce the first 15 articles in the vpn-saas niche to validate SEO
[ ] Test 3–5 creative briefs and compare engagement
[ ] Apply to affiliate programs (Hostinger, Impact, Amazon)
[ ] Collect the first real traffic data

Mid term (60–90 days):

[ ] Refine routing decisions with first-party conversion data
[ ] Add automatic internal linking across related articles
[ ] Implement A/B testing for headlines and CTAs
[ ] Expand to a third niche (validate system in another vertical)
[ ] Optimize underperforming articles based on analytics

Long term (6–12 months):

[ ] Open source the agents (if validation is positive)
[ ] Automatic content refresh system for older pieces
[ ] Multi-language support (expand to EN and ES)
[ ] CRM integrations for lead nurturing
[ ] Predictive analysis of trending topics

Current limitations:

⚠️ Analytics-agent depends on real traffic (still waiting)
⚠️ Internal linking is still manual (tedious at scale)
⚠️ Content refresh isn’t automated
⚠️ Multi-language needs dedicated agents per language
⚠️ Technical data validation could be more robust

Long-term vision:

Turn AffiliateOS into an open framework for creators and agencies that want to use AI responsibly and effectively in content marketing. It’s not about “replacing writers” – it’s about giving creators superpowers.

Conclusion: Intelligent Automation, Not Replacement

After building eight specialized agents, 3,170 lines of prompts, and running dozens of experiments, here’s my honest takeaway:

AI doesn’t replace content creators. It amplifies them.

AffiliateOS agents are not “autonomous writing robots.” They’re specialized tools that let humans produce more, better, and more consistently.

The planning-agent doesn’t invent strategy alone – it collaborates with me to structure ideas. The content-agent doesn’t write without context – it follows detailed creative briefs. The publisher-agent doesn’t deploy without approval – it waits for human validation.

The magic isn’t in eliminating humans. It’s in specialization.

Each agent masters ONE thing. Planning. Routing. Research. SEO. Publishing. That division of responsibility keeps quality high at scale – impossible with a single generalist AI or generic templates.

For marketing leaders: AI agents can 10× your content output without hiring ten more people. But you still need someone with expertise to review the output, validate accuracy, and ensure brand voice alignment.

For content creators: AI won’t steal your job. It will eliminate repetitive tasks (research, SEO optimization, formatting) and free up time for what humans do best – genuine creativity, emotional storytelling, original insights.

For developers: Architecting specialized agents is a powerful pattern. Don’t build a “super agent that does everything” – build experts that collaborate. Quality comes from specialization.

Want to Use AI to Level Up Your Content Operations?

If you run a marketing team, agency, or content operation and want to explore how AI can multiply results without losing authenticity, let’s talk.

I’m not selling a product. I’m sharing the lessons from building real AI systems applied to marketing.

Reach out if you:

Manage a content team and want to scale without sacrificing quality
Work at a digital marketing agency exploring smart automation
Are a tech lead evaluating AI for content operations
Are a creator looking to scale output without becoming a generic factory

I don’t promise magic results. I promise an honest conversation about what works, what doesn’t, and how to build responsible AI systems.

Stack mentioned:

Astro – Static site framework
Claude AI – AI model (Anthropic)
Tailwind CSS – Utility-first CSS
Lucide Icons – SVG icon system
Radix UI – Headless components
Cloudflare Pages – Hosting and CDN

AI Agents for Marketing: A Real Case Study in Content Automation

jucelinux — Sun, 05 Oct 2025 22:03:30 +0000

The Challenge: 45 Articles a Month Without Becoming a Generic Content Factory

Creating 45 quality articles a month sounds crazy, right? That’s exactly what I thought when I set that goal for AffiliateOS – my experiment in AI-automated affiliate marketing.

The problem isn’t just speed. It’s keeping editorial quality, a unique personality, and honesty at scale. Generic AI tools (Jasper, Copy.ai, and the like) generate content fast, but it all sounds the same – that pasteurized corporate tone nobody wants to read.

The hypothesis I tested: what if, instead of one generalist AI, I built 8 specialized agents? Each with a specific responsibility, working together like a real marketing team.

This article documents the journey. It isn’t a step-by-step tutorial or an inflated success story. It’s an honest account of how I built the system, the technical choices I made, the challenges I faced, and what I learned along the way.

Why Content Automation Is So Hard

Modern content marketing has a cruel dilemma: scale versus quality.

You need to publish consistently to rank on Google. But every article has to be unique, useful, and engaging to convert readers. Templates help with consistency but kill personality. Generic AI is fast but produces soulless content.

Here’s what does NOT work:

Fixed templates: Every review ends up identical. Readers notice. So does Google.
Generic prompts: “Write an article about X” produces shallow, predictable content.
AI without context: It hallucinates technical data, ignores niche nuances, and uses the wrong jargon.
No process: Writing straight away without planning results in weak structure and poor SEO.

What I needed was an opinionated system – one that deeply understood each stage of content creation and maintained quality through specialization.

The Solution: An Architecture of 8 Specialized Agents

Think of each agent as a specialist on a marketing team. The planning-agent is the creative strategist. The content-agent is the writer. The seo-agent is the performance analyst. Each does one thing very well.

Here’s the full architecture:

🧠 Orchestrator Agent – Coordinates the whole pipeline and makes strategic prioritization decisions.

📝 Planning Agent – Talks with you about the article idea and creates a detailed creative brief (concept, angle, tone, custom structure).

🗺️ Routing Agent – Decides the best URL by analyzing niche benchmarks (e.g., /coupons/ converts 4× more than /reviews/ according to VPNMentor data).

🔍 Research Agent – Researches up-to-date product data, analyzes the competition, and validates information before creation.

✍️ Content Agent – Generates the article following the creative brief with a personalized structure (not generic templates).

🎯 SEO Agent – Optimizes meta tags, headings, and keywords without breaking the content’s personality.

🚀 Publisher Agent – Manages Git commits, automatic deploys, and publication validation.

📊 Analytics Agent – Monitors performance and suggests optimizations based on real data (planned – kicks in after initial validation).

The full flow is: Planning → Routing → Research → Content → SEO → Publish → Analytics.

From initial concept to automatic deploy takes 20–35 minutes. But the magic isn’t in the speed – it’s in the quality maintained through specialization.

🛠️ Tech Box: Agent Architecture and Prompts

Each agent is a .md file in .claude/agents/ with a specialized prompt. They aren’t executable scripts – they’re refined prompts that guide the behavior of Claude 4.5 Sonnet through the Claude Code CLI.

Code stats:

8 agents defined (7 implemented, 1 planned)
3,170 lines of specialized prompts
Largest agent: content-agent (790 lines)
Smallest agent: seo-agent (305 lines)

Why specialized prompts > generic prompts:

Specific context: Planning-agent knows requirements-elicitation strategies. SEO-agent knows meta description formulas.
Clear guardrails: Publisher-agent has a critical rule: NEVER commit without explicit approval.
Consistent output: Each agent produces structured JSON/Markdown that the next agent consumes.
Iterative refinement: Prompts evolve with real use. Planning v2.0 is 40% better than v1.0.

Excerpt from the Planning Agent:

## YOUR ROLE
You are the planning-agent of AffiliateOS. Your role is to collaborate
with the user to define a complete creative brief for content creation.

## IMPORTANT PRINCIPLES
1. NEVER suggest generic templates. Each article must have unique structure.
2. ALWAYS ask clarifying questions before assuming intent.
3. BALANCE technical depth with accessibility based on target audience.
4. DEFINE specific tone - "professional" is too vague.

Tools available to the agents:

WebSearch: Look up current information
WebFetch: Extract content from specific URLs
Read/Write: Manipulate project files
Grep/Glob: Search for patterns in the codebase
Bash: Run commands (git, npm, deploy)

Coordination between agents:
The orchestrator-agent manages handoffs. When the planning-agent generates creative-brief.json, the orchestrator checks the JSON structure before passing it to the routing-agent. This prevents cascading errors.
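
A structural check of this kind can be sketched as a TypeScript type guard. This is an illustration under assumptions, not the orchestrator's real mechanism (the orchestrator is a prompt). The CreativeBrief fields are taken from the brief example shown in this article, and isCreativeBrief and handoff are hypothetical names.

```typescript
interface CreativeBrief {
  title: string;
  concept: string;
  angle: string;
  tone: string;
  structure: { type: string; description: string };
}

// True only when the value has every field the next agent relies on.
function isCreativeBrief(value: unknown): value is CreativeBrief {
  if (typeof value !== "object" || value === null) return false;
  const brief = value as Record<string, unknown>;

  const textFields = ["title", "concept", "angle", "tone"];
  const hasText = textFields.every((key) => {
    const field = brief[key];
    return typeof field === "string" && field.trim() !== "";
  });
  if (!hasText) return false;

  const structure = brief.structure;
  if (typeof structure !== "object" || structure === null) return false;
  const parts = structure as Record<string, unknown>;
  return typeof parts.type === "string" && typeof parts.description === "string";
}

// Parse and validate before handing the brief to the next stage.
function handoff(rawJson: string): CreativeBrief {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    throw new Error("creative-brief.json is not valid JSON");
  }
  if (!isCreativeBrief(parsed)) {
    throw new Error("creative-brief.json is missing required fields");
  }
  return parsed;
}

const goodBrief = JSON.stringify({
  title: "NordVPN for Netflix",
  concept: "Explain how to watch other catalogs",
  angle: "Practical guide",
  tone: "Educational, no alarmism",
  structure: { type: "tutorial", description: "Step-by-step setup" },
});

const accepted = handoff(goodBrief);

let rejected = false;
try {
  handoff(JSON.stringify({ title: "Missing everything else" }));
} catch {
  rejected = true;
}

console.log(accepted.structure.type, rejected); // tutorial true
```

Failing at the handoff keeps a malformed brief from reaching routing, research, and writing, where the error would be harder to trace.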

Planning Agent: How the Collaborative Dialogue Works

This is where AffiliateOS v2.0 got interesting. Initially, I had fixed templates (review.md, comparison.md). It worked, but all the content looked the same. Pasteurized sites.

The solution was to create the planning-agent – which talks with you before writing a single line.

How it works in practice:

You: “I want to create content about NordVPN.”

Planning Agent: “Nice! Let me understand better. What’s the goal of this content?”

You: “I want to show how good NordVPN is for privacy.”

Planning Agent: “Great angle. What kind of person do you want to reach? Beginners or tech-savvy users?”

You: “More for beginners – people who worry about privacy but have no technical background.”

Planning Agent: “Perfect. And the tone? Do you prefer something educational and accessible, or more aggressive, like ‘stop being tracked NOW’?”

You: “Educational, no alarmism.”

Planning Agent: “Got it. I’ll create a brief for an educational article about privacy with NordVPN, aimed at beginners, with an accessible tone. Suggested structure: [...]”

The result is a complete creative brief in JSON with:

A unique concept for the article
A specific editorial angle
A personalized tone
A custom structure (not a generic template)
Keywords aligned with intent
Quality guidelines

Real example of the brief generated for THIS article:

{
  "title": "AI Agents for Marketing: A Real Case Study",
  "concept": "Document the journey of building AffiliateOS",
  "angle": "Authentic, educational case study. Not a tutorial or an inflated success story.",
  "tone": "Approachable technical mentor - understands technology but explains it clearly.",
  "structure": {
    "type": "layered",
    "description": "Layer 1 (narrative) accessible to everyone. Layer 2 (tech boxes) for devs."
  }
}

That simple change eliminated pasteurization. Every article now has its own personality, a structure optimized for its specific goal, and a tone suited to its audience.

Routing Agent: Data-Driven URL Decisions

Another crucial lesson: the route matters as much as the content.

When I created the first articles, I used /reviews/ for everything. It made sense semantically, but it ignored a market reality: different URLs convert differently.

Real benchmark example:

Studying VPNMentor (the VPN niche leader) I found:

/coupons/nordvpn converts 4× more than /reviews/nordvpn
/best-vpn-for-streaming ranks better than /vpn-reviews/streaming
/tools/ has 2× the CTR of /resources/ for utility content

The routing-agent automates this analysis. It:

Reviews niche benchmarks (VPNMentor for VPN, NerdWallet for fintech, swyx.io for tech personal brands)
Consults routing-config.json with conversion metrics by route
Chooses the best URL based on keyword intent and conversion potential
Justifies the decision with data (no guesswork)

For THIS article, it chose /blog/ because:

That’s the go-to pattern for personal tech brands (swyx.io, kentcdodds.com, leerob.com)
Aligns with the goal of building authority (not immediate conversion)
SEO-friendly for informational long-tail keywords
Simplicity wins over over-engineering (I don’t need a separate /case-studies/ yet)

🗺️ Tech Box: How the Routing Agent Analyzes Benchmarks

The routing-agent doesn’t invent decisions. It consults the routing-config.json file, which maps site routes to performance metrics.

Structure of routing-config.json:

{
  "active_routes": [
    {
      "path": "/blog",
      "purpose": "Opinion pieces, insights, and technical reflections",
      "benchmark_sites": ["swyx.io", "kentcdodds.com", "leerob.io"],
      "notes": "Authority content - builds personal brand and expertise"
    },
    {
      "path": "/reviews",
      "purpose": "Honest reviews of tools and SaaS",
      "benchmark_sites": ["techradar.com", "theverge.com/reviews"],
      "notes": "Affiliate content - monetization while maintaining authenticity"
    }
  ],
  "route_performance_insights": {
    "/blog": {
      "typical_ctr": "1-2%",
      "intent": "informational",
      "conversion_stage": "awareness-top-funnel",
      "seo_potential": "high - thought leadership keywords"
    },
    "/reviews": {
      "typical_ctr": "4-6%",
      "intent": "commercial",
      "conversion_stage": "consideration-decision",
      "seo_potential": "high - product keywords"
    }
  }
}

Decision algorithm:

Parse creative brief: identify goal (monetize vs build authority), audience, and keywords
Match intent: informational keywords → awareness routes. Commercial keywords → decision routes.
Check benchmarks: see what niche leaders do for similar content
Score routes: expected CTR × SEO potential × goal alignment
Output: chosen route + detailed, data-backed justification

Benefit:
Routes optimized for conversion from day one. I don’t have to wait through six months of A/B testing – I leverage what the niche champions already learned.

Limitation:
Benchmarks are third-party data, not my own. Once there is real traffic, the analytics-agent will refine these decisions using first-party data.

Tech Stack: Astro, TypeScript, Tailwind, and Claude

Choosing the right stack was critical. I didn’t want a heavy JS framework (Next.js) or a complex CMS (WordPress). I needed:

Outstanding performance (static site, zero unnecessary JS)
Type safety (content validated at build time, not in production)
Top-tier developer experience (fast hot reload, reusable components)
Native SEO (automatic sitemap, meta tags, canonical URLs)

The stack I chose:

🚀 Astro 5.14.1 – the main framework for static sites. Why?

Static site generation (SSG) – blazing-fast sites
Zero JS by default – only loads JavaScript when needed
Partial hydration for React – interactive components only where necessary
Whole-site builds in 30–60 seconds

⚛️ React 18.3.1 – only for interactive components. Why?

ThemeToggle (dark mode) needs state
Dialog/Modal (Radix UI) needs interactivity
Everything else is plain Astro (faster)

📘 TypeScript 5.4.5 – type safety everywhere. Why?

Content Collections with Zod schema validate frontmatter
Typed components prevent bugs
Perfect autocomplete in VS Code

🎨 Tailwind CSS 3.4.18 – utility-first CSS. Why?

Consistent design system via CSS tokens
Built-in dark mode (dark: prefix)
Reusable components with CVA (Class Variance Authority)

🤖 Claude 4.5 Sonnet – the AI behind the agents. Why?

Superior reasoning for complex tasks (planning, routing)
Large context window (200k tokens) – processes entire documentation
Tool integration through Claude Code CLI

☁️ Cloudflare Pages – hosting and deploy. Why?

Automatic deploys via Git push
Global edge network (worldwide CDN)
Zero setup required
Free for personal use

Result: Sites with Lighthouse scores of 95+ in Performance, load times under 1 second, and JS bundles of only 50–80 KB (gzipped).

🏗️ Box Técnico: Content Collections e Type Safety

Uma das melhores features do Astro é Content Collections - type-safe content management com validação em build time.

Como funciona:

Arquivo src/content/config.ts define o schema Zod:

import { defineCollection, z } from 'astro:content';

const contentSchema = z.object({
  title: z.string(),
  description: z.string().max(160),
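  // To verify: Astro reserves `slug` in `type: 'content'` collections and may reject it
  // in a schema (ContentSchemaContainsSlugError). Drop this line if the build complains.
  slug: z.string().optional(),
  route: z.string(), // Dynamic routing (v2.0)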
  rating: z.number().min(0).max(5).optional(),
  publishDate: z.coerce.date(),
  updateDate: z.coerce.date().optional(),
  category: z.string(),
  tags: z.array(z.string()).min(1),
  author: z.string(),
  featured: z.boolean().default(false),
  affiliateLink: z.string().url().optional(),
  schema: z.record(z.unknown()).optional(),
  faq: z.array(z.object({
    question: z.string(),
    answer: z.string(),
  })).optional(),
});

const articles = defineCollection({
  type: 'content',
  schema: contentSchema,
});

export const collections = { articles };

What this guarantees:

✅ Description never exceeds 160 characters (the usual meta description length guideline)
✅ Tags always present (at least 1)
✅ Rating, when present, stays between 0 and 5 (never 6 or -1)
✅ AffiliateLink, when present, is always a valid URL
✅ Route is required (dynamic routing v2.0)

Real benefits:

Errors caught at build time, not in production
Field autocomplete in VSCode while writing frontmatter
Malformed content can't be published - the build fails if the frontmatter violates the schema
Safe refactoring - after a schema change, the build flags every article that no longer conforms

Path aliases for clean imports:

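// tsconfig.json, under "compilerOptions" (Astro picks up import aliases from tsconfig)
{
  "baseUrl": ".",
  "paths": {
    "@/*": ["src/*"],
    "@ui/*": ["src/components/ui/*"],
    "@content/*": ["src/content/*"],
    "@design-system/*": ["src/design-system/*"]
  }
}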

Usage:

import { Button } from '@ui/Button'; // Not: '../../../components/ui/Button'
import { cn } from '@/lib/utils'; // Not: '../../lib/utils'

Result: A clean, type-safe codebase with the developer experience of an enterprise application.

The Complete Flow: From Idea to Deploy in 30 Minutes

Let's follow a real article from start to finish to see how the agents work together.

Scenario: I want to create an article about "NordVPN for Netflix".

Phase 1: Planning (5-10 minutes)

Me: "I want to create content about using NordVPN to watch Netflix from other countries."

The planning-agent asks:

What's the goal? (inform vs. convert)
Audience? (beginners vs. tech-savvy)
Tone? (educational vs. promotional)
Structure? (tutorial vs. review vs. comparison)

Output: creative-brief.json with a unique concept, editorial angle, personalized tone, and custom structure.

Phase 2: Routing (2-3 minutes)

The routing-agent analyzes benchmarks:

VPNMentor uses /coupons/nordvpn-netflix (high conversion)
TechRadar uses /how-to/use-vpn-for-netflix (high SEO)
My goal: convert (affiliate)

Decision: /coupons/nordvpn-netflix, because conversion matters more than traffic volume here.

Output: routing-decision.json with the chosen route + data-backed justification.

Phase 3: Research (3-5 minutes)

The research-agent collects data:

Current NordVPN pricing (via WebSearch)
Netflix catalogs by country (official sources)
User reviews (Reddit, Trustpilot)
Technical specs (servers, speed)

Output: research-report.json with verifiable data and sources.

Phase 4: Content Generation (5-8 minutes)

The content-agent writes the article:

Follows the structure from the creative brief (not a generic template)
Uses data from the research report
Keeps the tone defined during planning
Includes the affiliate disclaimer
Uses Lucide icons (not emojis)

Output: Complete article in MDX with validated frontmatter.

Phase 5: SEO Optimization (2-3 minutes)

The SEO-agent optimizes:

Meta description of 150-160 characters
Title tag of 50-60 characters
Hierarchical headings (H1 → H2 → H3)
Keyword density of 1-2% (natural usage)
Schema markup (Article structured data)
Internal links to related articles

Output: Optimized article that keeps its personality.
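The length and density checks from this phase are easy to express in code. The sketch below is only an illustration of those three thresholds. The function names, the report shape, and the way density is computed (words covered by keyword matches divided by total words) are my assumptions, not the SEO-agent's implementation.

```typescript
// Thresholds come from the checklist above; names and shapes are hypothetical.
interface SeoReport {
  titleOk: boolean; // 50-60 characters
  descriptionOk: boolean; // 150-160 characters
  keywordDensity: number; // fraction of words covered by keyword matches
  densityOk: boolean; // 1-2%
}

// Lowercase, strip common punctuation, split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,;:!?()"']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Count exact (possibly multi-word) keyword matches and normalize by word count.
function keywordDensity(body: string, keyword: string): number {
  const words = tokenize(body);
  const kw = tokenize(keyword);
  if (words.length === 0 || kw.length === 0) return 0;
  let hits = 0;
  for (let i = 0; i + kw.length <= words.length; i++) {
    if (kw.every((w, j) => words[i + j] === w)) hits++;
  }
  return (hits * kw.length) / words.length;
}

function checkSeo(
  title: string,
  description: string,
  body: string,
  keyword: string
): SeoReport {
  const density = keywordDensity(body, keyword);
  return {
    titleOk: title.length >= 50 && title.length <= 60,
    descriptionOk: description.length >= 150 && description.length <= 160,
    keywordDensity: density,
    densityOk: density >= 0.01 && density <= 0.02,
  };
}
```

A check like this only flags mechanical problems. Whether the keyword usage reads naturally is still a judgment call for the human review.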

Phase 6: Publishing (1-2 minutes + 2-5 for deploy)

The publisher-agent:

Saves to src/content/articles/coupons-nordvpn-netflix.md
Updates routing-config.json
Creates a descriptive Git commit
Waits for human approval (critical rule!)
Pushes to main → automatic Cloudflare Pages deploy

Output: Article live 2-5 minutes after the push.

Total: roughly 20-35 minutes from concept to deploy. 90% automated, 10% human review.
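The shape of this flow, where each phase consumes the previous phase's artifact and nothing is pushed without approval, can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names (Stage, runPipeline, the artifact fields). The real agents run as separate prompts and exchange JSON files rather than in-memory objects.

```typescript
// Hypothetical types: each stage takes the accumulated artifact and returns an extended one.
type Artifact = Record<string, unknown>;

interface Stage {
  name: string;
  run: (input: Artifact) => Artifact;
}

interface Outcome {
  status: "published" | "rejected";
  artifact: Artifact;
  trail: string[]; // stage names in the order they ran
}

function runPipeline(
  stages: Stage[],
  idea: Artifact,
  approve: (draft: Artifact) => boolean, // the human gate
  publish: (draft: Artifact) => void // commit + push would happen here
): Outcome {
  const trail: string[] = [];
  let artifact = idea;
  for (const stage of stages) {
    artifact = stage.run(artifact);
    trail.push(stage.name);
  }
  // Critical rule: never publish without explicit approval.
  if (!approve(artifact)) {
    return { status: "rejected", artifact, trail };
  }
  publish(artifact);
  return { status: "published", artifact, trail };
}

// Stand-in stages that only record which artifact each phase would produce.
const stages: Stage[] = [
  { name: "planning", run: (a) => ({ ...a, brief: "creative-brief.json" }) },
  { name: "routing", run: (a) => ({ ...a, route: "/coupons/nordvpn-netflix" }) },
  { name: "research", run: (a) => ({ ...a, report: "research-report.json" }) },
  { name: "content", run: (a) => ({ ...a, draft: "article.mdx" }) },
  { name: "seo", run: (a) => ({ ...a, optimized: true }) },
];
```

Keeping the approval check in the orchestrator, rather than inside the publishing step, means no stage can skip it by accident.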

Human review is critical - I validate editorial quality, the accuracy of technical data, and tone before approving the commit.

Design System: Why OKLCH Beats HSL/RGB

One technical detail that made a difference: choosing the OKLCH color space over HSL or RGB for the design system.

Why does this matter?

HSL has a subtle but critical problem: perceptually inconsistent lightness. A yellow hsl(60, 100%, 50%) looks much lighter than a blue hsl(240, 100%, 50%), even though both have L=50%.

This breaks dark mode - you have to manually adjust the L of each color so they look balanced.
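To put a number on that gap, the sketch below converts both HSL colors to sRGB and computes their relative luminance using the WCAG 2 formula. The helper names are mine, and relative luminance is used here only as a convenient stand-in for perceived brightness. It is not the OKLCH lightness value itself.

```typescript
// Convert HSL (h in degrees, s and l in 0..1) to sRGB channels in 0..1.
function hslToRgb(h: number, s: number, l: number): [number, number, number] {
  const c = (1 - Math.abs(2 * l - 1)) * s; // chroma
  const hp = (((h % 360) + 360) % 360) / 60; // hue sector, 0..6
  const x = c * (1 - Math.abs((hp % 2) - 1));
  const m = l - c / 2;
  const rgb: [number, number, number] =
    hp < 1 ? [c, x, 0] :
    hp < 2 ? [x, c, 0] :
    hp < 3 ? [0, c, x] :
    hp < 4 ? [0, x, c] :
    hp < 5 ? [x, 0, c] : [c, 0, x];
  return [rgb[0] + m, rgb[1] + m, rgb[2] + m];
}

// WCAG 2 relative luminance of an sRGB color (channels in 0..1).
function relativeLuminance([r, g, b]: [number, number, number]): number {
  const linear = (v: number): number =>
    v <= 0.03928 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
  return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b);
}

const yellow = relativeLuminance(hslToRgb(60, 1, 0.5)); // about 0.93
const blue = relativeLuminance(hslToRgb(240, 1, 0.5)); // about 0.07

console.log({ yellow, blue });
```

Both colors share L=50% in HSL, yet the yellow comes out more than ten times as luminous as the blue.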

OKLCH solves this.

OKLCH (Lightness, Chroma, Hue) was designed for perceptual uniformity. L=70% has approximately the same perceived lightness in any hue.

Real benefits:

✅ Balanced dark mode without manual adjustments
✅ Smooth, predictable color transitions
✅ Better accessibility (more predictable contrast, though contrast ratios should still be checked)
✅ Cleaner code (fewer magic numbers)

Implementation:

/* src/design-system/tokens.css */
:root {
  /* Light mode */
  --background: oklch(1.0000 0 0); /* Pure white */
  --foreground: oklch(0.3588 0.1354 278.6973); /* Dark purple */
  --primary: oklch(0.6056 0.2189 292.7172); /* Vibrant purple */
  --border: oklch(0.9299 0.0334 272.7879); /* Light gray */
}

[data-theme="dark"], .dark {
  --background: oklch(0.2077 0.0398 265.7549); /* Very dark purple */
  --foreground: oklch(0.9299 0.0334 272.7879); /* Light gray */
  --primary: oklch(0.6056 0.2189 292.7172); /* Vibrant purple (same!) */
}

Tailwind CSS consumes these tokens via var(--background) once the colors are mapped to the variables in the Tailwind config.

Other design system tokens:

Typography: Roboto (UI), Playfair Display (headings), Fira Code (code)
Border radius: 4 levels (sm, md, lg, xl) based on --radius: 0.625rem
Shadows: 7 levels (2xs → 2xl) with OKLCH consistency
Dark mode: React toggle persisted in localStorage

🎨 Technical Box: Components with Class Variance Authority

One of the best React component patterns I've found: Class Variance Authority (CVA).

CVA lets you define component variants in a typed, composable way.

Example: Button Component

// src/components/ui/Button.tsx
import type { ButtonHTMLAttributes } from 'react';
import { cva, type VariantProps } from 'class-variance-authority';
import { cn } from '@/lib/utils';

const buttonVariants = cva(
  // Base classes (always applied)
  'inline-flex items-center justify-center gap-2 whitespace-nowrap rounded-md text-sm font-medium ring-offset-background transition-colors focus-visible:outline-none focus-visible:ring-2 disabled:pointer-events-none disabled:opacity-50',
  {
    variants: {
      variant: {
        default: 'bg-primary text-primary-foreground hover:bg-primary/90',
        destructive: 'bg-destructive text-destructive-foreground hover:bg-destructive/90',
        outline: 'border border-input bg-background hover:bg-accent',
        ghost: 'hover:bg-accent hover:text-accent-foreground',
      },
      size: {
        default: 'h-10 px-4 py-2',
        sm: 'h-9 rounded-md px-3',
        lg: 'h-11 rounded-md px-8',
        icon: 'h-10 w-10',
      },
    },
    defaultVariants: {
      variant: 'default',
      size: 'default',
    },
  }
);

// Props: native <button> attributes plus the variant/size options inferred from buttonVariants
type ButtonProps = ButtonHTMLAttributes<HTMLButtonElement> &
  VariantProps<typeof buttonVariants>;

export const Button = ({ variant, size, className, ...props }: ButtonProps) => {
  return (
    <button
      className={cn(buttonVariants({ variant, size, className }))}
      {...props}
    />
  );
};

Usage:

<Button variant="default" size="lg">Ver Oferta</Button>
<Button variant="outline">Saiba Mais</Button>
<Button variant="ghost" size="icon">×</Button>

Benefits:

✅ Type-safe: TypeScript guarantees that variant is a valid value
✅ Composable: Combines variants with custom classes via className
✅ Consistent: Every Button variation uses the same base classes
✅ Maintainable: Changing the style of all buttons = editing one file

The cn() utility for merging classes:

// src/lib/utils.ts
import { type ClassValue, clsx } from 'clsx';
import { twMerge } from 'tailwind-merge';

export function cn(...inputs: ClassValue[]) {
  return twMerge(clsx(inputs));
}

It combines clsx (conditional classes) with tailwind-merge (resolves conflicting Tailwind classes). Essential for components that accept a custom className.

Result: A professional, typed, reusable component library - shadcn/ui quality.

Real Results (No Hype)

The project is at an early stage. I'll share honest numbers - no inflating, no hiding.

What has been built so far:

✅ 8 specialized agents (7 implemented, 1 planned)
✅ 3,170 lines of refined prompts
✅ 2 active niches (vpn-saas + jucelinux personal brand)
✅ 1 deployed site (vpn-reviews-br.pages.dev)
✅ Complete stack: Astro + React + TypeScript + Tailwind + Claude
✅ Dynamic routing v2.0 with automated benchmarks
✅ OKLCH design system with dark mode
✅ Complete automated pipeline (20-35 minutes concept→deploy)

Production speed:

Initial target: 45 articles/month
Time per article: 20-35 minutes (automation)
Human review: ~10 minutes per article (quality validation)
Realistic total: 30-40 minutes per article (including review)
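As a quick sanity check on what that target means in hours, here is the arithmetic as a small sketch. The function name is hypothetical. The inputs are the figures listed above.

```typescript
// Convert a per-article time range (minutes) into a monthly workload range (hours).
function monthlyHours(
  articlesPerMonth: number,
  minMinutes: number,
  maxMinutes: number
): [number, number] {
  return [
    (articlesPerMonth * minMinutes) / 60,
    (articlesPerMonth * maxMinutes) / 60,
  ];
}

// 45 articles at 30-40 minutes each
const [lowHours, highHours] = monthlyHours(45, 30, 40);
console.log(`${lowHours}-${highHours} hours per month`);
```

At 45 articles a month, the realistic total works out to between 22.5 and 30 hours of combined pipeline and review time.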

Perceived quality:

Content Collections with Zod: zero malformed articles published
Planning v2.0: eliminated the cookie-cutter effect - each article has its own personality
Lucide icons: professional visual consistency
OKLCH design tokens: balanced dark mode without manual adjustments

Challenges faced:

❌ Synchronization between agents - making sure one agent's output is valid input for the next
❌ Data validation - the AI can hallucinate technical specs if it doesn't use WebSearch
❌ SEO vs. personality - optimizing without keyword stuffing
❌ Routing decisions without first-party data - relying on third-party benchmarks at first

What worked well:

✅ Collaborative planning - it asks questions before writing
✅ Benchmark-based routing - data-driven decisions from day 1
✅ Type safety via Content Collections - bugs caught at build time
✅ Specialized prompts - better quality than generic AI
✅ Human guardrails - the publisher never pushes without approval

What hasn't worked (yet):

⚠️ Analytics-agent not implemented yet - I need real traffic first
⚠️ Automatic internal linking - still manual
⚠️ A/B testing of creative briefs - I need volume to validate

SEO metrics: Waiting for the first 30-60 days of indexing. Realistic target: 60% of articles ranking in the top 50 within 90 days for long-tail keywords.

Revenue: Zero so far (waiting for affiliate program approvals). Initial target: R$50-200/month as validation that the system works.

This is not a success story (yet). It's an ongoing experiment with promising results.

Lessons Learned: 7 Practical Insights

1. Collaborative Planning Eliminates Cookie-Cutter Content

Problem: Fixed templates (review.md, comparison.md) made all content look identical.

Solution: The planning-agent asks questions before writing. Each article gets a unique creative brief with a custom structure.

Impact: Better editorial quality. Sites don't feel mass-produced. Readers notice the authenticity.

2. Benchmark-Based Dynamic Routing Is a Game-Changer

Problem: Using /reviews/ for everything ignored the fact that /coupons/ converts 4x better (VPNMentor data).

Solution: The routing-agent analyzes benchmarks from niche leaders and makes data-driven URL decisions.

Impact: Conversion rates optimized from day 1. I don't need 6 months of A/B testing.

3. Content Collections + Zod = Type Safety for Content

Problem: Malformed frontmatter broke builds, or worse, was only discovered in production.

Solution: The Zod schema validates all frontmatter at build time. Description > 160 chars? The build fails.

Impact: Zero malformed-content bugs in production. Better developer experience.

4. OKLCH > HSL/RGB for Design Systems

Problem: HSL is perceptually inconsistent. Yellow at L=50% looks lighter than blue at L=50%. Dark mode ends up unbalanced.

Solution: OKLCH is built for perceptually consistent lightness. L=70% looks about equally light in any hue.

Impact: Balanced dark mode without manual adjustments. Better accessibility. Professional design.

5. Lucide Icons > Emojis for Professionalism

Problem: Emojis render differently across systems (Windows vs. Mac vs. Linux). Visual inconsistency.

Solution: Lucide Icons - 400+ consistent, accessible SVG icons.

Impact: Visual consistency. Screen readers can handle them. A more professional look.

6. AI Needs Human Guardrails

Problem: The publisher-agent could make automatic commits and publish errors.

Solution: Critical rule - NEVER commit/push without the user's explicit approval.

Impact: Controlled quality. Trust in the system. Bad publications avoided.

7. Specialized Prompts > Generic Prompts

Problem: A single "write an article about X" prompt produces generic content.

Solution: 8 agents with 3,170 lines of specialized prompts. Each one masters its own function.

Impact: Higher-quality output. Fewer regenerations. Consistency maintained.

When (and When Not) to Use AI Agents

AI agents are not a universal solution. Here is my honest decision framework:

✅ Use AI agents when:

You have a high volume of repetitive content where each piece still needs to be unique.
Example: Reviews of 50 products - similar structure, but each one is different.

You have a clear editorial process that can be codified.
Example: Planning → Research → Writing → SEO → Publish is a repeatable flow.

You want to scale without hiring a large team.
Example: 1 person + 8 agents produce what 3-4 people would.

You have the expertise to review AI output.
AI generates fast but needs a human to validate quality and accuracy.

You value process consistency.
Agents always follow the process. Humans can forget steps.

❌ Do NOT use AI agents when:

You need genuinely original creativity.
AI is great at recombining knowledge and bad at creating something never seen before.

You have a low volume of highly personalized content.
For 5 heavily customized articles a month, a human is better than an agent setup.

You don't have the time/expertise to review output.
AI can hallucinate data. If you can't validate it, don't use it.

You want deeply emotional content.
AI handles technical/informational content well. Content that touches the soul? The human wins.

You don't have a clear editorial process.
If you don't know what you want yourself, AI won't guess it.

Real tradeoffs:

Aspect	Human	AI + Human Review
Speed	1-2 articles/day	15-20 articles/day
Creative quality	High	Medium-high
Consistency	Varies	High
Cost	High (salary)	Medium (API + review)
Expertise required	No (if a good writer)	Yes (to review output)
Scalability	Hard	Easy

My honest recommendation:

Use AI for production at scale + human review for quality. It's not "AI replacing humans" - it's "AI as a multiplier for humans".

For this project (affiliate marketing with 45 articles/month), AI agents make sense. For a personal blog with 2-3 deeply reflective posts a month, probably not.

Next Steps and Roadmap

AffiliateOS is under active construction. Here's what comes next:

Short term (next 30 days):

[ ] Implement the analytics-agent for continuous monitoring
[ ] Generate the first 15 articles in the vpn-saas niche for SEO validation
[ ] Test 3-5 different creative briefs and compare engagement
[ ] Sign up for affiliate programs (Hostinger, Impact, Amazon)
[ ] Collect the first real traffic data

Medium term (60-90 days):

[ ] Refine routing decisions with first-party conversion data
[ ] Add automatic internal linking between related articles
[ ] Implement A/B testing of headlines and CTAs
[ ] Expand to a 3rd niche (validate the system in another vertical)
[ ] Optimize underperforming articles based on analytics

Long term (6-12 months):

[ ] Open-source the agents (if validation is positive)
[ ] Automatic content refresh system (update old articles)
[ ] Multi-language support (expand to EN and ES)
[ ] Integration with CRMs for lead nurturing
[ ] Predictive analysis of trending topics

Current limitations to address:

⚠️ The analytics-agent depends on real traffic (waiting)
⚠️ Internal linking is still manual (tedious at scale)
⚠️ Refreshing old content is not automated
⚠️ Multi-language needs language-specific agents
⚠️ Validation of technical data could be more robust

Long-term vision:

Turn AffiliateOS into an open framework for content creators and agencies who want to use AI responsibly and effectively. It's not "replacing writers" - it's "giving creators superpowers".

Conclusion: Intelligent Automation, Not Replacement

After building 8 specialized agents, writing 3,170 lines of prompts, and running dozens of experiments, here is my honest conclusion:

AI doesn't replace content creators. It multiplies them.

The AffiliateOS agents are not "autonomous robot writers". They are specialized tools that help humans produce more, better, and more consistently.

The planning-agent doesn't invent strategies on its own - it talks with me to structure my ideas. The content-agent doesn't write without context - it follows detailed creative briefs. The publisher-agent doesn't deploy without approval - it waits for human validation.

The magic isn't in eliminating humans. It's in specialization.

Each agent masters ONE thing. Planning. Routing. Research. SEO. Publishing. This division of responsibilities keeps quality up at scale - something that is hard to get from a generalist AI or generic templates.

For marketing managers: AI agents can 10x your content output without hiring 10 people. But you still need someone with the expertise to review output, validate accuracy, and ensure alignment with the brand voice.

For content creators: AI won't steal your job. It will eliminate repetitive tasks (research, SEO optimization, formatting) and free up time for what humans do best - genuine creativity, emotional storytelling, original insights.

For developers: An architecture of specialized agents is a powerful pattern. Don't try to build a "super agent that does everything" - create specialists that collaborate. Quality comes from specialization.

Want to Use AI to Scale Content Operations?

If you manage a marketing team, agency, or content operation and want to explore how AI can multiply results without losing authenticity, let's talk.

I'm not selling a product. I'm sharing lessons from someone building real systems that apply AI to marketing.

Get in touch if you:

Manage a content team and want to scale without sacrificing quality
Work at a digital marketing agency exploring intelligent automation
Are a tech lead evaluating AI for content marketing operations
Are a content creator looking to multiply output without becoming a generic factory

I don't promise magic results. I promise an honest conversation about what works, what doesn't, and how to build responsible AI systems.

Stack mentioned:

Astro - Static site framework
Claude AI - AI model (Anthropic)
Tailwind CSS - Utility-first CSS
Lucide Icons - SVG icon system
Radix UI - Headless components
Cloudflare Pages - Hosting and CDN

Lessons Learned with AWS Bedrock Data Automation in an IDP Product

jucelinux — Sun, 14 Sep 2025 00:19:11 +0000

Introduction

This article kicks off a series where I share lessons learned as an AI leader at a FinTech. Over the past few months, I’ve been involved in building an Intelligent Document Processing (IDP) product focused on processing payment receipts. Our goal was to extract structured information from PDFs and images in a reliable and scalable way.

To speed up development, we relied on AWS Bedrock Data Automation (BDA). The tool delivered some good results, but it also revealed unexpected limitations. Below, I share three challenges we faced—without overlooking the positives: acknowledging BDA’s strengths while exposing the pitfalls we encountered in real-world use.

Why We Chose BDA

BDA is an AWS service designed to automate data extraction from documents and images. A key feature for our use case is the creation of blueprints, artifacts that define how data should be extracted, normalized, and formatted. BDA offers two approaches for creating blueprints:

Manual (via JSON Schema): developers explicitly define the structure.
Automatic (via prompt): describe in natural language which fields to extract and how to handle them.

The second option seemed irresistible. Almost instantly, we had blueprints running with promising outputs for payment receipt extraction. That convenience—combined with the frequent updates AWS pushed during development—convinced us to go down this path.

Everything seemed fine at first. The functionality was live in production, and extractions were performing as expected. But eventually, we stumbled into a cost anomaly, and after a deeper analysis of the system’s behavior, we uncovered issues that had gone unnoticed during development.

Pitfall #1 – Blueprints Created via Prompt Are Always Document Type

One thing that took us a while to realize: every blueprint created via prompt is classified as Document, even if the base content is actually an image. During tests and development, this characteristic went unnoticed since it didn’t seem to affect extraction accuracy.

Later on, however, it came back to bite us—this time as a cost anomaly.

Pitfall #2 – Wrong Modality and Financial Impact

BDA has a feature called modality routing, which defines whether a file is processed as a document or an image. While it might sound like a minor technical detail, this setting directly affects costs:

Document: USD 0.040 per page
Image: USD 0.005 per image

Since all our blueprints were created via prompt, everything was treated as Document. During early testing, we didn’t notice a difference. But by August, reality hit: almost USD 1,000 in unexpected extra charges.
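To get a feel for the volume behind that number, here is a back-of-the-envelope sketch. It uses only the two prices quoted above and assumes every misrouted file is a single image billed as one document page. The function names are made up for the illustration.

```typescript
// Prices as quoted in this article (USD); they may differ from current AWS pricing.
const DOCUMENT_PRICE_PER_PAGE = 0.04;
const IMAGE_PRICE_PER_IMAGE = 0.005;

// Extra cost of processing `imageCount` single-image files under the Document modality.
function misroutingOverhead(imageCount: number): number {
  return imageCount * (DOCUMENT_PRICE_PER_PAGE - IMAGE_PRICE_PER_IMAGE);
}

// Number of misrouted images needed to accumulate a given overcharge.
function imagesForOverhead(overheadUsd: number): number {
  return Math.round(
    overheadUsd / (DOCUMENT_PRICE_PER_PAGE - IMAGE_PRICE_PER_IMAGE)
  );
}

console.log(misroutingOverhead(1000)); // overhead for 1,000 misrouted images
console.log(imagesForOverhead(1000)); // images needed to reach USD 1,000 extra
```

Under that assumption, each misrouted image costs USD 0.035 more than it should, so an overcharge close to USD 1,000 corresponds to somewhere under 29,000 images in the month.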

After investigating, we understood the reason: auto-generated blueprints don’t respect the image modality. The fix was to recreate them manually using JSON Schema, properly configuring Document vs. Image, and enabling routing.

I’ll own part of this: we could have validated the behavior earlier. But the critique of AWS stands—the documentation does not clearly highlight this limitation. That lack of clarity can lead to costly mistakes.

Pitfall #3 – Average Latency of 30 Seconds

The third issue was latency. Each BDA process took about 30 seconds on average.

From an architectural perspective, this was manageable. We built the system around asynchronous, event-driven flows, so the backend could handle the delay. But from the user’s perspective, 30 seconds feels painfully slow.

When a customer uploads a receipt, waiting half a minute for results undermines the experience. To address this, we adapted the application, but naturally we compared alternatives. In some cases, pipelines using Textract + LLMs or OCR + post-processing delivered similar results with lower latency and comparable costs.

The Positive Side of BDA

Despite these challenges, BDA did provide real benefits:

Abstracts complexity: we didn’t have to build an IDP pipeline from scratch.
Fast initial delivery: we quickly deployed a working prototype.
Lower barrier to entry: smaller teams can experiment with IDP without heavy infrastructure investment.

In our case, BDA acted as an initial accelerator. Without it, launching the first version would have taken longer. The key takeaway is knowing when it makes sense to rely on it and when alternative approaches may be more effective.

Conclusion

After three months with AWS Bedrock Data Automation, these were our main takeaways:

Blueprints created via prompt are always Document type.
Incorrect modality settings can lead to major cost overruns.
30-second latency hurts the user experience.

These issues are not clearly stated in the documentation. Only real-world usage exposed them. For me, this reinforces an important lesson: AI leadership requires not only enthusiasm for new solutions but also critical thinking to weigh trade-offs.

BDA has value, but it’s not a universal solution. Knowing when to adopt it—and when to replace it—is part of technical maturity.

In upcoming articles, I plan to share more behind-the-scenes experiences from AI projects in the financial sector. Hopefully, these insights help others avoid similar pitfalls and make more informed decisions.

Lessons from AWS Bedrock Data Automation in an IDP Product

jucelinux — Sun, 14 Sep 2025 00:15:50 +0000

Introduction

This article opens a series in which I share lessons learned as an AI leader at a FinTech. Over the past few months, I've been involved in building an Intelligent Document Processing (IDP) product for processing payment receipts. Our goal was to extract structured information from PDFs and images reliably and at scale.

To speed up development, we bet on AWS Bedrock Data Automation (BDA). The tool brought us good results, but it also revealed unexpected limitations. Below I share three challenges we faced, without setting the good results aside: acknowledging BDA's strengths and exposing the pitfalls we found in real use.

Why did we choose BDA?

BDA is an AWS service for automating data extraction from documents and images. A useful feature for our use case is the creation of blueprints, artifacts that describe how data should be extracted, normalized, and formatted. BDA offers two ways to create blueprints:

Manual (via JSON Schema): the developer defines the whole structure.
Automatic (via prompt): you simply describe in natural language which fields to extract and how to handle them.

The second option looked irresistible, and it was the lazy one. Almost instantly we had working blueprints and promising outputs for receipt data extraction. That ease, together with the frequent updates released during development, convinced us to take this path.

Everything seemed fine. The feature was in production and extractions were performing as expected. At some point, however, we ran into a cost anomaly, and a closer look at the behavior surfaced problems we had not noticed during development.

Pitfall #1 – Blueprints created via prompt always use the Document modality

Something that took us a while to identify: every blueprint created via prompt is treated as a Document, even when the base file for that blueprint is an image. During testing and development this went unnoticed, with no apparent impact on our extractions.

It hit us later, and in a much more serious way: as a cost anomaly!

Pitfall #2 – Wrong modality and financial impact

BDA has a modality routing feature, which defines whether a file is processed as a document or as an image. It looks like a purely technical setting, but it directly affects the cost of the solution, since AWS charges as follows:

Document: USD 0.040 per page.
Image: USD 0.005 per image.

Because we created the blueprints via prompt, all of them were treated as Documents. During testing we didn't notice a difference, but in August the worst happened: almost USD 1,000 extra on the bill.

After investigating, we understood the reason: automatically created blueprints don't respect the image modality. With the cause identified, the fix was to recreate them manually with JSON Schema, correctly configuring Document or Image, and to enable the routing setting.

Recognizing failures and being better prepared to avoid mistakes is what makes us grow. So I'll admit that we could have been quicker to validate this behavior, but the criticism of AWS stands: the documentation doesn't highlight such an important detail. That lack of clarity can cause real losses.

Pitfall #3 – Average latency of 30 seconds

The third point was latency. Each BDA job took around 30 seconds.

Architecturally, we handled this well. We implemented asynchronous, event-based flows, which let the system wait without blocking. From the user's point of view, though, 30 seconds feels slow.

A customer who uploads a receipt has to wait half a minute for a response. To mitigate this we adapted our application, but comparing with alternatives was inevitable. In some cases, pipelines with Textract + LLM or OCR + post-processing achieve lower latency at a similar cost.

The positive side of BDA

Despite the difficulties, BDA brought clear benefits:

Abstraction of complexity: we didn't start from scratch when building the IDP.
Initial speed: we quickly put a prototype into production.
Easier access: smaller teams can experiment with IDP without setting up the whole infrastructure.

In our case, it worked as an initial accelerator. Without BDA, it would have taken us longer to launch the first version. The key point is knowing how to judge when it makes sense and when other alternatives are more advantageous.

Conclusion

After three months of use, these were our main lessons:

Blueprints created via prompt are always Documents.
A misconfigured modality can make costs spike.
30-second latency hurts the user experience.

These points are not clearly stated in the documentation. Only practice revealed their impact. This reinforces an important lesson: leading in AI requires both enthusiasm for new solutions and the critical sense to evaluate trade-offs.

BDA has value, but it's not a universal solution. Knowing when to adopt it and when to replace it is part of technical maturity.

In upcoming articles, I plan to share more behind-the-scenes stories from AI projects in the financial market. I hope these insights help others avoid similar pitfalls and inspire more informed decisions.

Testronaut: 🤖 O Agente que Faz os Testes que Eu Me Recuso a Fazer 🚀

jucelinux — Tue, 29 Apr 2025 12:33:28 +0000

TL;DR

I built Testronaut, an autonomous AI testing agent, in just a few weeks.

It turns natural-language descriptions such as "Log in to the app and check the home page" into structured, automated web tests using Playwright MCP.

In this article, I'll talk a bit about:

Why local models (such as Llama3.2 and Qwen2.5) disappointed me
How mastering the fundamentals beat using trendy frameworks
How Google ADK helped me build a system of collaborating agents
How Testronaut plans, executes, and documents everything automatically

And most importantly:

The work has only just begun: I'm building the foundation for what could become an open-source AI testing platform.

🔗 See the full Testronaut repository here!

Introduction

In recent weeks, Microsoft released Playwright MCP, an innovative extension of the Playwright ecosystem that opened up many possibilities for automation.

Unlike traditional browser automation tools, which require hand-written scripts, Playwright MCP exposes the browser through the Model Context Protocol (MCP): a way for AI agents and external systems to interact with the browser through structured commands, enabling a new level of abstraction and control.
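To make "structured commands" concrete, here is a minimal sketch of the kind of message an MCP client sends to invoke a tool. The envelope (JSON-RPC 2.0 with the `tools/call` method) follows the MCP specification. The helper `build_tool_call` is a hypothetical name, and the tool `browser_navigate` with its `url` argument reflects what the Playwright MCP server listed when I wrote this, so treat it as an assumption and check the server's own tool list.

```python
import json


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tool invocation as a JSON-RPC 2.0 request."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message, indent=2)


# Assumed tool name and argument; the agent framework normally builds this for you.
print(build_tool_call(1, "browser_navigate", {"url": "https://example.com/"}))
```

The point is that the agent never writes selectors or scripts: it picks a tool name and fills in arguments, and the MCP server turns that into browser actions.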

As soon as I read about it, an idea came to me immediately:

"What if I could finally create an agent that tests my applications for me, without writing a single line of Selenium or Playwright code?"

That idea marked the beginning of Testronaut, an autonomous AI agent designed to turn natural-language intentions into real, documented web tests.

In this article, I want to share how I built it in a few weeks, the lessons learned along the way, and why I believe we are entering a new era for software testing.

Defining the Requirements

One of Testronaut's core premises was clear from the start:

Use open-source tools, keep it free, and run it locally.

Initially, I imagined I could handle the whole automation flow using Ollama and local models on my machine.

It seemed perfect: low cost, full control, and no dependency on cloud APIs.

However, after several days of struggling with models such as llama3.2:3b, llama3.1, and qwen2.5:14b, reality set in:

Most of the challenges were not caused by the agent's logic, but by the performance limitations of local models.

They simply could not reason fast enough, failed to follow multi-step plans, or consumed too many resources under real load.

I explored a few popular frameworks.

Each offered interesting ideas and tools, but I soon realized something essential:

The biggest challenge is not choosing the right framework. It is mastering the fundamentals: understanding how agents think, reason, and interact.

(Obvious? Maybe. Painful to learn in practice? Definitely.)

Learning the Hard Way (No Shortcuts)

Just as we did before the ChatGPT era, I went back to the fundamentals:

I opened the Google ADK documentation (my final choice of framework) and decided to really understand:

How autonomous agents work
How Google ADK can structure a real solution
How to integrate tools properly (instead of just plugging in a model and hoping for magic)

It felt like ancient craftsmanship, like the Aztecs building something solid, stone by stone, with no shortcuts.

Reading, experimenting, and failing (a lot) became my routine.

And little by little, the picture became clear:

Building a reliable agent was not about fancy prompts or giant models. It was about designing a brain that could:

Interpret intentions clearly
Plan actions logically
Execute the steps carefully
Document every move precisely

In other words, it was not about "creating a smart chatbot" but about "building a serious, disciplined autonomous agent."

Defining the Solution

Once I understood how to build autonomous agents, the next step was deciding what exactly Testronaut should do.

With a clear understanding of the platform's purpose, I could define its core features:

1. Creating a Test Plan from a Natural-Language Intention

Testronaut was born to help development teams validate critical user flows, experience journeys, or even simple scripts such as:

"Go to https://www.saucedemo.com/ and check that the home page loads.

Fill in the login form with standard_user and secret_sauce.

On the products page, take a snapshot of the first available product."

(Quick note: SauceDemo is a public site created specifically for QA testing, commonly used for automation practice.)

The ability to describe tests in natural language is powerful:

It lets developers, even those with no QA experience, simply express what they want to validate in plain Portuguese or English.

Testronaut turns this informal narrative into a structured Test Plan, explicitly defining the expected results of each step.

For example, the narrative above becomes:

{
  "test_plan": {
    "name": "Homepage Load and Product Snapshot Test",
    "description": "Verify that the homepage loads, log in with the standard credentials, and capture a snapshot of the first product on the products page.",
    "steps": [
      {
        "step": 1,
        "action": "Navigate to https://www.saucedemo.com/",
        "expected_result": "Homepage loaded successfully and visible."
      },
      {
        "step": 2,
        "action": "Verify that the homepage content is correct.",
        "expected_result": "Homepage content visible and correct."
      },
      {
        "step": 3,
        "action": "Fill in the login form with username 'standard_user' and password 'secret_sauce' and submit.",
        "expected_result": "Login succeeds and redirects to the products page."
      },
      {
        "step": 4,
        "action": "On the products page, locate the first available product.",
        "expected_result": "First product located and visible."
      },
      {
        "step": 5,
        "action": "Take a snapshot of the first product.",
        "expected_result": "Snapshot of the first product saved successfully."
      }
    ]
  }
}

By structuring the intention as JSON, Testronaut can plan, execute, and document each step systematically.
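One practical benefit of the JSON shape is that the plan can be checked before anything touches a browser. The sketch below is an illustration, not code from the Testronaut repository: `Step`, `Plan`, `parse_test_plan`, and `SAMPLE` are hypothetical names, and the only thing taken from the article is the field layout (`test_plan`, `name`, `description`, `steps`, `step`, `action`, `expected_result`).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    step: int
    action: str
    expected_result: str


@dataclass(frozen=True)
class Plan:
    name: str
    description: str
    steps: list[Step]


def parse_test_plan(raw: str) -> Plan:
    """Parse a test plan JSON string and check that steps are numbered 1..N in order."""
    plan = json.loads(raw)["test_plan"]
    steps = [Step(**item) for item in plan["steps"]]
    expected = list(range(1, len(steps) + 1))
    if [s.step for s in steps] != expected:
        raise ValueError(f"steps must be numbered {expected}")
    return Plan(plan["name"], plan["description"], steps)


# A shortened plan in the same shape as the example above.
SAMPLE = json.dumps({
    "test_plan": {
        "name": "Homepage smoke test",
        "description": "Load the homepage.",
        "steps": [
            {
                "step": 1,
                "action": "Navigate to https://www.saucedemo.com/",
                "expected_result": "Homepage loaded successfully and visible.",
            }
        ],
    }
})

for s in parse_test_plan(SAMPLE).steps:
    print(f"{s.step}. {s.action} -> {s.expected_result}")
```

A malformed plan (a missing field, an extra field, or out-of-order numbering) fails here with a clear error instead of surfacing halfway through a browser session.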

2. The Executor Agent

With the Test Plan created, the natural next step was to execute it.
For that, I needed a second agent: one able to read the Test Plan and perform the corresponding actions in the browser using Playwright MCP.

Fortunately, Google ADK makes integrating with external MCP servers extremely simple.

Here is how I connected the agent to the Playwright MCP server:

from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters


async def get_playwright_mcp_toolset():
    """Create and configure the MCP toolset.

    Returns:
        Tuple[list, AsyncExitStack]: The tools exposed by the server and the
        exit stack that owns the connection.
    """
    print("Trying to connect to the Playwright MCP server...")
    # Launch the Playwright MCP server as a subprocess and talk to it over stdio.
    tools, exit_stack = await MCPToolset.from_server(
        connection_params=StdioServerParameters(
            command='npx',
            args=["@playwright/mcp@latest", "--headless"],
        )
    )
    return tools, exit_stack

And inside the agent definition:

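from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

# constants, prompt, and get_test_plan are project-local (defined elsewhere in the repository).


async def get_test_executor_agent():
    # Connect to the Playwright MCP server first; its tools are handed to the agent below.
    tools, exit_stack = await get_playwright_mcp_toolset()
    test_executor_agent = Agent(
        model=LiteLlm(
            model=constants.MODEL
        ),
        name="testronaut_test_executor",
        description="An intelligent assistant for test execution.",
        instruction=prompt.TEST_EXECUTOR_PROMPT,
        output_key="STATE_TEST_EXECUTOR",
        tools=[get_test_plan, *tools]
    )
    # The caller is responsible for closing exit_stack to shut down the MCP server.
    return test_executor_agent, exit_stack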

This way, the Executor Agent can:

Retrieve the complete Test Plan from the session state;
Iterate through it step by step;
Perform precise actions in the browser via MCP;
Analyze the updated state of the interface after each action;
Move on to the next step based on that analysis.
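The first item in that list is what a tool like `get_test_plan` is for. The version below is a guess at its shape, not the repository's code: it takes a plain mapping so it runs on its own, whereas an ADK function tool would typically read the same data from the `state` of the `ToolContext` it receives. The key `STATE_TEST_PLANNER` is hypothetical (the article only shows the executor's `output_key`), and `_strip_fence` is a helper introduced for this sketch.

```python
import json
from typing import Any, Mapping

# Hypothetical key: assumes the planner agent saves its output under this name.
PLANNER_STATE_KEY = "STATE_TEST_PLANNER"
FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable.


def _strip_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, which models often add around JSON."""
    text = text.strip()
    if text.startswith(FENCE):
        _, _, text = text.partition("\n")  # drop the opening fence line
        text = text.rsplit(FENCE, 1)[0]    # drop the closing fence
    return text


def get_test_plan(state: Mapping[str, Any]) -> dict:
    """Return the stored test plan, or an error payload the model can react to."""
    raw = state.get(PLANNER_STATE_KEY)
    if raw is None:
        return {"status": "error", "message": "no test plan in session state"}
    if isinstance(raw, str):
        try:
            raw = json.loads(_strip_fence(raw))
        except json.JSONDecodeError as exc:
            return {"status": "error", "message": f"invalid JSON: {exc}"}
    if not isinstance(raw, dict):
        return {"status": "error", "message": "test plan is not a JSON object"}
    return {"status": "success", "test_plan": raw.get("test_plan", raw)}
```

Returning an error payload instead of raising is a design choice for this sketch: the executor's model sees the message and can stop or report, rather than the whole run crashing inside a tool call.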

All that was left was to enable collaboration between the agents...

3. Assembling the Testronaut Megazord

Google ADK provides specialized agents for coordinating the execution of multiple sub-agents.
They are called Workflow Agents.

For Testronaut, I used the simplest and most effective one:
the Sequential Agent, which connects the two agents I had developed (the Planner and the Executor).

Here is what the final wiring looked like:

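from google.adk.agents import SequentialAgent

# Sub-agents run in list order: the planner first, then the executor.
sequential_agent = SequentialAgent(
    name="TestronautEntrypointAgent",
    sub_agents=[test_planner_agent, test_executor_agent],
)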

That simple!

Now the two agents collaborate seamlessly:

The Test Planner creates a structured test plan from natural-language input;
The Test Executor interprets that plan and executes it step by step in a real browser.

This collaboration forms the core of what I call Testronaut: an autonomous agent able to understand intentions, execute them precisely, and document every step along the way.
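The hand-off between the two agents is easier to picture with a toy model. The sketch below is not how Google ADK implements its Sequential Agent; it only mimics the idea that each step runs in order and leaves its result in shared state under an output key for the next step to read. `run_sequential`, `plan`, and `execute` are hypothetical stand-ins that call no model and no browser, and `STATE_TEST_PLANNER` is an assumed key (only `STATE_TEST_EXECUTOR` appears in the article).

```python
from typing import Callable

State = dict[str, object]
StepFn = Callable[[State], object]


def run_sequential(steps: list[tuple[str, StepFn]], state: State) -> State:
    """Run each step in order, storing its result in the shared state under its output key."""
    for output_key, step in steps:
        state[output_key] = step(state)
    return state


def plan(state: State) -> list[str]:
    # Stand-in for the planner: turns the request into a list of actions.
    return [f"Navigate to {state['url']}", "Check that the page is visible"]


def execute(state: State) -> list[str]:
    # Stand-in for the executor: reads the planner's output from shared state.
    return [f"done: {action}" for action in state["STATE_TEST_PLANNER"]]


final = run_sequential(
    [("STATE_TEST_PLANNER", plan), ("STATE_TEST_EXECUTOR", execute)],
    {"url": "https://www.saucedemo.com/"},
)
for line in final["STATE_TEST_EXECUTOR"]:
    print(line)
```

The order matters: swapping the two steps would make `execute` fail on a missing key, which is the same reason the planner comes first in `sub_agents`.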

Conclusion: The Work Has Only Just Begun!

Developing Testronaut helped me understand the real fundamentals behind building systems of collaborating agents.

It was not just about programming. It was about learning how intelligent agents can plan, execute, and work together toward a common goal.

Testronaut is still taking its first steps.
But the vision for its evolution is already well defined and published openly in its Manifesto.

The next milestones will be shared with all of you, and who knows, maybe a community will emerge to turn Testronaut into...

"The definitive application for behavioral testing."

Thanks for reading! 🚀
Stay tuned: the journey is just beginning.
