<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Janne Lammi</title>
    <description>The latest articles on Forem by Janne Lammi (@pathmodeio).</description>
    <link>https://forem.com/pathmodeio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913860%2F9272305a-dfa4-4d16-b788-d75f71911f66.png</url>
      <title>Forem: Janne Lammi</title>
      <link>https://forem.com/pathmodeio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pathmodeio"/>
    <language>en</language>
    <item>
      <title>Anthropic Just Made Specs Load-Bearing</title>
      <dc:creator>Janne Lammi</dc:creator>
      <pubDate>Wed, 06 May 2026 19:26:33 +0000</pubDate>
      <link>https://forem.com/pathmodeio/anthropic-just-made-specs-load-bearing-2k89</link>
      <guid>https://forem.com/pathmodeio/anthropic-just-made-specs-load-bearing-2k89</guid>
      <description>&lt;p&gt;Today Anthropic shipped &lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;Managed Agents&lt;/a&gt; — and inside it, a feature called &lt;strong&gt;Outcomes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Outcomes is small in scope and large in implication. The idea: when you dispatch an agent, you also define what success looks like. A separate grader evaluates the agent's output against those criteria and decides whether the work is done or needs another pass.&lt;/p&gt;
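
&lt;p&gt;In pseudocode, the loop is something like the sketch below. To be clear: this is a conceptual sketch, not Anthropic's API. Every name in it is invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Conceptual sketch of the Outcomes loop. Every name here is
// illustrative; this is not Anthropic's published API.
interface GradeResult {
  pass: boolean;
  feedback: string; // what fell short, fed into the next attempt
}

type Worker = (task: string, feedback: string) =&gt; Promise&lt;string&gt;;
type Grader = (output: string, criteria: string[]) =&gt; Promise&lt;GradeResult&gt;;

async function runWithOutcome(
  task: string,
  criteria: string[],
  maxAttempts: number,
  worker: Worker,   // the agent doing the work
  grader: Grader,   // a separate model judging the result
): Promise&lt;string&gt; {
  let feedback = "";
  for (let attempt = 0; attempt &lt; maxAttempts; attempt++) {
    const output = await worker(task, feedback);
    const grade = await grader(output, criteria);
    if (grade.pass) return output; // done: criteria satisfied
    feedback = grade.feedback;     // not done: another pass, with reasons
  }
  throw new Error("Outcome not met within the attempt budget");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;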

&lt;p&gt;Most of the coverage focused on the self-correction loop. The deeper story is what Outcomes &lt;em&gt;assumes&lt;/em&gt; — and what it quietly exposes.&lt;/p&gt;




&lt;h2&gt;1. Outcomes need testable success criteria&lt;/h2&gt;

&lt;p&gt;A grader can't grade a vibe.&lt;/p&gt;

&lt;p&gt;For Outcomes to do anything useful, success has to be expressible in language that a separate model can evaluate without re-doing the work. That means specific, observable, decomposable. &lt;em&gt;"The form submits and shows a confirmation."&lt;/em&gt; &lt;em&gt;"No more than one network request per keystroke."&lt;/em&gt; &lt;em&gt;"Email arrives within 30 seconds and includes the order number."&lt;/em&gt;&lt;/p&gt;
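
&lt;p&gt;One way to feel the difference is to imagine those criteria as data a grader walks through, one assertion at a time. A minimal sketch, with a shape I made up rather than any published schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Success criteria as data a separate grader can evaluate.
// The shape is illustrative, not a published schema.
interface Criterion {
  id: string;
  assertion: string; // the observable claim about the output
  checkBy: string;   // what evidence the grader should look for
}

const criteria: Criterion[] = [
  { id: "submit",
    assertion: "The form submits and shows a confirmation",
    checkBy: "confirmation element present after submit" },
  { id: "network",
    assertion: "No more than one network request per keystroke",
    checkBy: "request log grouped by input event" },
  { id: "email",
    assertion: "Email arrives within 30 seconds and includes the order number",
    checkBy: "timestamp delta and order-number match" },
];
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;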

&lt;p&gt;This is what verification has always looked like in good engineering. The difference is the audience. Until now, success criteria were a human courtesy — nice when the PM wrote them, fine when they didn't, because the developer would figure it out in review. With Outcomes, &lt;strong&gt;success criteria become the contract the agent is held to&lt;/strong&gt;. They stop being decoration and start being load-bearing.&lt;/p&gt;

&lt;p&gt;A vague Outcome doesn't fail loudly. It quietly accepts wrong work.&lt;/p&gt;




&lt;h2&gt;2. Most teams don't have them. They have tickets and Figmas.&lt;/h2&gt;

&lt;p&gt;Walk into the average product org and ask to see the success criteria for the next five tickets in the sprint. You'll get one of three answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"It's in the Figma."&lt;/em&gt; (It isn't.)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"The ACs are at the bottom of the Linear ticket."&lt;/em&gt; (Three bullets, all phrased as features.)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Talk to Sara, she knows what we're trying to do."&lt;/em&gt; (Sara is on PTO.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The intent of the work lives somewhere — in someone's head, in a Slack thread, in a comment on a design file. Almost never in a structured, retrievable, testable form.&lt;/p&gt;

&lt;p&gt;This was tolerable when humans did the building, because humans are excellent at filling in gaps. They ask follow-up questions. They reread the ticket three weeks later and figure it out. They notice when something feels wrong even if no one wrote down what right looks like.&lt;/p&gt;

&lt;p&gt;Agents don't do any of that. They execute against what they're given. Outcomes makes this concrete: the team that ships the clearer success criteria gets the better agent run. The team that doesn't ships nothing — or worse, ships something convincing but wrong.&lt;/p&gt;

&lt;p&gt;The bottleneck used to be writing code. The bottleneck is now &lt;strong&gt;knowing what good looks like, in writing, before the work starts&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;3. IntentSpec is what an Outcome looks like before it's machine-readable&lt;/h2&gt;

&lt;p&gt;This is the part most teams underestimate.&lt;/p&gt;

&lt;p&gt;An Outcome — the JSON object you hand to the grader — isn't where the work happens. It's the output of the work. The work happens upstream, when someone sits with a real piece of user friction and decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the actual &lt;strong&gt;objective&lt;/strong&gt;? (Not the feature. The change in user behavior.)&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;outcomes&lt;/strong&gt; would prove this worked? (Observable, decomposable, testable.)&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;edge cases&lt;/strong&gt; must hold? (The boring failure modes that ship bugs.)&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;constraints&lt;/strong&gt; can't be violated? (Invariants that don't show up in a happy path.)&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;evidence&lt;/strong&gt; grounds these decisions? (Tickets, interviews, telemetry — not vibes.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a spec. Specifically, it's an &lt;a href="https://pathmode.io/playbook/anatomy-of-agent-ready-spec" rel="noopener noreferrer"&gt;agent-ready spec&lt;/a&gt;. When the spec is good, exporting it to an Outcome is a translation step. When the spec is missing, no amount of grader sophistication can rescue the run.&lt;/p&gt;

&lt;p&gt;We've been calling this artifact an &lt;strong&gt;IntentSpec&lt;/strong&gt;. The name doesn't matter. What matters is that the artifact exists, persists, and stays anchored to the evidence that justified it.&lt;/p&gt;
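
&lt;p&gt;For concreteness, here is one possible shape, sketched in TypeScript. The fields mirror the list above, and the example values reuse the confirmation-email criterion from earlier. Nothing about this is canonical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// One possible shape for an IntentSpec. A sketch, not a canonical
// schema; exporting it to an Outcome is mostly a field mapping.
interface IntentSpec {
  objective: string;      // the change in user behavior, not the feature
  outcomes: string[];     // observable, testable success criteria
  edgeCases: string[];    // the boring failure modes that ship bugs
  constraints: string[];  // invariants the happy path never exercises
  evidence: string[];     // tickets, interviews, telemetry that justify it
}

const spec: IntentSpec = {
  objective: "Buyers trust that their order went through",
  outcomes: ["Email arrives within 30 seconds and includes the order number"],
  edgeCases: ["Order placed while the mail queue is backed up"],
  constraints: ["No duplicate confirmation emails"],
  evidence: ["support tickets tagged 'did my order go through?'"],
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;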




&lt;h2&gt;What this means for your team&lt;/h2&gt;

&lt;p&gt;Outcomes makes the spec the load-bearing artifact in agentic delivery. That's a one-sentence reframe with three uncomfortable consequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost of vague tickets just went up.&lt;/strong&gt; Before, vague tickets cost a clarification meeting. Now they cost an agent run that completes successfully and produces the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec quality becomes a measurable input.&lt;/strong&gt; When the grader rejects the work, you don't have a model problem — you have a specification problem. That's a much faster feedback loop than waiting for QA to find the bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specs need to live somewhere persistent.&lt;/strong&gt; Not in a Figma comment. Not at the bottom of a Linear ticket. Not in Sara's head. Somewhere the agent — and the next agent, and the next teammate — can read tomorrow and know what done means.&lt;/p&gt;

&lt;p&gt;Anthropic just made specs the contract. The teams that already write them well are about to look extremely smart. The teams that don't are about to find out, expensively, what an agent does with ambiguity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://pathmode.io/blog/anthropic-just-made-specs-load-bearing" rel="noopener noreferrer"&gt;pathmode.io&lt;/a&gt;. Pathmode is the intent engineering platform — we help product teams write specs that agents can actually execute against.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>anthropic</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Three-Person Team</title>
      <dc:creator>Janne Lammi</dc:creator>
      <pubDate>Tue, 05 May 2026 11:44:14 +0000</pubDate>
      <link>https://forem.com/pathmodeio/the-three-person-team-2h0j</link>
      <guid>https://forem.com/pathmodeio/the-three-person-team-2h0j</guid>
      <description>&lt;p&gt;For most of my career, building software meant coordinating people who weren't in the same room, on the same artifact, or even on the same week of work. Specs went one place, designs went another, code came later. We built whole categories of tools to manage the gaps between them.&lt;/p&gt;

&lt;p&gt;I've been thinking about what happens when those gaps close.&lt;/p&gt;

&lt;p&gt;The old math made sense for a long time. Implementation was expensive and slow, so it paid to specialize and hand off. A team of twelve was reasonable, because no single person could hold the full picture, and the work itself took long enough to justify the coordination overhead.&lt;/p&gt;

&lt;p&gt;That math is changing, and the shape of the team is changing with it.&lt;/p&gt;

&lt;h2&gt;Three lenses, one loop&lt;/h2&gt;

&lt;p&gt;When AI absorbs &lt;a href="https://pathmode.io/blog/disappearing-middle" rel="noopener noreferrer"&gt;the middle of the work&lt;/a&gt; — the actual writing of code — a team stops scaling by adding hands. &lt;strong&gt;It scales by adding clearer judgment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The shape that keeps showing up, when I look at the teams that seem to be working best, is small. Maybe three people. The number matters less than the lenses they bring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;design lens&lt;/strong&gt; — someone holding user reality. They notice the friction, and carry the taste.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;engineering lens&lt;/strong&gt; — someone holding system reality. They know what's possible, what's fragile, and what needs to scale.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;product lens&lt;/strong&gt; — someone holding business reality. They know the bet, the constraint, and the why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't really job titles. They're responsibilities for the kinds of judgment a team can't outsource to an agent. Everyone on a small team prototypes, talks to users, reads analytics. But each person owns one form of reality, and the team doesn't ship until those three forms agree.&lt;/p&gt;

&lt;p&gt;You can see the shift starting to land in hiring. Job listings now ask for designers "comfortable operating without a defined brief," who can "prototype in code and present to stakeholders in the same afternoon," who "ship without waiting for research." Those aren't the traditional expectations of a senior designer. They describe someone carrying a lens on a small, AI-native team. The market is beginning to hire for shape of capability, not years of experience.&lt;/p&gt;

&lt;h2&gt;One product, one surface&lt;/h2&gt;

&lt;p&gt;The harder shift to defend, if you've spent years building inside a larger org, is what happens to handoffs.&lt;/p&gt;

&lt;p&gt;Handoffs used to be reasonable. PMs wrote specs because engineers couldn't read minds. Designers built Figma files because engineers couldn't ship without pixels. Tickets existed because the work was sliced across roles before it was sliced across time.&lt;/p&gt;

&lt;p&gt;What's happening on small AI-native teams is different. They work the same product, against the same artifact, at the same time. The product person isn't writing a doc the engineer will read next sprint. They're editing a living spec that the designer and engineer are also editing, while an agent prototypes against it in the background.&lt;/p&gt;

&lt;p&gt;Async work doesn't disappear. Deep work, time zones, review, reflection — those still matter. &lt;strong&gt;What disappears is the handoff as the default operating model.&lt;/strong&gt; The artifact passed between people stops being a Jira ticket and becomes something more like a shared intent: live, current, executable.&lt;/p&gt;

&lt;p&gt;There's no handoff because there's no gap.&lt;/p&gt;

&lt;h2&gt;The unit of work moves upstream&lt;/h2&gt;

&lt;p&gt;When implementation gets cheap, the bottleneck doesn't disappear. It moves.&lt;/p&gt;

&lt;p&gt;The old unit was the pull request. The new unit, more and more, is the &lt;a href="https://pathmode.io/blog/spec-is-the-product" rel="noopener noreferrer"&gt;spec&lt;/a&gt;. Teams that work this way converge on intent before they touch code, because that's where the leverage is. Five minutes of clearer intent saves an hour of regenerated code.&lt;/p&gt;

&lt;p&gt;I think this is what makes the trio possible at all. Without a shared, structured definition of what's being built and why, three people pulling in three directions aren't a team. They're three threads of work an agent ships at speed.&lt;/p&gt;

&lt;p&gt;A concrete example. A churned customer says onboarding felt "empty" after they signed up. The product lens frames the business risk: activation is stalling. The design lens names the experiential gap: the user has no obvious next action. The engineering lens surfaces the system constraint: setup state is fragmented across three services. The spec they write isn't "add an onboarding checklist." It's something closer to: &lt;em&gt;make the first meaningful project action obvious within two minutes of signup, without introducing a parallel setup flow.&lt;/em&gt; That sentence is the unit of work. The agent builds against it, and the team verifies against it.&lt;/p&gt;

&lt;p&gt;The substrate the team works on isn't a Notion doc, or a Figma file, or a backlog. It's the intent itself: versioned, structured, traceable back to the friction that justified it.&lt;/p&gt;
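
&lt;p&gt;To make that less abstract, the onboarding spec above might persist as something like the sketch below. The field names are mine, for illustration; the point is the traceability, not the schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// The onboarding spec from above as a versioned, structured intent.
// Field names are illustrative; what matters is the trace back to friction.
const intent = {
  version: 3,
  objective:
    "Make the first meaningful project action obvious within two minutes of signup",
  constraints: ["No parallel setup flow"],
  friction: [
    "churned-customer interview: onboarding felt 'empty' after signup",
    "activation metric stalling",
    "setup state fragmented across three services",
  ],
  verify: [
    "median time to first project action under two minutes for new signups",
  ],
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;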

&lt;h2&gt;Evidence instead of phases&lt;/h2&gt;

&lt;p&gt;The other thing that changes is when discovery happens.&lt;/p&gt;

&lt;p&gt;The teams I grew up in ran discovery in phases. A research sprint, then a planning sprint, then implementation. The customer sat somewhere upstream of the work, separated from the build by weeks of process and a slide deck nobody read twice.&lt;/p&gt;

&lt;p&gt;A small AI-native team doesn't really do phases. Friction signals (support tickets, session recordings, sales calls, churned-user interviews) flow into the same surface where specs live. Evidence stops being something a PM presents and starts being the substrate the team builds against.&lt;/p&gt;

&lt;p&gt;When friction enters the system, the team sees it. When a spec exists, it points back to the friction that justified it. When an agent ships a feature, the team can ask the boring, important question: did the friction actually decrease?&lt;/p&gt;

&lt;p&gt;This is the loop that bigger teams used to break through specialization. Smaller teams don't have to break it.&lt;/p&gt;

&lt;h2&gt;Judgment is what's left&lt;/h2&gt;

&lt;p&gt;I'm wary of writing anything that sounds like "X is the only thing that matters." Lots of things still matter — distribution, customer trust, proprietary data, domain depth, the brand you've built over years.&lt;/p&gt;

&lt;p&gt;But on top of all of that, when the cost of building drops, judgment becomes the new differentiator. The ability to look at a hundred possible features and pick the one that actually matters. The ability to look at five prototypes and feel which one is right. The ability to kill an idea your agent could ship in an afternoon, because it's the wrong idea.&lt;/p&gt;

&lt;p&gt;Velocity used to separate teams, but now velocity is becoming a commodity. What's left is the judgment behind what's being built, and the ability to articulate that judgment clearly enough that another person, or an agent, can act on it.&lt;/p&gt;

&lt;h2&gt;Where this falls apart&lt;/h2&gt;

&lt;p&gt;The honest version of this argument has a limit.&lt;/p&gt;

&lt;p&gt;The three-lens team works when the problem space fits in three heads. It doesn't work as well for deep platform work, for regulated domains, for systems where a wrong intent has catastrophic consequences. A small team probably shouldn't be alone behind a hospital records system or a payments backbone. There's still a strong case for specialization where the cost of being wrong is high.&lt;/p&gt;

&lt;p&gt;But for the layer of software most teams actually build — applications, products, internal tools, features — this shape is already starting to show up. Not everywhere. In the teams I'm watching most closely, it already is the operating model.&lt;/p&gt;




&lt;p&gt;A small group, one product, one shared surface. Handoffs replaced by something closer to a shared intent that everyone — and every agent — can see.&lt;/p&gt;

&lt;p&gt;I don't think AI-native teams are getting smaller because companies want fewer people. I think they're getting smaller because the core work has shifted, from coordinating execution to shaping what's being built. The team of the next decade isn't a smaller version of the one you have today. It's a different shape.&lt;/p&gt;

&lt;p&gt;The headcount drop is the symptom. The change in shape is the point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>leadership</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
