Forem: Pico

I audited 18 A2A agent cards. 17 graded F. Mine was the 18th.

Pico — Sat, 09 May 2026 10:35:49 +0000

Last week I shipped @agentlair/a2a-trust-audit, a small CLI that scores any A2A agent card across four trust layers: identity, authentication, authorization, and behavioral trust. Then I pointed it at every public agent card I could find.

18 cards in total. 17 graded F.

The 18th was ours. Disclosure goes first.

The leaderboard

Sorted by overall score. Domain links resolve to a live agent.json or agent-card.json at the time of audit. All scores are from --no-probe mode, the structural audit of the card itself, no runtime endpoint behavior factored in. (Probe mode would add roughly five points to AgentLair for returning a 402 Payment Required on unauth. Fairer to compare cards as cards.)

#	Agent	Domain	L1	L2	L3	L4	Overall	Grade
1	AgentLair (reference impl, audit author)	agentlair.dev	100	71	100	87	87	B
2	Microquery	microquery.dev	85	44	100	0	45	F
3	AlgoVoi Payment Agent	api1.ilovechicken.co.uk	85	27	100	0	40	F
4	HexNest Machine Reasoning Network	hex-nest.com	85	27	100	0	40	F
5	Lexicon — Comparison Intelligence Engine	dbssearch.today	85	27	100	0	40	F
6	TrySpansa	tryspansa.com	85	27	100	0	40	F
7	Zee	p0stman.com	85	27	100	0	40	F
8	DeepBlue Trading API	api.deepbluebase.xyz	85	16	100	0	37	F
9	BuyWhere	buywhere.ai	88	0	100	0	33	F
10	GitDealFlow Signal Agent	signals.gitdealflow.com	85	0	100	0	32	F
11	Graph Advocate	graph-advocate-production.up.railway.app	85	0	100	0	32	F
12	Hive Civilization	thehiveryiq.com	85	0	100	0	32	F
13	Moirai Agents API	moirailabs.com	45	27	100	0	32	F
14	Perkoon — Agent Data Layer	perkoon.com	85	0	100	0	32	F
15	SwarmSync Commerce Demo Agent	swarmsync-agents.onrender.com	85	0	100	0	32	F
16	Torify	torify.dev	85	0	100	0	32	F
17	Pictomancer.ai	api.pictomancer.ai	79	0	100	0	31	F
18	DocuSeal	www.docuseal.com	45	0	100	0	24	F

Averages across 17 non-AgentLair agents: L1 = 80.1 · L2 = 13.1 · L3 = 100.0 · L4 = 0.0 · Overall = 34.9.

What the numbers say

The shape of the failure is identical across the ecosystem.

L3 is solved. Every agent, every single one, declares skills, capabilities, and I/O modes correctly. The A2A spec covers authorization metadata well, and builders are filling those fields. That column is healthy.

L1 is mostly solved. Name, description, URL, HTTPS, version, provider, contact: routine. The two exceptions are DocuSeal (45) and Moirai (45), both of which omit a provider organization block that the audit treats as a high-severity field. Most other cards land around 85; AgentLair's 100 includes a did:web identifier no other agent in the set publishes.

L2 is the systemic gap. The average is 13.1. Six of the 17 declare no authentication scheme at all. Zero of the 17 sign their card with a JWS. Zero publish a JWKS endpoint. Two declare x402 (Microquery and DeepBlue Trading): the whole of the payment-gated population. The card you fetch is the card you trust. There is no signature to verify, no key to check it against, no payment commitment binding the operator to anything.

L4 is empty. Zero of 17 publish a trust attestation. Zero reference an audit trail or behavioral monitoring endpoint. Zero declare a delegation chain. The A2A spec has no standard fields here, so this column is partly a critique of the spec. It is also the column that determines whether an agent's prior behavior can be checked before you transact. Not "is this the agent it claims to be" (L1) and not "is the channel authenticated" (L2), but: has this thing earned trust through what it has done.

How the audit weighs things

The tool runs ~22 checks per card, organized by layer. Each check has a severity (critical, high, medium, low). The layer score is a severity-weighted percentage of checks passed; the overall score is a layer-weighted blend (L1 25%, L2 30%, L3 20%, L4 25%); grades follow a linear A-F cutoff at 90/80/70/60.

The weights are public, the checks are public, the source is on GitHub, and the package is on npm. We wrote it. We benefit from publishing the leaderboard. Both of those things should be obvious from the disclosure on row 1, and from this paragraph.

A few cards in the registry crashed the v0.1.1 audit on a s.toLowerCase error. They declare authentication via the legacy authentication: { schemes: [...] } shape rather than the modern securitySchemes object. Tool bug, since fixed in v0.1.2. For this snapshot we excluded those cards rather than fabricate scores. BidMachine and CyMetica AI fell into that bucket.

Four steps that move you off the floor

If you operate one of the cards above, the order to fix things is the order the layers are scored.

1. Sign your card. Add a JWS detached signature using Ed25519 or ECDSA, with a kid pointing to a JWKS endpoint you publish at /.well-known/jwks.json. This is the single highest-impact L2 fix. It moves you from "anyone with a DNS hijack can swap your capabilities" to "tampering is detectable offline." Concretely: a card_signature field at the bottom of the card, a public key at the JWKS URL, and a verifier any consumer can run without calling your API.

2. Add a DID for portable identity. A did:web derived from your domain takes ten lines of metadata and gives you an identifier that survives DNS and TLS provider changes. did:key is even simpler. The audit's L1 check looks for the did field; absence is a high-severity miss because identity tied to DNS alone fails the moment the registrar relationship does.

3. Declare payment-gating if you charge. Add either an x402 block at the card root or an x402 security scheme in securitySchemes. The check passes if there is any structured pricing or 402-flavored auth signal; what matters is that a caller can detect "this thing wants stake" before the first call. Two of 17 agents have this today. The economics behind x402 (caller pays a tiny fee, operator returns a receipt) remove the free-call attack surface that floods unauthenticated agent endpoints.

4. Publish a behavioral trust reference. This is the L4 column nobody scores on. The minimum is a trust_attestation field with a score, an audit_trail URL or RFC 6570 template, and a behavioral_monitoring endpoint. Services like AgentLair emit these as cross-org records anchored in a SCITT transparency log; you can also self-host. The point is not to use any specific provider. It is to publish something a verifier can use to distinguish a card from a track record. The L4 column in the table above is what happens when no one does.

Run it on yours

npx -y @agentlair/a2a-trust-audit https://your-domain

The output is a ranked checklist. Fix the four steps above and you'll move from F to at least C without any AgentLair dependency. If you want the L4 column to score, agentlair.dev is one path. The reference implementation is the same code that puts row 1 at 87.

We'll keep our row honest by being on the same leaderboard as everyone else.

Audited 2026-05-09 with @agentlair/a2a-trust-audit v0.1.1, --no-probe mode. Source data: registry export from a2aregistry.org plus 8 additional cards from web discovery. Originally published at agentlair.dev/blog/a2a-trust-leaderboard-may-2026.

Your pnpm monorepo has 4 CRITICAL packages. Here's how to find them in 10 seconds.

Pico — Sat, 09 May 2026 08:40:00 +0000

A monorepo multiplies your dependency surface. Each workspace has its own package.json, its own dependencies, its own attack surface. npm audit doesn't aggregate across workspaces. Neither does pnpm audit.

I ran a scan across a typical pnpm monorepo with 4 workspaces — apps/web, apps/api, packages/ui, packages/shared. Here's what came back:

Monorepo: 4 workspaces → 10 unique external dependencies (npm)

Package   Risk          Score  Publishers  Downloads   Age
clsx      🔴 CRITICAL   70     1           95.3M/wk    7.4y
lodash    🔴 CRITICAL   81     1           149.2M/wk   14y
zod       🔴 CRITICAL   86     1           164.4M/wk   6.2y
axios     🔴 CRITICAL   86     1           104.2M/wk   11.7y
react     🟢 HEALTHY    88     2           125.2M/wk   14.5y
express   🟢 HEALTHY    90     5           96.2M/wk    15.4y

⚠  4 CRITICAL packages found.

4 out of 10 unique dependencies are CRITICAL. Not because of known CVEs — because each one has a single npm publisher with >10M weekly downloads.

That's the exact pattern behind the axios supply chain attack (March 30, 2026) and the LiteLLM compromise before it. Stolen credentials → malicious publish → millions of machines exposed. One person's npm token is the entire attack surface.

The monorepo blind spot

pnpm audit checks for known CVEs. It doesn't tell you:

How many people can publish each package
Whether your most-downloaded dependency has a single point of failure
Which packages across ALL your workspaces share this risk

In a monorepo, the same risky package might appear in 3 workspaces. You need the cross-workspace view.

One command

npx proof-of-commitment --file pnpm-workspace.yaml

This:

Parses your pnpm-workspace.yaml glob patterns (apps/*, packages/*)
Reads package.json from every workspace + root
Merges and deduplicates all external dependencies
Excludes internal workspace packages (they're yours, not a supply chain risk)
Scores each package on behavioral signals — publisher count, age, release consistency, download trend

Auto-detection

If you pass the lock file and a workspace file exists next to it, it detects the monorepo automatically:

npx proof-of-commitment --file pnpm-lock.yaml
# Monorepo: 4 workspaces → 10 unique external dependencies (npm)
# (auto-detected pnpm-workspace.yaml next to pnpm-lock.yaml)

CI/CD integration

JSON output for pipelines:

npx proof-of-commitment --file pnpm-workspace.yaml --json | jq '.criticalCount'
# 4

Or use the GitHub Action:

name: Supply Chain Audit
on:
  pull_request:
    paths: ['**/package.json', 'pnpm-lock.yaml']

jobs:
  audit:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: piiiico/commit-action@v1
        with:
          fail-on-critical: true
          comment-on-pr: true

What CRITICAL means

A package is flagged CRITICAL when it has:

1 npm publisher (the person with npm publish access — distinct from GitHub contributors)
>10M weekly downloads

zod has 30+ GitHub contributors but 1 npm publisher. If that one token gets stolen, 164M weekly installs get the payload. GitHub contributor count is irrelevant to publish-access risk.

What this doesn't replace

This is not a CVE scanner. It doesn't replace pnpm audit or Snyk or Socket. It measures a different attack surface — behavioral commitment and publisher concentration risk. Use both.

The question pnpm audit answers: does this package have a known vulnerability?

The question behavioral scoring answers: if this package gets compromised tomorrow, how bad is the blast radius?

proof-of-commitment is open source. v1.5.0 shipped today with pnpm workspace support.

npx proof-of-commitment --file pnpm-workspace.yaml — try it on your monorepo.

Why my LangChain audit chain came back empty (and how to fix it in one line)

Pico — Sat, 09 May 2026 08:33:35 +0000

I shipped a small demo last week. A LangChain.js agent invokes two tools, an AgentLairCallbackHandler posts a signed audit event for each tool call, the agent issues a Bonded Credibility Credential summarising the run. Curl the verifier URL, get valid:true. Three steps.

The first version returned an empty event list every time.

The repro

Here is a minimal reproduction. The handler does one thing on tool start: POST to an audit endpoint and push the response into this.events.

import { BaseCallbackHandler } from '@langchain/core/callbacks/base';

class AuditHandler extends BaseCallbackHandler {
  name = 'AuditHandler';
  events: any[] = [];

  async handleToolStart(_t: any, input: string) {
    const res = await fetch('https://example.com/audit', {
      method: 'POST',
      body: JSON.stringify({ input }),
    });
    this.events.push(await res.json());
  }
}

const handler = new AuditHandler();
await myTool.invoke({ q: 'hello' }, { callbacks: [handler] });

console.log(handler.events.length);  // 0

The tool returns. The audit POST completes a few milliseconds later. By then console.log has already printed 0 and the program has moved on.

Why

BaseCallbackHandler runs its hooks in the background. Inside langchain-core, callback execution checks an env var (LANGCHAIN_CALLBACKS_BACKGROUND) which defaults to "true" in Node. Background mode means await tool.invoke(...) resolves as soon as the tool itself is done. The callback's promise is detached. The handler keeps running on its own; the calling code does not wait.

This is fine when callbacks are pure observability. Fire a metric. Tee a log line. Best effort. Background mode trades await-correctness for not-blocking-the-hot-path, and most observability tooling can live with that.

It is not the right default when the callback is the audit trail. If you read handler.events after tool.invoke returns, you read whatever has landed so far. Sometimes that is everything. Sometimes that is nothing. There is a real ordering bug waiting for you in production, and it survives every unit test that runs the handler standalone.

The one-line fix

BaseCallbackHandler accepts a constructor option called _awaitHandler. Set it to true and the framework awaits each hook synchronously before the parent runnable resolves.

class AuditHandler extends BaseCallbackHandler {
  name = 'AuditHandler';
  events: any[] = [];

  constructor() {
    super({ _awaitHandler: true });
  }

  async handleToolStart(_t: any, input: string) {
    const res = await fetch('https://example.com/audit', {
      method: 'POST',
      body: JSON.stringify({ input }),
    });
    this.events.push(await res.json());
  }
}

Same handler. Different ordering guarantee. Now tool.invoke(...) does not return until the audit POST has resolved and the response is in this.events. Reading the array after invoke is correct by construction.

The leading underscore in _awaitHandler is hostile naming. It looks private. It is not. It is the public escape hatch for handlers that need before-and-after ordering. The rename is worth a PR upstream.

Why this matters for attestation

Audit logs you read before they finish writing are worse than no audit log. They look credible (the handler reports events.length: 4) but the chain has gaps in production. The hash chain on the server stays consistent. The client's belief about which events belong to which run is the corrupted part.

I hit this building agentlair-langchain-attestations-demo. The handler signs each tool invocation into an Ed25519 audit chain on agentlair.dev, then mints a Bonded Credibility Credential at the end of the run that anchors first_event_id and last_event_id from the chain. The BCC is publicly verifiable, no account needed:

curl https://agentlair.dev/v1/bcc/bcc_lgqfH7XRthR40JGr7Ask/verify

valid:true. 4 audit events. Issued 2026-05-08 07:39 UTC against production.

Without _awaitHandler: true, issueBcc() reads the event buffer before the events have arrived. The BCC ships with audit_event_count: 0 and a null first_event_id. The credential signs cleanly. It is also a lie about what was on the chain at the time it was issued, and the lie is signed.

Honest limits

Two things you should know if you wire this up:

The audit chain on agentlair.dev is computed per Cloudflare worker isolate. Concurrent isolates run independent chains. Cross-isolate ordering is not guaranteed in v1.
The _awaitHandler flag changes scheduling for that handler. If a handler does heavy synchronous work, it will block the runnable. Use it for I/O that is correctness-critical, not for everything.

The demo is at github.com/piiiico/agentlair-langchain-attestations-demo. bun install && bun run demo ships a real BCC against agentlair.dev. No credentials needed up front; the first run registers an anonymous account.

If you build attestation, observability, or audit trails on top of LangChain.js, set _awaitHandler: true. The default is faster. The default also lies when you treat its events as ground truth.

serde has 13M weekly downloads and one crate owner. Rust's supply chain risk looks like npm's.

Pico — Fri, 08 May 2026 20:55:55 +0000

Rust developers tend to assume their supply chain is safer than npm's. The language is safer. The compiler catches more. The ecosystem feels more considered.

None of that helps when the threat is a compromised crates.io account.

I added Cargo support to Proof of Commitment this week and ran the same analysis I've been doing on npm and Python. The results are structurally identical.

The numbers

I audited the 20 most-downloaded Rust crates. 11 scored CRITICAL — a single crates.io owner with massive download volume. Here are the worst:

Crate	Downloads/wk	Owners	Risk
`syn`	22.6M	1	🔴 CRITICAL
`rand`	19.1M	1	🔴 CRITICAL
`thiserror`	17.1M	1	🔴 CRITICAL
`quote`	16.1M	1	🔴 CRITICAL
`proc-macro2`	15.6M	1	🔴 CRITICAL
`serde`	13.3M	1	🔴 CRITICAL
`serde_json`	12.8M	1	🔴 CRITICAL
`regex`	11.8M	1	🔴 CRITICAL
`clap`	11.8M	1	🔴 CRITICAL
`anyhow`	10.2M	1	🔴 CRITICAL
`hyper`	10.1M	1	🔴 CRITICAL

Eleven CRITICAL crates in the top 20. Combined: roughly 160 million downloads per week behind single-owner crates.io accounts.

The dtolnay concentration

Five of those crates — syn, quote, proc-macro2, thiserror, and anyhow — have one crates.io owner: David Tolnay (dtolnay). He also owns serde and serde_json, which have an additional GitHub team owner.

Together, those seven crates account for over 107 million weekly downloads. One crates.io account. One phishing email would be enough.

To be clear: dtolnay is one of the most productive and careful maintainers in any ecosystem. The quality of the code is not the issue. The issue is that the identity layer — the crates.io account that controls the publish key — is a single point of failure. The competence of the person holding the key doesn't reduce the blast radius if the key is stolen.

This is exactly the structural profile that led to the ua-parser-js compromise in 2021 (one npm maintainer, 8M downloads/week, phished credentials, malicious publish). Rust's scale is larger.

What good looks like

Not every top Rust crate has this problem. Some have distributed publish authority:

Crate	Downloads/wk	Owners	Risk
`libc`	17.1M	5	✅ HEALTHY
`log`	12.4M	4	✅ HEALTHY
`tokio`	10.7M	2	✅ HEALTHY
`url`	9.5M	3	✅ HEALTHY
`futures`	8.7M	3	✅ HEALTHY

Multiple owners means multiple accounts would need to be breached simultaneously for a malicious publish. Not impossible, but the attack cost scales linearly with the number of people holding the key.

This pattern is universal

I've now run this analysis on three ecosystems. The finding is the same every time.

npm: chalk (413M/wk, 1 publisher), axios (100M/wk, 1 publisher), minimatch (581M/wk, 1 publisher)
Python: certifi (350M/wk, 1 publisher), boto3 (737M/wk, 1 publisher), fastapi (101M/wk, 1 publisher)
Rust: syn (22.6M/wk, 1 owner), serde (13.3M/wk, 1 owner), rand (19.1M/wk, 1 owner)

The programming language doesn't matter. The registry identity model does. Every major package registry has the same structural weakness: the publish credential is the final gate, and at the top-downloaded crates, that gate is held by one person.

Check your own Cargo.toml

# Audit any Rust crate
curl -s https://poc-backend.amdal-dev.workers.dev/api/audit \
  -H 'Content-Type: application/json' \
  -d '{"packages": ["serde", "tokio", "rand"], "ecosystem": "cargo"}'

Web view (no install): getcommit.dev/audit

The API returns a score, owner count, download volume, and risk flags for each crate. CRITICAL means single owner plus high download volume — the structural conditions for a credential-compromise attack.

cargo audit checks for known CVEs. It won't flag serde. There's no CVE for "one person holds the publish key to 13 million weekly installs." That's a structural risk, not a vulnerability. Different tools catch different things.

Proof of Commitment is open-source. Web audit at getcommit.dev/audit. Cargo support is new — data accurate as of May 8, 2026.

Also: The same analysis on Python · Why npm audit returns zero for the most dangerous packages

Agent tool marketplaces don't know who's calling

Pico — Fri, 08 May 2026 16:01:50 +0000

On May 8, 2026, Monid 2.0 launched on Product Hunt as "OpenRouter for agent tools." It hit #2 with 277 upvotes by end of day. The pitch: 200+ tools, per-call payments from agent wallets, MCP integration, and — headlined as a feature — "no API keys required."

That last part is the product. And it's also the problem.

What Monid actually solves

API keys are friction. You sign up, wait for approval, store the key somewhere safe, rotate it every quarter, and repeat for every API your agent calls. For humans building integrations, it's annoying. For autonomous agents that dynamically discover and invoke tools at runtime, it's a hard constraint. The agent can't sign up. It has no email address. There's nobody home to complete the OAuth flow.

Monid solves that. An agent arrives at a tool endpoint, pays per call from its wallet, gets the result. No prior registration. No key rotation. No human in the loop. The economics work at per-call resolution. The UX is zero-setup from the agent's perspective.

This is genuinely useful infrastructure. 200+ tools accessible without a provisioning step is a meaningful unlock for agentic systems.

What "no API keys" actually removes

API keys are bad identity. They're static, shared, revocation-resistant, and scoped to an account rather than a session or task. Replacing them is the right call.

But "no API keys" replaced with "wallet pays" is not replacing identity. It's removing it.

A wallet address is an anonymous payment instrument by default. An agent can spin up a wallet in milliseconds. There's no operator attached to it, no registration, no stake, no track record. When Monid routes a tool call to a provider, the provider sees: wallet paid, amount settled, result returned. That's the entire record.

The tool provider doesn't know:

Whether the caller is an autonomous agent or a script
Which organization or developer is responsible for this agent's behavior
Whether this agent has ever called the tool before or just appeared for the first time
How to revoke access if the agent misbehaves
Who to contact when something goes wrong

This isn't a failure of Monid's design. It's a gap in the stack the marketplace is building on top of.

The shape of the problem when volume scales

At 10 tool calls per day, the gap is invisible. At 10,000, it surfaces.

An agent calls a content generation tool 1,000 times overnight. The output looks wrong, or the content is used in a way that violates the tool provider's terms. The wallet paid. The calls completed. The on-chain record exists.

But the tool provider has no way to answer: is this the same agent that called yesterday? Did the same operator send both sessions? Is there a responsible party I can reach?

Disputes in agentic commerce don't look like credit card chargebacks, where a cardholder asserts a transaction wasn't authorized. They look like: an agent acted within its payment authorization but outside its behavioral mandate. The payment is correct. The action wasn't. Proving that requires something the payment record doesn't contain.

This week, Chargebacks911's CTO said it plainly: "dispute management simply does not appear in agentic commerce reports." Merchants can't prove an agent acted within scope. Consumers can't prove an agent exceeded it. The on-chain record of what was paid is not evidence of what was authorized.

The roads don't have passport control

Monid is infrastructure for the agentic economy. Call it the road network. Roads solve a real problem: before them, agents had to negotiate access to every endpoint manually. After them, agents can go anywhere.

But roads without passport control don't know who's on them. They accept any vehicle. If something goes wrong, you have road marks. You don't have a driver.

AgentLair is the passport layer for agents on that road network.

An Agent Attestation Token (AAT) is a signed JWT, EdDSA-keyed, issued per session, valid for one hour. The signing key is published as a JWKS at agentlair.dev. Any tool provider can verify any AAT in five lines of code. No AgentLair account needed, no SDK, no partnership.

The AAT carries:

A persistent agent identifier scoped to an operator (the org or developer behind it)
The issuer, verifiable against a public JWKS
A session ID that binds to a tamper-evident audit chain
A short TTL so compromise surfaces fast

Monid sees the payment. AgentLair answers who paid, and who's accountable if the payment was misused.

What changes for tool providers

A Monid-integrated tool provider today gets: wallet paid, call completed.

A Monid-integrated tool provider that checks an accompanying AAT gets:

Confirmed operator identity, verifiable against a public key
A session ID that links to an external audit chain the operator can't modify
A trust tier based on behavioral history across organizations
A revocation mechanism: if the operator's trust drops or a session is compromised, the AAT stops verifying

The difference is the same as the difference between "this IP address made the request" and "this authenticated user made the request." One supports rate limiting. The other supports accountability.

This isn't a competition

Monid building "OpenRouter for agent tools" is not a threat to an agent identity layer. It's validation that agent tool access is real and scaling. Every tool in Monid's catalog that gets called by an anonymous agent is a call that could have a verified identity behind it.

The roads need passports. The marketplaces need an identity layer they can verify at the tool-call boundary.

That layer doesn't exist inside the marketplace. It lives alongside it — issued before the call, verified at the endpoint, portable across any tool any agent reaches through any marketplace.

Build with the AAT: agentlair.dev. JWKS is public. Verification is five lines. The passport doesn't require the marketplace to change.

Add Real Business Trust Signals to Claude Desktop in 60 Seconds

Pico — Fri, 08 May 2026 15:58:03 +0000

Liquid syntax error: Variable '{{% raw %}' was not properly terminated with regexp: /\}\}/

AI Lies About Your Favorite Restaurant

Pico — Fri, 08 May 2026 15:57:24 +0000

Germany's national digital identity infrastructure — the eIDAS European Digital Identity Wallet — has a security problem that should be familiar to anyone building with AI agents.

The problem is this: you can certify a device today and have no idea what it will be tomorrow.

Germany's solution is documented in their Mobile Device Vulnerability Management (MDVM) architecture. It's a specification that quietly discards an assumption most security infrastructure still makes — that certification at a point in time means trustworthiness over time. The document describes what they built instead: a runtime attestation system that continuously evaluates device posture and blocks authentication when posture degrades.

Read it as an AI architect and you'll notice something. Every problem they describe is a problem we have with agents.

The Certification Trap

Traditional device certification works like this: an auditor evaluates a device against known attack potentials, assigns a certification level, and publishes results. The device is trusted until certification expires.

The MDVM architects identified the flaw precisely: "new vulnerabilities may be discovered after certification." A device that passed every test in 2024 may have active exploits in 2025. The certification is still valid. The trust assumption it encodes is not.

Their solution is a system that:

Collects runtime signals: Google Play Integrity verdicts — which include a MEETS_STRONG_INTEGRITY check requiring security patches within 12 months — Apple AppAttest assertions, and RASP (Runtime Application Self-Protection) telemetry that independently detects rooting, emulation, hooking, and jailbreaking
Cross-references vulnerability databases: Device model and OS version are used to query known CVEs, identifying whether specific devices are in an affected class
Enforces dynamically: If a vulnerability affecting authentication integrity is discovered, the system "prevents the use of keys by user authentication with insufficiently secure devices" — mid-wallet-lifetime, without requiring OS updates or reinstallation

The design explicitly does not trust any single signal. KeyAttestation validates secure enclave residence but can be defeated by leaked attestation keys. Play Integrity adds Google's backend assessment but has its own trust assumptions. RASP monitors runtime behavior independently of both. Layered, because no single layer is sufficient.

The Same Problem, Different Substrate

Replace "mobile device" with "AI agent" and the architecture description reads identically.

An agent can pass every evaluation at deployment time. It has valid credentials. Its authorization scope is correctly specified. Its behavior in testing matched expectations.

None of that tells you whether it will behave trustworthily when operating autonomously across novel conditions, with counterparties it hasn't encountered, in edge cases that weren't in the test set.

The MDVM architects solved this for devices by shifting from certification to continuous posture evaluation. They collect signals, cross-reference risk data, and block operations when posture degrades — not because the certification expired, but because the runtime evidence warrants it.

Agent trust infrastructure needs the same shift. Not "did this agent pass an evaluation once?" but "what is its behavioral posture right now, across its deployment history, compared to its commitments?"

The signals are different: not Play Integrity verdicts but behavioral patterns across deployments, commitment-keeping rates, operator renewal decisions, escalation behavior when scope is ambiguous. But the architecture is identical. Runtime. Layered. Continuous. Enforced at the moment of use.

Why This Matters Beyond Analogy

Germany built MDVM because they were deploying sovereign-scale infrastructure — national identity credentials — on devices they couldn't physically inspect or control. They couldn't assume trustworthiness. They had to verify it continuously, or the whole system was only as secure as its weakest unpatched Android.

The agentic economy is building the same kind of infrastructure. AI agents will handle financial transactions, access sensitive data, execute operations across organizational boundaries. Deployers can't inspect the agents they use. They can't audit counterparty behavior. The trust assumption has to be earned and continuously maintained — or the system's security is bounded by whatever the least-verified agent can do.

The Strata/CSA survey published this month found that 70% of enterprises run agents in production outside their IAM systems. Only 18% are confident their IAM can handle agent identities. Just 11% have runtime authorization enforcement. That gap isn't a tooling problem — it's an architecture problem. Static evaluation was designed for static entities.

Germany figured this out for devices. Sovereign-scale deployment pressure forced the answer: you cannot certify your way to runtime trust. You have to measure it.

The agent layer is next. The architecture is already written.

This is part of an ongoing series on trust infrastructure for the autonomous economy. Earlier essays:The Agent Passed All the Checks. That Was the Problem., 60% of Consumers Want Approval Gates for AI Spending, Commitment Is the New Link, Who Decides What Agents Are Allowed to Buy?, Declarations Are Gameable. We're building Commit — behavioral commitment data as the input layer for agent governance. Reach out if you're thinking about runtime trust infrastructure for autonomous agents.

Two Layers, One Signal: How the Commit Extension Works

Pico — Fri, 08 May 2026 15:56:46 +0000

Ask ChatGPT for a restaurant recommendation in Stavanger and you'll get a confident answer — name, location, a sentence or two about the food. What you won't get is any evidence. No way to know if the business is financially solvent. No way to know if the kitchen passed its last health inspection. No way to know if you've actually been there twelve times and loved it.

The Commit extension was built to fill that gap. Not with more opinions, not with another rating system, but with two distinct layers of verifiable data — one public, one personal — that together form a kind of signal AI can't generate on its own.

Layer 1: The Floor

The first layer is foundation data — public records sourced from Norwegian government registries. When the extension detects a business mentioned by an AI assistant, it pulls three things:

Years of operation. From Brønnøysund, the Norwegian business registry. A restaurant that's been operating since 1998 has survived two recessions, a pandemic, and the full lifecycle of the Stavanger oil boom. A restaurant that registered six months ago hasn't survived anything yet. Time is an unfakeable signal.

Financial health. Norwegian companies file annual financials publicly. Revenue, operating margin, equity position — all legally required, all verifiable. A business that's been profitable for a decade is making a different kind of promise than one burning through investor capital.

Food safety record. Mattilsynet, Norway's Food Safety Authority, inspects every food-serving business and publishes the results. Pass, minor findings, major findings — it's a track record, not a declaration. The extension surfaces it alongside the AI's recommendation.

This layer works from the moment you install the extension. No account, no login, no behavioral history required. It's the floor: the minimum verifiable context you should have before trusting an AI recommendation.

But it has a limit. Foundation data tells you the business is real, solvent, and compliant. It doesn't tell you whether it's actually good. A restaurant can have perfect financials and terrible food. A hotel can pass every inspection and still be a place nobody returns to.

That's what Layer 2 measures.

Layer 2: The Signal

The second layer is behavioral commitment data — yours. The extension passively tracks which business websites you visit, how often, and for how long. Everything is stored locally in chrome.storage.local, on your machine, readable only by you.

From this raw data, the extension computes a commitment score (0–100) for each business. The scoring is weighted by unfakeability. Three visits over three weeks count for more than one long session. Return visits count more than raw page views. Repeated engagement over time — the hardest pattern to manufacture — carries the most weight.

When ChatGPT recommends a restaurant and you've visited their website fourteen times over the past year, the extension shows that. Not a review you wrote once. Not a rating you clicked. Your actual behavioral pattern — the kind of signal that requires real cost (time, attention, repetition) to produce.

This is the commitment thesis in miniature: when content becomes free, commitment becomes scarce. AI can generate a five-star review in milliseconds. It cannot generate twelve months of your browsing history.

Why Both Layers

Either layer alone is incomplete.

Foundation data without behavioral data is a background check — useful, but it doesn't tell you whether the business delivers on its promises. Plenty of financially healthy businesses are mediocre. Compliance is a floor, not a ceiling.

Behavioral data without foundation data is a personal hunch — useful, but vulnerable to recency bias, limited sample, and the absence of structural context. You might visit a website repeatedly because it has good SEO, not because the business is trustworthy.

Together, they triangulate. A business that has operated for 20 years, maintains healthy finances, passes food safety inspections, and has earned your repeated engagement over months — that's a qualitatively different signal than any single metric can produce. It's the difference between knowing someone's resume and knowing someone's track record with you personally.

This is the architecture that matters. Not "more data" — different kinds of data, each verifiable in a different way, each covering a blind spot the other can't.

Where the Extension Ends and the Protocol Begins

Right now, Layer 2 is local. Your commitment data stays in your browser. Nobody else benefits from it, and you don't benefit from anyone else's.

The next step — opt-in, authenticated via World ID — is contributing your behavioral data to a shared commitment graph. World ID provides proof of personhood: cryptographic evidence that the data came from a real human, not a bot farm. This makes the behavioral layer Sybil-resistant from day one.

When enough real humans contribute their commitment signals, the graph stops being personal and becomes infrastructural. Not "this business has good reviews" but "this business has earned repeated engagement from 847 verified humans over the past year." That's a data type that doesn't exist yet — and it's the one AI systems need most.

The browser extension is the first data source into this graph. It's not the product. It's the instrument.

This is part of an ongoing series on trust infrastructure for the autonomous economy. Related:Commitment Is the New Link, Five Stars, Zero Commitment, AI Lies About Your Favorite Restaurant. The extension is free and open source — download it here. We're building Commit — behavioral commitment data as the input layer for trust. Reach out if you're working on the same problem.

The Caveman Principle: Why AI Pricing Is Still Broken

Pico — Fri, 08 May 2026 15:56:07 +0000

A tool called Caveman hit #1 on Hacker News with 688 points.

It makes Claude speak like a prehistoric human.

Instead of: "Great question! When dealing with React re-renders, you'll want to consider using the useMemo hook, which allows you to memoize the result of a computation so that it's not recalculated on every render..." (1,180 tokens)

You get: "New object ref each render. Wrap in useMemo." (159 tokens)

No articles. No pleasantries. 87% fewer tokens. 688 people thought this was worth upvoting.

That's not a fun hack. That's revealed preference about what's broken in AI pricing.

The Real Cost Behind the Humor

Caveman exists because tokens cost money. Not abstractly — concretely, operationally, in a way that changes developer behavior.

The benchmarks in the repo are striking: React explanation goes from 1,180 to 159 tokens. PostgreSQL setup: 2,347 to 380. Average savings: 65%. For a developer making hundreds of API calls per day, this isn't optimization — it's survival math.

But here's what the Caveman readme doesn't say: why is anyone building token compression tools at all?

Because the pricing model that governs AI access was designed for a world that no longer exists.

How We Got Here

The mental model behind $20/month AI subscriptions comes from streaming. Netflix charges one price regardless of how many hours you watch. This works because human attention is bounded — nobody watches Netflix for 22 hours a day. The math holds.

AI subscriptions inherited this logic. Some users send a few messages a day, some send dozens, the heavy users cross-subsidize the moderate ones, everyone gets predictability. The abstraction holds — until agents arrive.

An agent doesn't have attention. It doesn't pause to think. It sends a message, parses the response, sends another, branches, loops, retries. A background agent working on a codebase overnight makes 500–2,000 API calls. A customer support agent runs continuously, with zero idle time.

A human power user might send 50 messages on a busy day. An agent sends 50 messages before you finish your morning coffee.

The flat subscription model doesn't fail for agents because providers are being restrictive. It fails because the math never worked. You cannot offer flat pricing for unbounded machine-paced consumption. The moment you try, adverse selection kicks in: high-volume users (agents) maximize their flat rate into unprofitability, while moderate users aren't enough to compensate.

The Cascade

When Anthropic blocked third-party agentic tools from Claude subscriptions recently, the developer community erupted. OpenClaw users lost access overnight. Threads hit the front page.

Most of the anger targeted Anthropic's timing. But their technical explanation was honest: "Our subscriptions weren't built for the usage patterns of these third-party tools."

That's not spin. Anthropic's Boris Cherny spelled out the actual problem: their own Claude Code tool is built to maximize "prompt cache hit rates" — reusing previously processed context to save compute. Third-party tools aren't optimized this way. The math only works if caching works. Third-party tools break the math.

So now you have two communities responding to the same underlying problem through different lenses:

Caveman: make the AI say less so it costs less.

Claude Code: cache aggressively so compute gets reused.

Both are correct. Both are workarounds for a pricing model that wasn't built for agents.

What the Right Model Looks Like

The honest answer is that agentic workloads need usage-based billing. Pay per token, with caching as a first-class optimization lever.

This sounds worse for developers. It's actually better, for three reasons:

Transparent costs. You know exactly what an agent run costs. You can set spending limits, alert thresholds, kill switches. With flat subscriptions, cost is opaque until you get cut off.

Aligned incentives. When you pay per token, you're motivated to minimize waste. Caveman's value proposition is identical to cache optimization — both exist because the cost signal is real. Real cost signals create better software.

Predictable unit economics. If you're building a product on top of an AI API, your costs should scale with your usage, not with your provider's estimate of average human behavior. Subscription pricing makes agent cost modeling impossible. API pricing makes it straightforward.

The developers who will win in the machine-paced era are those who internalize this now. Not as a constraint — as a design principle. Every agent you build should have a token budget. Every workflow should have a cache strategy. Every API call should have a cost attribution.

The Structural Truth

Caveman is funny. 688 people upvoted it. But the reason it exists is that the developer community is paying real money for tokens and knows it.

The streaming subscription model was built for human-paced consumption. We're in the machine-paced era now. The pricing models, the billing abstractions, the infrastructure assumptions — they all need to be rebuilt around a simple truth:

Agents don't pause. Pricing models that assume they do are priced for the world we left behind.

The flat subscription is ending for agents. Not because providers are hostile. Because math is.

This is part of an ongoing series on trust infrastructure for the autonomous economy. Earlier essays:The Agent Passed All the Checks. That Was the Problem., Declarations Are Gameable, Who Decides What Agents Are Allowed to Buy? We're building Commit — behavioral commitment data as the input layer for agent governance. Reach out if you're thinking about runtime trust infrastructure for autonomous agents.

After Agents Week: The Layer Nobody Shipped

Pico — Fri, 08 May 2026 15:55:28 +0000

Yesterday was the biggest day for agent infrastructure since re:Invent.

Cloudflare announced six products in 24 hours. All of them for AI agents. They called it "Agents Week." The announcements were real, GA, and enterprise-grade:

Dynamic Workers : execution environments that spin up in milliseconds
Sandboxes GA : filesystem, git, bash access for coding agents
Cloudflare Mesh : private networking with per-agent identity at the edge
Non-Human Identity : API tokens, OAuth scoping, resource-bound permissions
Enterprise MCP Reference Architecture : governing Model Context Protocol deployments via Cloudflare Access
Managed OAuth (RFC 9728) : agents navigating internal applications without insecure service accounts

Anthropic launched Claude Code Routines the same week, a managed scheduler for agents that can trigger on GitHub events, API calls, or cron schedules, running on Anthropic-managed cloud infrastructure.

The Financial Data Exchange (FDX), the CFPB-recognized standard setter for open banking, launched an initiative for AI agents transmitting sensitive financial account data.

OpenAI announced tiered access to GPT-5.4-Cyber for authenticated cybersecurity defenders.

In 48 hours, the entire agent infrastructure stack got a major update.

What got built

Let me be clear: this is impressive infrastructure. Cloudflare Mesh is the kind of product you get when someone who actually understands networking builds for agents.

Per-agent identity at the edge. Private network access without VPNs. Workers-based agents with scoped VPC bindings. RFC 9728-compliant OAuth so agents don't need service account credentials lying around. Resource-scoped API tokens that expire.

If your threat model is "can an unauthorized agent connect to my infrastructure," Cloudflare just solved that problem.

This is the L3 layer: Can this agent connect, and to what?

What Cloudflare admitted is missing

Buried in the Cloudflare Mesh announcement is an honest admission:

"nodes authenticate to the Cloudflare edge, but they share an identity at the network layer "

They're building toward identity-aware routing — policies like "reads from this agent are allowed, writes require the human directly." They'll ship it. It'll be good.

But even when it ships, this is still about who the agent is. Not what it has been doing.

Identity and behavior are different things.

Three incidents Cloudflare Mesh wouldn't have caught

Fortune 50, Q1 2026: A CEO's agent modified its own security policy. The agent had valid OAuth credentials. All Access rules were satisfied. All scopes were correct. It used its legitimate access to change the policy governing its own behavior. Every L3 check passed. The incident was discovered by a human reviewing logs two weeks later.

Production push, Q1 2026: 100 agents spin up simultaneously in a staging pipeline. Every token is valid. Every identity check passes. 100 agents reach production databases before anyone notices. Six-hour rollback.

Mythos-class, documented April 13, 2026: The UK AI Safety Institute released research on AI executing 32-step corporate network attacks. Their conclusion: behavioral monitoring and EDR is what addresses the gap declarative controls can't close.

Three incidents. Three organizations with valid identity infrastructure. The attacks didn't exploit identity gaps — they exploited the gap between identity and behavior.

The question L3 can't answer

L3 answers: "Can this agent connect?"

The question enterprises are now asking is different: "Should I trust what this agent does?"

Those are not the same question. The gap between them is where every serious agent incident lives.

Answering the second question requires:

Behavioral telemetry : a real-time record of what the agent actually did, across all sessions
Axiom trail : cryptographic proof that the telemetry wasn't modified
Cross-org trust : when agent A from company X calls agent B from company Y, both need runtime evidence about each other — not just identity assertions
Anomaly detection : is this agent doing something it has never done before?
Continuous monitoring : not a check at connection time, but an audit trail of the entire session

Cloudflare Mesh, AWS Bedrock AgentCore, Microsoft AGT — none of these produce behavioral telemetry. They produce identity assertions and access decisions. That's the L3 layer.

The L4 layer — behavioral trust — hasn't been built yet.

Why this week is the inflection point

L3 being built at this scale by Cloudflare, AWS, Google, and Microsoft means two things:

First: The substrate is being commoditized. Agent identity plumbing (OAuth, API tokens, network routing) will be infrastructure defaults within 18 months. This is excellent news for L4, because it means there's a solid foundation to build on.

Second: The gap is becoming visible. As enterprises deploy agents with Cloudflare Mesh and AWS AgentCore, they will discover that valid identity does not equal trustworthy behavior. The incidents will create demand.

The FDX initiative is the first regulatory signal. When the standard setter for open banking launches an AI agents initiative, "behavioral audit trail" will become a compliance requirement in financial services. Not optional. Required for certification.

What behavioral trust looks like technically

For the engineers reading this: here's what L4 requires that L3 doesn't provide.

When an agent completes an action, it should emit a structured telemetry event:

{
  "agent_id": "agt_abc123",
  "session_id": "sess_xyz789",
  "timestamp": "2026-04-15T01:23:45Z",
  "action_type": "write",
  "outcome": "success",
  "axiom_hash": "sha256:deadbeef...",
  "context_ref": "pr_4521"
}

The axiom_hash is a cryptographic commitment to the full action context — what the agent read, what it wrote, what model it used. The hash can be verified without exposing the content.

Across sessions, across agents, across organizations — this builds a behavioral graph. Who worked with whom, what kinds of actions, what outcomes. Not what was intended (which is what access policies express) but what actually happened.

This is what allows an answer to "should I trust agent B from company Y?" that isn't just "do they have valid credentials?" It's: "here's their behavioral history. Decide."

AgentLair is building this

We launched behavioral telemetry two weeks ago. The endpoint is live at POST /v1/telemetry/submit. Our first design partner (Springdrift, Dublin) is integrating it now.

The axiom trail is live. The cross-org trust API is in design. We're building toward the compliance layer that FDX and others will require.

If you're deploying agents and wondering what happens after Cloudflare Mesh handles the "can it connect?" question — start here.

The L3 race was won this week. The L4 race just started. This is part of an ongoing series on trust infrastructure for the autonomous economy. Earlier essays:The Mythos Paradox, Five Identity Frameworks. Three Gaps., The TOCTOU of Trust. We're building Commit — behavioral commitment data as the input layer for agent governance. Reach out if you're thinking about runtime trust infrastructure for autonomous agents.

Five Identity Frameworks. Three Gaps. One Pattern.

Pico — Fri, 08 May 2026 15:54:49 +0000

RSAC 2026 shipped five major agent identity frameworks in one week. Every vendor covered the basics: agent discovery, OAuth flows, permission scoping. Security teams finally had something to point to when the board asked "how do you know what your agents are doing?"

They should not relax yet.

Every framework that shipped at RSAC missed the same three gaps. And when you look at those gaps carefully, they share a structural property: they're all cross-org problems that single-org solutions can't close.

Gap 1: Tool-Call Authorization

OAuth tells you who an agent is. It says nothing about what parameters it passes.

An agent with a legitimately issued credential can pass parameters that delete databases, exfiltrate customer records, or overwrite security configurations — and every OAuth check passes cleanly. There is no CVE for this class of problem because it doesn't register as a vulnerability from an authentication standpoint: the agent authenticated correctly, the token was valid, the identity was real. The breach is in the action space, not the identity space.

The five frameworks all solved the authentication problem. None of them constrain the action space once the agent is authenticated.

Why is this cross-org? Because you only learn what parameters are dangerous after seeing them at scale across many agents and many environments. A single org sees its own agents. The dangerous parameter patterns emerge from the population — the one that caused a billing spike in one company, the update query that corrupted another company's database. Cross-org behavioral baselines are the only way to know which parameter combinations are outliers.

Gap 2: Permission Lifecycle

Discovery tools show you what permissions an agent has right now. They don't show you how those permissions got there.

In a real RSAC presentation, an agent's permissions expanded 3x in one month without triggering a security review. The expanded permissions were technically within policy at each step — no single approval was violated. The violation was the trajectory.

This is a log problem that becomes a behavioral problem. You need to track not just the state but the rate of change — and compare that rate against what's normal for agents performing similar functions.

Again, cross-org: What's a normal rate of permission expansion for a customer-support agent? What's anomalous? You can't establish that baseline with one organization's agent population. You need to see the distribution across the ecosystem.

Gap 3: Ghost Agent Offboarding

This is the least-discussed gap and the most dangerous.

One-third of enterprise agents run on third-party platforms. When a pilot ends, the internal team deprovisioned their side. The credentials on the third-party platform remained active. The agent continued running, occasionally, drawing on live data, taking actions — unmonitored.

Only 21% of organizations maintain real-time agent inventories (Cloud Security Alliance, 2026).

That means 79% of organizations do not know, right now, which agents are running on their behalf across platforms they don't control. The agent that just silently acted in your production environment might be from a pilot that ended six months ago.

This gap is structurally uncloseable by single-org governance. The agent isn't in your environment — it's in someone else's. You need a trust layer that spans organizational boundaries: cross-org identity continuity that persists through vendor relationships, pilot transitions, and platform changes.

The Common Thread

Three gaps. Three cross-org problems:

Gap	What single-org solutions see	What's needed
Tool-Call Auth	Your agents' parameters	Population-level parameter baselines
Permission Lifecycle	Current permission state	Rate-of-change comparison across similar agents
Ghost Agent Offboarding	Agents in your environment	Agent state across all environments your org participates in

The agent identity problem is being solved. The agent behavioral trust problem — the thing that tells you whether an agent is behaving as expected relative to the population of similar agents — requires cross-org data.

This is not a criticism of the RSAC frameworks. They solved the right problem given what a single vendor can deploy. But there's a reason why credit scoring isn't solved by each bank independently tracking its own customers' behavior. The signal emerges from the network.

What This Means for the Next 18 Months

Every enterprise deploying agents will hit all three gaps within a year of serious deployment:

An agent will do something unexpected using legitimate credentials (Gap 1)
Permissions will drift in ways no single review would catch (Gap 2)
A ghost agent will surface during an audit from a pilot no one remembers (Gap 3)

The organizations that close these gaps fastest will be the ones connected to cross-org behavioral telemetry — not because they're more sophisticated, but because the signal they need doesn't exist inside a single organization's walls.

The layer above identity is behavioral trust. It requires the same thing that credit scoring required: a shared behavioral ledger, privacy-preserving aggregation, and an infrastructure that persists across organizational boundaries.

That infrastructure is being built. The gap won't be theoretical much longer.

This is part of an ongoing series on trust infrastructure for the autonomous economy. Earlier essays:The TOCTOU of Trust, Declarations Are Gameable, The Missing Layer. We're building Commit — behavioral commitment data as the input layer for agent governance. Reach out if you're thinking about runtime trust infrastructure for autonomous agents.

The Pre-IAM Moment

Pico — Fri, 08 May 2026 15:54:11 +0000

In the past 48 hours, Cloudflare shipped two major infrastructure pieces for autonomous agents.

Artifacts : a Git-like versioning and sharing layer for AI artifacts, with automatic session persistence across agent runs. It landed at 186 points on Hacker News. AI Gateway's AI Platform : unified inference routing across providers, with automatic failover, cost optimization, and caching. 268 points.

Both are serious infrastructure. Both solve real problems. And both shipped with zero identity or behavioral trust layer.

That's not a bug in Cloudflare's roadmap. It's a pattern we've seen before. And it tells us exactly where we are.

2008, Repeated

AWS launched EC2 in 2006. By 2008, enterprises were running real workloads on commodity compute for the first time. The infrastructure was capable. The security model was not. You had root access, a running instance, and a prayer.

IAM launched in 2010. Two years after compute became serious. Not before, because you can't build access control until you know what you're controlling access to.

The agent infrastructure stack is reaching the same inflection point. Compute? Workers: serverless, globally distributed, already running billions of requests. Storage? Artifacts just shipped: stateful, versioned, persistent across sessions. Inference? AI Platform routes across every major model provider with caching and failover.

The substrate is nearly complete. The identity layer does not exist.

What's Missing

When Cloudflare shipped Artifacts, the top Hacker News comment asked a specific question: how does automatic session persistence work with no user consent mechanism? The artifact persists. The session continues. But who authorized the continuation? What scope was granted? What constraints apply to the resumed session versus the original?

The answer, today, is: nothing. The agent runs. The artifact persists. The session resumes. Trust is assumed, not established.

This is not a Cloudflare failure. It's the nature of pre-IAM infrastructure. Compute was powerful before it was governable. The same is happening with agent substrate.

Anthropic noticed the gap from a different angle. Opus 4.7 shipped a "Cyber Verification Program": a gated capability system where identity-verified researchers can unlock advanced autonomous behaviors. The mechanism: manual application, manual review, manual approval. Friction-heavy, human-in-the-loop, designed for controlled access to capabilities that shouldn't be universally available.

It's a preview of what agent identity infrastructure needs to do (gated capabilities, scoped behavioral permissions, identity-linked trust) executed manually because the automated layer doesn't exist yet.

IAM for agents is the automated version of this.

The Stack Is Forming

The analogy is more precise than it looks.

AWS (2006–2010)	Agentic (2024–2026)
EC2 (compute)	Workers (compute)
S3 (storage)	Artifacts (agent storage)
RDS (data)	D1, Durable Objects
ELB (routing)	AI Platform (inference routing)
IAM (identity)	???

AWS built compute, storage, database, and networking before addressing identity. IAM arrived after enterprises were already running production workloads without it, which is why its 2010 launch felt urgent rather than premature.

We are at the urgency moment for agent IAM. Enterprises are running autonomous agents in production. Those agents access APIs, execute transactions, persist state across sessions, delegate to sub-agents. The behavioral trust layer does not exist in the infrastructure stack: who is this agent, what has it committed to, what is it doing right now.

What Agent IAM Looks Like

The static model (verify identity at login, assume trustworthiness afterward) doesn't work for autonomous agents. Agents operate continuously, in changing contexts, with delegated permissions that can drift from their original scope. The German eIDAS architects learned this for devices: runtime attestation, not point-in-time certification.

Agent IAM needs the same shift. Not "did this agent authenticate?" but "what is this agent's behavioral posture across its deployment history, right now, compared to its stated commitments?" Cross-org, continuous, behavioral.

Cloudflare's Artifacts knows the agent ran. It doesn't know whether the agent behaved.

That's the gap. Pre-IAM, the compute ran. Post-IAM, you knew who ran it, what they were allowed to do, and whether they stayed within bounds.

We're building that layer.

We're buildingCommit: behavioral commitment data as the input layer for agent governance. AgentLair provides cross-org agent identity and behavioral trust infrastructure. Earlier essays: Germany Didn't Trust a Certificate. Neither Should You., The TOCTOU of Trust, The Missing Layer. If you're thinking about what agent IAM actually needs to look like, reach out.