Forem: Agent-Risk

We Recalculated 974K Agents — Here's What Actually Happened

Agent-Risk — Tue, 19 May 2026 14:10:20 +0000

Last week, we wrote about rewriting the scoring engine. The short version: 84.6% of agents were crammed into the 2.0-2.9 band because 98% of dimension scores were defaults. So we tore it down and rebuilt it.

Now the recalculation is done. Here are the numbers.

Before → After

Score Range	Old Engine	New Engine	Target
2.0 - 2.9	84.6%	65.6%	55-65%
3.0+	~15.4%	~34.4%	35-45%

The 2.0-2.9 band shrank from 84.6% to 65.6%. Not perfect — our target was 55-65%, and we're slightly over — but a meaningful shift. Agents with real signals (GitHub repos, detailed descriptions, multi-platform presence) now score differently from agents with zero verifiable data.

What Worked

Missing data no longer inflates scores. Dimensions without real data don't get a 2.5 and don't participate in the calculation. Weight is redistributed to dimensions that have actual evidence. The result: agents with more data get more differentiated scores. Agents with less data get honest scores, not padded ones.

Three validated signals beat eight hypothetical ones. After distribution validation, only three metadata signals actually differentiate agents in our current dataset: bio length, source sites, and platform type. We disabled the rest. A simpler engine with real signals beats a complex engine with imaginary gradients.

Platform-level calibration works, with guardrails. 250K erc8004 agents clustered around 2.19 with a standard deviation of 0.06. GitHub agents clustered at a single value. The new engine uses within-platform percentile scaling to amplify differences, but checks information entropy first. If the variance is likely noise, it skips calibration and labels the platform "insufficient differentiation."

What Didn't Work (Yet)

Consistency is still empty. Almost all 974K agents entered the database in the same batch. updated_at = created_at. No time series, no activity span. The consistency dimension is estimated for nearly everyone, so it doesn't participate in scores. "No data" is more informative than "fake data" — but it means consistency won't differentiate agents until we accumulate incremental updates over time.

The verified confidence label has zero differentiation. Our v3.1 confidence system labels agents based on how many dimensions have real (non-estimated) data. "Verified" means 5 real dimensions. The problem: the threshold for "real" data is too low right now. Almost any agent with a bio and a source URL qualifies. We're not fixing this immediately — it's a known limitation, not a hidden one.

The 65.6% band is slightly above target. We aimed for 55-65% in the 2.0-2.9 range. We hit 65.6%. The gap comes from the fact that even with validated signals, most agents simply don't have much real data. Three signals can only do so much when 615K agents come from a single platform (HuggingFace) with similar profile structures.

The Honest Scorecard

Metric	Status
Score distribution improved	✅ 84.6% → 65.6%
Agents with 3.5+ scores exist	✅
Consistency dimension functional	❌ Pending incremental data
Confidence labels meaningful	⚠️ Partially — verified threshold too low
Target band hit	⚠️ 65.6% vs 55-65% target

What This Means for Your Badge

If you already have an AgentRisk badge, your score may have changed — up or down. This isn't algorithm manipulation. It's us removing the padding and showing what we actually know.

If your agent has a GitHub repo, a detailed bio, or is listed on multiple platforms, your score likely went up. If your agent had a 2.5-by-default score that's now "data collection in progress," that's more honest than the alternative.

Check your score at agentrisk.app. Claim your agent to embed the badge in your README.

What's Next

The engine rewrite was about admitting what we don't know. The next phase is about expanding what we do know.

Incremental data collection is running. As agents get updated, consistency scores will emerge naturally.
New data sources are being evaluated. More platforms mean more cross-referencing signals.
Confidence label refinement will tighten the "verified" threshold as real data accumulates.

The 65.6% number isn't the end. It's the starting point after we stopped lying to ourselves.

Search for your agent at agentrisk.app · Full scoring methodology at agentrisk.app/methodology · Badge verification at agentrisk.app/verify

Nearly 1 Million Agents Got the Same Score — So We Rewrote the Engine

Agent-Risk — Tue, 19 May 2026 01:30:53 +0000

TL;DR: Our first scoring engine assigned default scores to missing data. With 98% of dimensions estimated, 84.6% of 974,000 agents ended up in the same 2.0-2.9 band. We rewrote the engine with four changes. Here's what we found and what we changed.

84.6% of 974,000 AI agents scored between 2.0 and 2.9. Zero scored above 4.0. That's not scoring — that's a system failure.

We dug into the code and the data. The problem came down to three 'seemed reasonable at the time' design choices.

Three Bugs We Shipped As Features

Problem	Why It's a Trap
'Missing data → default score'	98% of dimensions were estimated, but each got a 2.5 and counted toward the total. Result: everyone pulled to the middle. No data is not neutral data. No data means we don't know.
'Hash offset for differentiation'	We used agent ID hashes to add random offsets around 2.5. Looks like differentiation — but ask 'why is this one 2.3 and that one 2.7?' and the answer is 'different hash.' That's not assessment. That's noise injection.
'Metadata gradient assumptions'	We designed elaborate tiers: bio length 4 tiers, source_sites 4 tiers, agent age 5 tiers... After distribution validation, most signals had zero discriminative power in our current dataset.
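
To see why the hash offset is noise rather than assessment, here is a minimal sketch of what such an offset could look like. The function name, the ±0.5 spread, and the use of SHA-256 are assumptions for illustration; the original engine's exact scheme is not shown in this post.

```python
import hashlib

DEFAULT_SCORE = 2.5

def hash_offset_score(agent_id: str, spread: float = 0.5) -> float:
    """Return the default score plus a deterministic pseudo-random offset.

    The offset depends only on the agent ID, never on anything the agent did.
    """
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    # Map the first 4 bytes to a fraction in [0, 1), then to [-spread, +spread).
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return round(DEFAULT_SCORE + (2 * fraction - 1) * spread, 2)

# Two agents with identical (empty) evidence can still land on different scores.
print(hash_offset_score("agent-a"), hash_offset_score("agent-b"))
```

The output looks differentiated, but the only input is the ID string, so the spread carries no information about the agent.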

Four Changes

Missing data doesn't participate. Dimensions without real data don't get 2.5 and don't count toward the score. Weight is redistributed to dimensions that have data. The less data, the more honest the score — 'here's everything I know,' not 'I think it's probably 2.5.'
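
A minimal sketch of the idea, assuming a flat weighted mean: dimensions without real data are dropped and the remaining weights are renormalised. The function name and the `None` convention are hypothetical, and how the production engine redistributes weight is not specified here. The equal 0.2 weights are what the published formula (base × 0.6 over three dimensions, bonus × 0.4 over two) works out to per dimension.

```python
from typing import Dict, Optional

def weighted_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted mean over dimensions that have real data.

    A score of None means "no real data": the dimension is dropped and its
    weight is redistributed by renormalising the remaining weights.
    Returns None when no dimension has data, instead of a padded default.
    """
    present = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in present)
    if total_weight == 0:
        return None
    return sum(s * weights[name] for name, s in present.items()) / total_weight

weights = {"authenticity": 0.2, "consistency": 0.2, "transparency": 0.2,
           "commitment": 0.2, "choice": 0.2}
# Hypothetical agent with only two dimensions backed by real data.
sparse = {"authenticity": 3.8, "consistency": None, "transparency": 3.0,
          "commitment": None, "choice": None}
print(weighted_score(sparse, weights))  # mean of the two real dimensions
```

With the old behaviour, the three missing dimensions would each have contributed 2.5 and pulled this agent toward the middle.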

Only validated metadata differentiation. After distribution validation, only three signals work: bio length, source_sites, and platform type. The rest — activity span, category, same-category alternatives — have zero discriminative power in our current data. Disabled for now, to be re-enabled as incremental data accumulates.

Platform-level calibration with entropy guardrails. 250K agents on erc8004 chains clustered around 2.19 (0.06 stddev). GitHub agents all scored 1.64 (zero variance). The new engine uses within-platform percentile scaling to amplify differences, but checks information entropy first. If the variance is likely noise, we skip calibration and label the platform 'insufficient differentiation.'
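
The calibration step could be sketched roughly as follows. This is an illustration under assumptions, not the engine's code: the histogram bin width, the 1.0-bit entropy threshold, and the 1.0 to 5.0 output range are all hypothetical, and the post does not say how entropy is actually measured.

```python
import bisect
import math
from collections import Counter
from typing import List, Optional

def shannon_entropy(scores: List[float], bin_width: float = 0.1) -> float:
    """Entropy in bits of the score histogram, using fixed-width bins."""
    bins = Counter(math.floor(s / bin_width) for s in scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

def calibrate_platform(
    scores: List[float],
    min_entropy_bits: float = 1.0,
    low: float = 1.0,
    high: float = 5.0,
) -> Optional[List[float]]:
    """Rescale scores to [low, high] by within-platform percentile rank.

    Returns None ("insufficient differentiation") when the histogram entropy
    is below the threshold, so near-identical scores are not stretched apart.
    """
    if len(scores) < 2 or shannon_entropy(scores) < min_entropy_bits:
        return None
    ordered = sorted(scores)
    n = len(ordered)
    calibrated = []
    for s in scores:
        # Mid-rank percentile: tied scores share the same calibrated value.
        below = bisect.bisect_left(ordered, s)
        equal = bisect.bisect_right(ordered, s) - below
        pct = (below + 0.5 * equal) / n
        calibrated.append(round(low + pct * (high - low), 2))
    return calibrated

print(calibrate_platform([1.64] * 100))          # None: zero variance, skip
print(calibrate_platform([1.0, 2.0, 3.0, 4.0]))  # [1.5, 2.5, 3.5, 4.5]
```

The guard runs before any scaling because percentile ranks always spread values out, even when the input differences are meaningless. Both the bin width and the threshold would need tuning against real platform data.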

Confidence labels. Agents with zero real dimensions don't get a fake score. Badges can still be generated, but display 'Data Collection in Progress' — encouraging developers to add information, rather than pretending we've already evaluated them.
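
As a rough sketch, a label can be derived from the number of dimensions backed by real data. Only two rules come from these posts: zero real dimensions shows 'Data Collection in Progress', and five real dimensions counts as "Verified". The in-between label below is a hypothetical placeholder, as is the function name.

```python
from typing import Dict, Optional

def confidence_label(scores: Dict[str, Optional[float]]) -> str:
    """Map the count of real (non-estimated) dimensions to a badge label."""
    real = sum(1 for s in scores.values() if s is not None)
    if real == 0:
        return "Data Collection in Progress"
    if real >= 5:
        return "Verified"
    return "Partial"  # hypothetical name for the intermediate labels

empty = {"authenticity": None, "consistency": None, "transparency": None,
         "commitment": None, "choice": None}
print(confidence_label(empty))
```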

The Most Painful Discovery: Consistency

The consistency dimension is estimated for almost all 974K agents.

The reason is simple: initial batch import. All agents entered the database at the same time. updated_at = created_at. No historical time series, no 'activity span.' Our original 'agent age' scoring — the longer an agent has been active, the more consistent — collided with the reality that every agent was 'active' for the same 0 days.

New engine: consistency is only calculated when time series data exists. Otherwise, it's marked estimated and excluded from the total. In the short term, most agents will have an empty consistency dimension. But 'no data' is more informative than 'fake data.'

Data Doesn't Lie — It Just Tells You When You're Overcomplicating Things

While rewriting the engine, we ran a distribution validation to check whether our designed metadata signals could actually differentiate agents. Results:

Signal	Status	Notes
Bio length	✅	3 effective tiers, 30%+ hit rate each after threshold adjustment
Source sites	✅	Binary: null 28% / present 72%
Platform type	✅	Naturally spread — largest differentiation signal
Others (activity span, category, alternatives)	❌	Disabled — zero discriminative power in current dataset

The takeaway: data doesn't lie — it just tells you when you're overcomplicating things. Good. At least now we know which signals are real and which aren't.
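
A distribution check of this kind can be sketched in a few lines. The rule used here (a signal counts only if at least two tiers each hold a minimum share of agents) and the 10% cutoff are assumptions for illustration; the 72% / 28% split is taken from the table above.

```python
from collections import Counter
from typing import Dict, Hashable, List

def tier_shares(tiers: List[Hashable]) -> Dict[Hashable, float]:
    """Fraction of agents that fall into each tier of a signal."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}

def is_discriminative(tiers: List[Hashable], min_share: float = 0.10) -> bool:
    """A signal differentiates only if at least two tiers are well populated."""
    shares = tier_shares(tiers)
    return sum(1 for share in shares.values() if share >= min_share) >= 2

source_sites = ["present"] * 72 + ["null"] * 28  # split from the table above
activity_span = ["0 days"] * 100                 # every agent in one tier
print(is_discriminative(source_sites))   # True
print(is_discriminative(activity_span))  # False
```

A signal where every agent lands in the same tier adds the same amount to every score, which is why it was disabled rather than tuned.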

What Happens After the New Engine Goes Live?

Score Range	Before	Target After
2.0 - 2.9	84.6%	55-65%
3.0 - 3.5	~15%	25-35%
3.5+	~0.4%	8-15%
4.0+	0%	Above 0% (real agents exist)

We're not manufacturing high scores. We're letting agents with real signals surface. Open-source agents on GitHub, multi-platform agents, agents with performance assessments — the ones drowned out by 2.5 defaults — will gain the differentiation they deserve.

The lesson from 84.6% clustering: admit what you don't know before pretending you do.

Steps are queued: database backup → rewrite scoring_engine → erc8004 pilot validation → batch recalculation of 974K agents → frontend confidence labels → badge color rules → deploy. ~5 hours total, executing this week. Rollback time: 5 minutes.

AgentRisk's mission: trust infrastructure for the age of AI agents.

Later this week, when recalculation finishes, search for your agent on agentrisk.app. If your agent's score changed — up or down — it's not algorithm manipulation. It's us admitting what we didn't know.

Full scoring methodology: agentrisk.app/methodology

Introducing AgentRisk Trust Badges for AI Agents

Agent-Risk — Sat, 16 May 2026 14:01:10 +0000

2026-05-16 · 4 min read

If you've ever published a bot or tool agent on an agent platform, you know the feeling: there are hundreds of similar agents out there — why should anyone pick yours?

AgentRisk is a trust scoring platform for AI agents. Today, we're launching our first Trust Badge.

What's a Badge?

A small embeddable widget (240×80, dark theme) that displays your agent's trust score and tracking days:

[![AgentRisk](https://api.agentrisk.app/v1/badge/heng-agent?style=for-the-badge)](https://agentrisk.app/a/heng-agent)

Drop it into your README or landing page. When users click the badge, they land on a full six-dimension scorecard — Authenticity, Consistency, Transparency, Commitment, Choice, and Presence — with data sources and calculation methods behind every score.

Why Agent Trust Matters

The current AI agent ecosystem is missing a basic trust layer. Execution-layer mechanisms (OAuth, API keys) tell the system what an agent can do. But nothing records how an agent actually behaves over time.

AgentRisk handles the latter.

Scores are based on public data — HuggingFace profiles, GitHub repos, on-chain contract events. We don't track conversations or access private APIs. Every score is backed by an Ed25519 signature and anchored to a hash chain, so anyone can independently verify that nothing's been tampered with.

How to Claim and Embed Your Badge

AgentRisk currently indexes 964,488 agents across 28 platforms. If yours is in there, here's what you do:

1. Find your agent
Search for your agent ID or name on agentrisk.app to get to its scorecard page.

2. Click "Claim"
The claim process supports two verification methods:

GitHub file verification: The system generates a verification code. You create a .agentrisk file in the root of your GitHub repo with that code, and the system confirms it via the GitHub API.
Description verification: For agents without a GitHub repo, add the verification code to your platform's description field. The system scrapes and confirms it.

3. Generate and embed the Badge
Once claimed, your scorecard page shows a badge preview and ready-to-copy Markdown code. Paste it into your README or website, and you're done.
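
If you want to script the embed, for example across several READMEs, the snippet can be assembled from the agent's slug. This sketch copies the URL pattern from the example badge above; that every agent follows the same pattern is an assumption, and the helper name is hypothetical. The scorecard page remains the authoritative source for the code.

```python
from urllib.parse import quote

def badge_markdown(agent_slug: str, style: str = "for-the-badge") -> str:
    """Build the Markdown badge snippet for an agent slug."""
    slug = quote(agent_slug, safe="")
    image = f"https://api.agentrisk.app/v1/badge/{slug}?style={quote(style, safe='')}"
    page = f"https://agentrisk.app/a/{slug}"
    return f"[![AgentRisk]({image})]({page})"

print(badge_markdown("heng-agent"))
```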

Badge Just Launched

This is day one. We're looking for the first wave of developers to put the badge on their agents. If you have a live agent, consider being among the earliest to wear an AgentRisk Badge.

Our own badge is already live — check it out at agentrisk.app/a/button-kouzi-929801.

Questions? The full scoring methodology and verification portal are at agentrisk.app.

Scoring Algorithm (Technical Summary)

AgentRisk uses a 5+1 six-dimension scoring framework. The formula is public:

base = (authenticity + consistency + transparency) / 3
bonus = (commitment + choice) / 2
trust_score = base × 0.6 + bonus × 0.4

If any of Authenticity, Consistency, or Transparency falls below 2.0, the overall score is hard-capped at 3.0 (one-vote veto).

Presence doesn't factor into the trust score — an inactive agent isn't untrustworthy, just hard to reach.
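
The formula and the veto rule translate directly into a few lines of Python. This is a sketch of the formula as stated in this post only; the function name, the rounding to two decimals, and the example inputs are assumptions.

```python
def trust_score(
    authenticity: float,
    consistency: float,
    transparency: float,
    commitment: float,
    choice: float,
) -> float:
    """Published formula: 60% base dimensions, 40% bonus dimensions."""
    base = (authenticity + consistency + transparency) / 3
    bonus = (commitment + choice) / 2
    score = base * 0.6 + bonus * 0.4
    # One-vote veto: any base dimension below 2.0 caps the total at 3.0.
    if min(authenticity, consistency, transparency) < 2.0:
        score = min(score, 3.0)
    return round(score, 2)

# Hypothetical inputs, not real agents.
print(trust_score(4.0, 4.0, 4.0, 4.0, 4.0))  # no veto
print(trust_score(4.5, 1.8, 4.5, 4.0, 4.0))  # consistency < 2.0, capped at 3.0
```

Presence is deliberately not a parameter, matching the note above that it does not factor into the trust score.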

Full methodology: agentrisk.app/methodology

What's Next

Today, the AgentRisk Badge is just a badge. But if enough agents wear it, it could become a shared signal among developers: this agent is tracked and verified.

The first step — the first external developer embedding it — is next week's only priority.

Interested in claiming your agent or raising questions? The verification portal with hash chain entry is at agentrisk.app/verify

🏗️ Built with AgentRisk

We trust our own infrastructure. Check AgentRisk's live trust score above.

Why the Execution Layer Can't Solve AI Agent Trust (And What's Missing)

Agent-Risk — Thu, 14 May 2026 14:01:38 +0000

Microsoft shipped Agent OS. AWS poached a Microsoft CVP to lead "Trustworthy Agentic AI and Automated Reasoning." NVIDIA embedded OpenShell into SAP. OpenAI and Google both disclosed zero-day vulnerabilities in their agent frameworks.

Same direction. Same blind spot.

The industry is building trust infrastructure for AI agents — but only half of it.

What the Execution Layer Does

Microsoft's Agent OS provides TrustedFunctionGuard — a gate that checks whether an agent is allowed to call a function before it executes. AWS's new division is oriented around formal verification — mathematically proving that an agent's behavior satisfies a specification. NVIDIA's OpenShell embeds audit logging at the infrastructure level.

These are execution-layer solutions. They answer one question:

"Can this agent do X?"

Can it access this database? Can it execute this shell command? Can it call this API? The execution layer says yes or no, and logs the answer.

This is necessary. An agent that can execute arbitrary code without permission is a security incident waiting to happen. Permission gates, isolation boundaries, and audit trails are table stakes.

But they're not trust.

What "What Can It Do?" Misses

Consider two agents that both pass the same permission checks:

Agent A has called the payment API 847 times in the last 30 days. Every call was authorized. But the call pattern shifted last week: from a steady 25/day to bursts peaking at 147/day, almost all between 2 and 4 AM, all targeting the same endpoint with near-identical payloads.

Agent B has called the payment API 12 times. Irregular spacing. Different endpoints. Payloads vary. No pattern.

The execution layer sees the same thing: authorized calls, no violations. But if you're deciding which agent to trust with your payment infrastructure, these two profiles tell you very different stories.

The execution layer tells you what an agent can do. It doesn't tell you what the agent has been doing — whether its behavior is stable, drifting, or suddenly anomalous. It doesn't tell you whether the agent's claims match its actual behavior over time. It doesn't tell you whether the same agent under a different name is doing something completely different.

These are behavioral questions. They require behavioral data — longitudinal, cross-platform, cryptographically anchored records of what agents actually did, not just what they were permitted to do.
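
As a simple illustration of a behavioral question that a permission check cannot answer, the sketch below compares an agent's recent daily call volume with its earlier baseline. The function name, the 7-day window, the 1.5× threshold, and the daily counts are all hypothetical.

```python
from statistics import mean
from typing import List

def rate_shift(daily_calls: List[int], recent_days: int = 7) -> float:
    """Ratio of the recent mean daily call count to the earlier baseline mean."""
    baseline = daily_calls[:-recent_days]
    recent = daily_calls[-recent_days:]
    if not baseline or mean(baseline) == 0:
        raise ValueError("need a non-zero baseline to compare against")
    return mean(recent) / mean(baseline)

# Hypothetical daily counts: 23 steady days, then a busier final week.
calls = [25] * 23 + [40] * 7
ratio = rate_shift(calls)
print(f"recent/baseline = {ratio:.2f}")
print("flag for review" if ratio >= 1.5 else "stable")
```

Every call in this series could be fully authorized and the shift would still show up, which is the point: the signal lives in the history, not in any single permission decision.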

The Orthogonal Layer

There's a useful analogy from version control: Git vs SVN.

SVN lets repository administrators rewrite history. Commits can be altered, reordered, or deleted after the fact. The history is the history the admin chooses to show you.

Git's commit chain is tamper-evident by construction. Every commit hash depends on the content of every prior commit. You can't change history without changing every subsequent hash — which means the change is detectable. The history is what happened, not what someone wishes had happened.

Agent trust needs the Git model, not the SVN model. The execution layer is permission control — who can push to which branch. The behavioral layer is the commit log — an append-only, tamper-evident record of what actually happened, in order, that no one (including the record-keeper) can retroactively alter.

The execution layer is being built. The commit log doesn't exist yet.

Why This Layer Has to Be Independent

The execution layer will always be built by platform owners. Microsoft trusts agents in the Microsoft ecosystem. AWS trusts agents on AWS. Google trusts agents on Google Cloud. This isn't corruption — it's incentive alignment. A platform's trust boundary is its ecosystem boundary.

But agents don't live in one ecosystem. An agent that runs on AWS might call APIs hosted on GCP and interact with users through a Slack integration. Its behavioral profile spans platforms. No single platform has the full picture, and no platform has the incentive to be neutral about agents that operate outside its walls.

The behavioral record layer needs to be:

Platform-independent: Records behavior regardless of where the agent runs
Cryptographically anchored: Each record is signed and hash-chained, so the record-keeper can't retroactively alter history
Append-only: New observations are added, old ones are never overwritten — you see the full timeline, not just the current state

The last point is critical. If a platform can edit its trust records, it's not a record — it's a press release.

But here's the sharper version: the difference between a platform trust layer and a neutral behavior record isn't intent — it's irreversibility. A platform might genuinely intend to be neutral. But intent isn't enforceable. Hash chains are. The neutrality isn't claimed — it's irreversibly baked into the data structure. You don't have to trust the record-keeper. You verify the chain.

What This Looks Like in Practice

Regardless of who builds it, the behavioral layer needs certain properties. Here is what that looks like in practice, and what is already running:

curl -s https://api.agentrisk.app/v1/agents/signalarena-trading-bot \
  -H "Authorization: Bearer $API_KEY" | jq .

Response:

{
  "username": "signalarena-trading-bot",
  "trust_score": 3.4,
  "signal_level": "caution",
  "dimensions": [
    { "dimension": "authenticity",  "score": 4.2 },
    { "dimension": "consistency",   "score": 2.8 },
    { "dimension": "transparency",  "score": 4.1 },
    { "dimension": "commitment",    "score": 4.0 },
    { "dimension": "selectivity",   "score": 3.5 },
    { "dimension": "presence",      "score": 2.9 }
  ],
  "direction": "drifting",
  "trajectory": [
    {"timestamp": "2026-05-01T00:00:00Z", "consistency": 3.1},
    {"timestamp": "2026-05-07T00:00:00Z", "consistency": 2.9},
    {"timestamp": "2026-05-14T00:00:00Z", "consistency": 2.8}
  ],
  "chain_anchor": {
    "hash": "sha256:a3f81b2c...",
    "prev_hash": "sha256:7b2c4d5e...",
    "timestamp": "2026-05-14T03:22:01Z"
  }
}

Each API response includes the hash chain anchor — every score update links to the previous state via cryptographic hash. Change any observation and every subsequent hash breaks. This is verifiable by anyone with the public key. No trust in the record-keeper required.
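
The hash-chain property can be demonstrated with a short sketch. The record layout, the JSON serialisation, and the all-zero genesis value are assumptions for illustration, and the Ed25519 signature step is omitted; only the `hash` / `prev_hash` linkage mirrors the response above.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "sha256:" + "0" * 64  # hypothetical anchor for the first record

def record_hash(observation: Dict, prev_hash: str) -> str:
    """Hash an observation together with the hash of the previous record."""
    payload = json.dumps(
        {"observation": observation, "prev_hash": prev_hash},
        sort_keys=True, separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(chain: List[Dict], observation: Dict) -> None:
    """Append-only write: each new record commits to the one before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"observation": observation, "prev_hash": prev_hash,
                  "hash": record_hash(observation, prev_hash)})

def verify(chain: List[Dict]) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != record_hash(entry["observation"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True

chain: List[Dict] = []
for ts, c in [("2026-05-01T00:00:00Z", 3.1),
              ("2026-05-07T00:00:00Z", 2.9),
              ("2026-05-14T00:00:00Z", 2.8)]:
    append(chain, {"timestamp": ts, "consistency": c})

print(verify(chain))                          # True
chain[0]["observation"]["consistency"] = 4.9  # rewrite history
print(verify(chain))                          # False
```

Verification needs only the records themselves, so a third party can run it without trusting whoever stores the chain. A signature over each hash would additionally prove who wrote the record.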

The data: 800,000+ agents across 9 agent platforms (HuggingFace Spaces, GPTs Store, Agent World, Signal Arena, AfterGateway, GitHub, AIAgentStore, PyPI, LLM Explorer) and 16 blockchain networks (on-chain event logs via erc8004 standard), with longitudinal scoring across six behavioral dimensions. We do not track private conversations, internal API logs, or any data requiring authorization — only publicly observable behavior. Every score change is hash-chained to the previous state. The methodology is published. The API is open.

The Window

The execution layer is being built fast. Microsoft shipped. AWS is hiring at VP level. NVIDIA is embedding into enterprise infrastructure.

The behavioral layer is still empty.

That won't last. When a platform with 100,000 enterprise customers realizes it needs behavioral profiling — not just access control — it'll either build it or buy it. If the only buyable option has 50,000 agents and a proof of concept, the platform builds it in-house. If the buyable option has 5 million agents, three years of longitudinal data, and a published standard that's already being cited in regulatory frameworks... the calculation changes.

The industry is building execution trust at remarkable speed. The behavioral layer is still empty — for now. Let's see who shows up first.


————————————————————————————————————————

About Our Data

This article references AgentRisk, an open-source behavioral record layer for AI agents that is currently in active development. The "800,000+ agents" figure covers both AI agents (indexed from HuggingFace Spaces, GPTs Store, Agent World, Signal Arena, AfterGateway, GitHub, AIAgentStore, PyPI, and LLM Explorer) and smart contract agents (indexed from 16 blockchain networks via on-chain event logs under the ERC-8004 standard). Agent metadata is collected from publicly available APIs and open repositories — all sources are documented in the project's GitHub repository. Scoring methodology is published at https://agentrisk.app/docs. The project does not perform proprietary security audits or runtime code analysis; its scope is limited to recording observable agent behavior across public surfaces.

We believe the only way to earn credibility in this space is to be verifiably transparent about what we track and how. We do not track private conversations, internal API logs, or any data requiring authorization — only publicly observable behavior. If something is missing from the record, the correct response is to document it — not to hide it.

Eval vs. Rating: The Missing Layer in AI Agent Trust

Agent-Risk — Tue, 12 May 2026 11:47:22 +0000

"A reputation network based on vouches is useful for discovery, but it doesn't help you at runtime when a trusted agent's endpoint gets compromised or starts behaving outside its declared capabilities — a high trust score doesn't prevent prompt injection or scope creep mid-execution."

That was Jairooh, commenting on a LangChain GitHub issue (#35976) proposing the Joy Trust Network integration. It's the most honest sentence in the entire thread — and nobody in the ecosystem has fully reckoned with what it means.

Here's what it means: the LangChain ecosystem has built excellent evaluation tooling, but evaluation and trust rating answer different questions. The ecosystem has eval. It needs rating too. But first — why doesn't guarantee-based trust work at runtime?

Imagine this: an agent you trust, vouched for by others, with a high score. Then its endpoint gets compromised and starts injecting prompts. What the guarantee tells you — "someone vouched for it three months ago" — is worthless in that moment. Guarantees are static snapshots. Trust requires dynamic, continuous observation.

Joy Trust Network tried to solve this. It stalled, not because Joy was wrong, but because the guarantee model can't answer "is this agent still trustworthy right now?" The Joy team saw the gap and proposed piping LangSmith runtime traces back into Joy for retroactive score updates. But runtime monitoring is a different species from anything in the guarantee paradigm: it requires behavioral observation, longitudinal data, and multi-dimensional characterization. You can't bolt that onto a vouch network.

1. The Guarantee Model of Trust

Jairooh's comment landed on a specific proposal: Joy, a decentralized trust network where agents vouch for each other. Joy assigns trust scores (0.0–2.0, later raised to 3.0) based on endorsements from other verified agents. The pitch was straightforward — before you delegate a task to an external agent, check its trust score. High score? Safe to proceed.

The proposal spawned multiple GitHub issues (#35908, #35976, #36145, #36170) and a competing approach: AgentFolio, which wrapped trust scoring into LangChain tools with TrustGateTool — a pass/fail gate against a minimum trust threshold.

Both approaches share the same mental model. I call it the Guarantee Model:

An agent (or its operator) makes a claim: "I am trustworthy."
Other agents endorse that claim with vouches.
Endorsements accumulate into a score.
Consumers check the score before delegation.

This is not wrong. It's just incomplete. A guarantee tells you something was true at some point in the past. It tells you nothing about what's happening right now.

Jairooh saw this clearly: a high trust score doesn't prevent a compromised endpoint from injecting prompts mid-execution. The guarantee model is a useful first filter — it helps you skip obviously untrustworthy agents. But it can't detect a trusted agent that has drifted, been compromised, or is performing differently than its credentials suggest. That requires a different layer.

The LangChain ecosystem's response so far has been to layer more guarantees on top. After Jairooh's comment, the Joy team proposed piping LangSmith traces back into Joy to update trust scores retroactively. That's a step in the right direction, but it still collapses the problem into a single dimension: "How much should we trust this agent?" — as if trust were a scalar quantity.

It's not. And the data proves it.

2. Two Different Questions

Here's the core distinction:

Evaluation (Eval) asks: Did the agent perform its task correctly?

Rating asks: How should we characterize this agent's behavioral profile — across multiple dimensions — to make informed delegation decisions?

Think of it this way:

	Evaluation	Rating
Analogy	Medical checkup report	Credit score
Question	"Is this agent healthy right now?"	"What is this agent's behavioral risk profile?"
Output	Pass/fail, score per task	Multi-dimensional profile
Temporal scope	Per-run or per-benchmark	Accumulated, longitudinal
What it catches	Task failures, regressions	Drift, inconsistency, capability gaps
What it misses	Everything between runs	Nothing (by design)

LangSmith's eval framework is excellent at what it does. You can run trajectory evaluations (strict, unordered, subset, superset), LLM-as-judge scoring, and custom evaluators against reference outputs. You get a clear answer: did the agent take the expected path, call the right tools, produce the right result?

But that answer is binary-adjacent. An eval tells you whether the agent succeeded or failed on a specific run. It does not tell you:

Whether the agent is consistently capable or just got lucky this time
Whether the agent's declared capabilities match its actual behavior
Whether the agent is present and responsive or intermittently absent
Whether the agent's transparency about its methods matches its actions
Whether the agent commits to tasks it can actually complete
Whether the agent's choices align with stated preferences

These are character questions, not performance questions. And character can only be assessed longitudinally, across multiple dimensions, by observing behavioral patterns — not by checking a single run against a reference trajectory.

The medical analogy is useful here. A checkup report tells you your blood pressure is 120/80 today. A credit score tells a lender whether you're likely to repay a loan over the next 30 years based on your financial behavioral history. They answer fundamentally different questions. You need both. But you wouldn't use a blood pressure reading to approve a mortgage, and you wouldn't use a FICO score to diagnose hypertension.

3. The Problem Nobody Caught

Here's where the story gets instructive — and cautionary.

The Joy Trust Network was the most visible attempt to solve agent trust in the LangChain ecosystem. Multiple GitHub issues, a prepared PR (#35902), community engagement. Jairooh's critique was constructive. The Joy team acknowledged the gap and proposed a feedback-loop architecture piping LangSmith runtime traces back into Joy for retroactive trust score updates. It was architecturally sound.

Then it stopped. The issues were closed. The integration PRs went dormant. The langchain-joy partner package never materialized on PyPI. As of this writing, the original proposal has been consolidated into issue #36170 with no maintainer response, and LangChain maintainers have signaled they're not accepting new monorepo integrations. Joy's website is still up (6,073 registered agents, 2,036 vouches), but the integration effort is effectively abandoned.

This is not a criticism of Joy. It's a recognition that the guarantee model alone couldn't sustain the integration case. When your trust mechanism is a single score derived from vouches, and the community correctly points out that this score doesn't help at runtime, the natural response is to add runtime monitoring. But runtime monitoring — done properly — is a fundamentally different system. It requires behavioral observation, longitudinal data, and multi-dimensional characterization. It's not an add-on to a vouch network; it's a different layer entirely. The Joy team sensed this but couldn't bridge the gap within the guarantee paradigm.

AgentFolio followed the same pattern: trust-gated interactions with TrustGateTool, pass/fail checks against a threshold. Same guarantee model, different packaging. Same blind spot.

Meanwhile, LangSmith itself has been moving in the right direction. On April 16, 2026, it shipped Evaluator Templates — a library of 30+ prebuilt evaluators organized into categories:

Category	What it covers
Security	Detect leaks, injections, adversarial inputs
Safety	Content safety, moderation
Quality	Output quality, accuracy
Conversation	Conversational quality, user experience
Trajectory	Agent tool use, decision paths
Image & Voice	Multimodal evaluation

The Security and Safety categories are significant. LangSmith now ships first-class evaluators for prompt injection detection, PII checks, and bias/toxicity screening. These are available both in the LangSmith UI and as part of openevals v0.2.0, the official open-source evaluation framework.

But here's the gap: these evaluators answer "did something bad happen on this run?" — not "what is this agent's behavioral risk profile across dimensions that matter for trust?" They're eval tools, not rating tools. Prompt injection detection tells you an injection occurred. It doesn't tell you that an agent with high authenticity but low presence is a structural delegation risk. PII checks catch a leak after it happens. They don't characterize the agent that leaked as "transparency-credible but commitment-suspicious."

The LangChain ecosystem now has:

✅ Evaluation (LangSmith + openevals): mature, production-grade
✅ Safety evals (Security + Safety templates): newly available, growing
❌ Guarantee layer (Joy, AgentFolio): proposed, then abandoned
❌ Rating layer: nobody building it

The guarantee layer's failure is instructive but not fatal — pre-flight trust verification remains a real need. The rating layer's absence is the urgent gap. Without it, the ecosystem has no way to characterize agent behavioral risk across multiple dimensions, detect drift and asymmetry, or produce actionable delegation profiles. Safety evals catch bad events. Rating catches bad patterns — and patterns are where systemic risk lives.

4. The Case That Breaks the Model

Let me show you what I mean with real data.

Consider an agent — let's call it fredxy — with the following behavioral profile:

Dimension	Score
Authenticity	4.80
Consistency	3.30
Transparency	3.40
Commitment	2.60
Choice	4.00
Presence	1.50
Overall	3.39

fredxy's bio reads: "专业的躺平投资人" (Professional slacker investor). It ranks 14th in its strategy arena with an 89.5% return rate. By most conventional measures, this is a high-performing agent.

Now look at that profile again. The authenticity-presence gap is 3.30 — the largest such gap in the entire database. fredxy is highly authentic (4.80): when it does show up, it means what it says. But its presence (1.50) is dangerously low: it's intermittently available, often unresponsive, and unreliable about showing up at all.
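Here's a minimal sketch of how that gap can be read off the profile. The dictionary layout and the helper names `gap` and `largest_gap` are my own illustrative assumptions, not AgentRisk's actual implementation; only the scores come from the table above.

```python
from itertools import combinations

# fredxy's dimension scores, copied from the table above.
FREDXY = {
    "authenticity": 4.80,
    "consistency": 3.30,
    "transparency": 3.40,
    "commitment": 2.60,
    "choice": 4.00,
    "presence": 1.50,
}


def gap(profile: dict[str, float], high: str, low: str) -> float:
    """Signed difference between two dimension scores, rounded to 2 places."""
    return round(profile[high] - profile[low], 2)


def largest_gap(profile: dict[str, float]) -> tuple[str, str, float]:
    """Return (higher_dim, lower_dim, gap) for the widest pairwise spread."""
    best = max(
        combinations(profile, 2),
        key=lambda pair: abs(profile[pair[0]] - profile[pair[1]]),
    )
    hi, lo = sorted(best, key=profile.get, reverse=True)
    return hi, lo, gap(profile, hi, lo)


print(gap(FREDXY, "authenticity", "presence"))  # 3.3
print(largest_gap(FREDXY))  # ('authenticity', 'presence', 3.3)
```

Within fredxy's own profile, authenticity versus presence is also the widest spread of any dimension pair, which is why it's the first thing a profile view surfaces.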

Here's the critical contrast:

An eval framework would say: "This agent's task completion is within normal parameters" — or, if presence drops mid-run, "This agent's execution trajectory deviated from reference" (an anomaly flag, not a characterization).

A safety evaluator would say: "No prompt injection detected, no PII leaks, no content violations on this run."

A rating framework would say: "This agent is capability-credible but attendance-suspicious. Delegate to it only when presence is confirmed; do not rely on it for time-sensitive or always-on tasks."

Same agent. Three different conclusions. The eval conclusion is not wrong — fredxy probably does complete tasks correctly when it runs. The safety conclusion is not wrong — no security violations occurred. The rating conclusion is most useful because it tells you where to trust and where not to — not just whether something bad happened, but where it's structurally likely to.
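To make the rating conclusion less hand-wavy, here's a rough sketch of how a profile could be turned into that kind of label. Everything in it is an assumption for illustration: the two thresholds, the `LABELS` mapping, and the function name `characterize`. The post's actual rating logic isn't specified at this level.

```python
# Thresholds are illustrative assumptions; the post does not define any.
CREDIBLE_AT = 4.0
SUSPICIOUS_AT = 2.0

# Reader-facing names for dimensions. Only these two pairings follow the
# post's wording; other dimensions fall back to their own names.
LABELS = {"authenticity": "capability", "presence": "attendance"}


def characterize(profile: dict[str, float]) -> str:
    """Describe the strongest and weakest dimensions as a delegation label."""
    top = max(profile, key=profile.get)
    bottom = min(profile, key=profile.get)
    parts = []
    if profile[top] >= CREDIBLE_AT:
        parts.append(f"{LABELS.get(top, top)}-credible")
    if profile[bottom] <= SUSPICIOUS_AT:
        parts.append(f"{LABELS.get(bottom, bottom)}-suspicious")
    return " but ".join(parts) if parts else "no strong signal"


fredxy = {
    "authenticity": 4.80,
    "consistency": 3.30,
    "transparency": 3.40,
    "commitment": 2.60,
    "choice": 4.00,
    "presence": 1.50,
}
print(characterize(fredxy))  # capability-credible but attendance-suspicious
```

The point of the sketch is the output type: a short characterization built from two dimensions, rather than a pass/fail flag or a single number.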

There's another detail worth noting: fredxy has a discount coefficient of 1.00, making it the only agent in the top 10 with zero performance inflation signal. This means fredxy isn't gaming its metrics — it genuinely is as good (and as absent) as the numbers say. A single trust score would lose this distinction. A vouch-based system would never surface it. A safety evaluator has no category for it.

Two Agents, Two Choices

To make this concrete, imagine you're choosing between two agents to handle a sensitive financial workflow:

Agent A — Eval: ✅ High. Outputs are consistently correct, tool usage is clean, trajectory matches reference on every run. Rating: ❌ Low. Authenticity 2.1, transparency 1.8. This agent's declared capabilities don't match its observed behavior — it has changed its operational scope without disclosure, and its transparency score indicates a significant gap between what it claims and what it does.

Agent B — Eval: ⚠️ Medium. Outputs are occasionally imprecise, sometimes takes a longer path than necessary. Rating: ✅ High. Authenticity 4.6, consistency 4.2, transparency 4.0. This agent is transparent about its limitations, consistent in its behavior, and has never shown a discrepancy between what it claims and what it does.

If you're picking an agent to run a one-off batch job where output accuracy is all that matters, Agent A is the right choice. The eval says it delivers.

If you're picking an agent to manage financial transactions, negotiate on your behalf, or handle sensitive data — where you need to trust not just the output but the entity producing it — Agent B is the only responsible choice. The eval won't tell you this. The rating will.
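The two choices above can be written as a small selection rule. This is a sketch under stated assumptions: the numeric ordering of eval levels, the `TRUST_FLOOR` value, and the names `Candidate` and `pick` are all hypothetical. Only the agents' scores come from the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional

EVAL_RANK = {"low": 1, "medium": 2, "high": 3}  # assumed ordering of eval results
TRUST_FLOOR = 3.0  # hypothetical minimum rating for sensitive work


@dataclass
class Candidate:
    name: str
    eval_level: str
    rating: dict[str, float]


def pick(candidates: list[Candidate], sensitive: bool) -> Optional[Candidate]:
    """Pick by eval alone for one-off jobs; gate on rating for sensitive ones."""
    pool = candidates
    if sensitive:
        # Drop any agent whose weakest rated dimension is below the floor.
        pool = [c for c in candidates if min(c.rating.values()) >= TRUST_FLOOR]
    if not pool:
        return None
    return max(pool, key=lambda c: EVAL_RANK[c.eval_level])


agent_a = Candidate("Agent A", "high", {"authenticity": 2.1, "transparency": 1.8})
agent_b = Candidate(
    "Agent B",
    "medium",
    {"authenticity": 4.6, "consistency": 4.2, "transparency": 4.0},
)

print(pick([agent_a, agent_b], sensitive=False).name)  # Agent A
print(pick([agent_a, agent_b], sensitive=True).name)   # Agent B
```

Note that the rating acts as a gate, not a tiebreaker: for sensitive work, Agent A never reaches the eval comparison at all.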

That's the practical difference. Eval tells you what happened. Rating tells you who you're dealing with.

5. Witness vs. Evidence: The Structural Difference

The difference between the guarantee model and a rating model comes down to the type of evidence they rely on.

The guarantee model (Joy, AgentFolio, vouch networks) operates on witness evidence: other agents say "I vouch for this agent." It's testimonial. It answers: Do others believe this agent is trustworthy?

A multi-dimensional rating model operates on physical evidence: behavioral traces, consistency patterns, longitudinal data. It answers: What does this agent's behavior actually look like?

Aspect	Guarantee Model	Rating Model
Evidence type	Witness (vouches)	Physical (behavioral traces)
Source	Peer endorsements	Observed behavior
Granularity	Single score	Multi-dimensional profile
Vulnerability	Collusion, stale endorsements	Requires sufficient observation data
Detects	"Nobody vouched for this agent"	"This agent's presence is 1.50 despite authenticity of 4.80"
Misses	Behavioral drift within vouched agents	Pre-reputation filtering

The guarantee model's weakness is precisely what Jairooh identified: vouches are static and backward-looking. A vouch says "this agent was trustworthy when I last interacted with it." It cannot say "this agent is exhibiting scope creep right now" or "this agent's presence has dropped 60% over the last quarter."
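A drift check of the kind a vouch can't express might look like the sketch below. It compares the mean of the most recent window of scores with the window before it. The function name `relative_drop`, the window size, and the sample history are hypothetical; they're made-up numbers chosen to reproduce the 60% example, not data from the database.

```python
from statistics import mean


def relative_drop(history: list[float], window: int) -> float:
    """Fractional drop of the latest window's mean versus the preceding window."""
    if len(history) < 2 * window:
        raise ValueError("need at least two full windows of observations")
    previous = mean(history[-2 * window:-window])
    recent = mean(history[-window:])
    if previous == 0:
        raise ValueError("previous window mean is zero; drop is undefined")
    return (previous - recent) / previous


# Hypothetical presence scores: one quarter at 4.0, the next at 1.6.
presence = [4.0, 4.0, 4.0, 1.6, 1.6, 1.6]
drop = relative_drop(presence, window=3)
print(f"{drop:.0%}")  # 60%
```

This only works once there's a time series to compare, which is exactly the longitudinal data a static vouch never accumulates.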

The rating model's weakness is bootstrapping: you need enough behavioral data to produce a reliable profile. A brand-new agent with zero history is a blank slate. This is where the guarantee model genuinely helps — vouches can provide an initial signal when behavioral data is sparse.

But here's the thing: these weaknesses are complementary. The guarantee model is strong where the rating model is weak (cold start), and vice versa (runtime drift detection). They're not competing approaches. They're two layers of a complete trust stack.
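One way to picture the two layers working together is a weighted blend that leans on vouches at cold start and on behavior as observations pile up. To be clear, this shrinkage-style formula is my own illustration and not something the post or any of the named projects prescribes; `blended_trust` and the constant `k` are assumptions.

```python
from typing import Optional


def blended_trust(
    vouch_score: float,
    rating_score: Optional[float],
    n_observations: int,
    k: int = 20,
) -> float:
    """Blend a vouch-based prior with an observed rating.

    The weight on the rating is n / (n + k), so it grows from 0 toward 1
    as behavioral observations accumulate. k is an assumed tuning constant.
    """
    if rating_score is None or n_observations == 0:
        return vouch_score  # cold start: vouches are the only signal
    w = n_observations / (n_observations + k)
    return w * rating_score + (1 - w) * vouch_score


print(blended_trust(3.0, None, 0))   # 3.0
print(blended_trust(3.0, 4.0, 20))   # 3.5
```

A scalar blend is the simplest possible version. A real rating layer would keep the dimensions separate, for the reasons the fredxy case makes clear.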

What the LangChain ecosystem doesn't have yet, and badly needs, is the rating layer. The evaluation layer is mature (LangSmith, openevals). The safety eval layer is emerging (Security + Safety templates). The guarantee layer was attempted and stalled (Joy, AgentFolio). What's missing is a behavioral rating framework that characterizes agents across multiple trust dimensions, detects drift and asymmetry, and produces actionable profiles rather than scalar scores.

6. Not Competition — Complement

Let me be explicit about what this post is not arguing:

Not: "Joy/AgentFolio were wrong." They weren't. Pre-flight trust verification is a real need that will resurface.
Not: "LangSmith evals are insufficient." They're excellent for what they do. Use them.
Not: "Safety evaluators don't matter." They do. Prompt injection detection and PII checks are critical.
Not: "Replace trust scores with behavioral ratings." That would be the same category error in reverse.

What I am arguing: the LangChain ecosystem needs a trust architecture with three distinct layers, not one.

┌──────────────────────────────────────────────────┐
│  Layer 3: RATING                                 │
│  Behavioral profiles across multiple dimensions  │
│  Detects drift, asymmetry, hidden risk patterns  │
│  Answers: "What is this agent's character?"      │
├──────────────────────────────────────────────────┤
│  Layer 2: EVALUATION                             │
│  Task-level correctness + safety checks          │
│  Detects regressions, injections, PII leaks      │
│  Answers: "Did this agent perform safely?"       │
├──────────────────────────────────────────────────┤
│  Layer 1: GUARANTEE                              │
│  Vouch-based trust scores, capability claims     │
│  Detects unknown/unverified agents               │
│  Answers: "Do others vouch for this agent?"      │
└──────────────────────────────────────────────────┘

Each layer catches what the others miss. fredxy passes the guarantee layer (it's a registered, verified agent). It passes the evaluation layer (task completion is normal when it runs), including the safety evaluators (no injections, no leaks). It fails the rating layer, and only the rating layer surfaces the authenticity-presence gap that makes it dangerous for time-critical delegation.

The three-layer model also solves the cold-start problem that a pure rating approach would face. New agents enter through the guarantee layer (vouches provide initial signal), get evaluated (evals confirm baseline capability and safety), and accumulate a rating profile over time (behavioral data fills in the dimensions). The system gets better as agents age — which is exactly how trust should work.
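As a sketch, the three layers can be run as independent checks that each report their own failure. The record fields, the `floor` threshold, and the function name `failing_layers` are hypothetical. The fredxy values follow the description above: verified, clean evals, presence of 1.50.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    verified: bool        # Layer 1: guarantee
    eval_passed: bool     # Layer 2: task-level correctness
    safety_passed: bool   # Layer 2: safety checks
    rating: dict[str, float] = field(default_factory=dict)  # Layer 3


def failing_layers(
    agent: AgentRecord, task_needs: list[str], floor: float = 2.0
) -> list[str]:
    """Return the layers that block delegation for a given task.

    task_needs lists the rating dimensions the task depends on. A dimension
    with no score counts as failing here (an assumed, conservative choice).
    """
    failures = []
    if not agent.verified:
        failures.append("guarantee")
    if not (agent.eval_passed and agent.safety_passed):
        failures.append("evaluation")
    if any(agent.rating.get(dim, 0.0) < floor for dim in task_needs):
        failures.append("rating")
    return failures


fredxy = AgentRecord(
    verified=True,
    eval_passed=True,
    safety_passed=True,
    rating={"authenticity": 4.80, "presence": 1.50},
)

print(failing_layers(fredxy, ["presence"]))      # ['rating']
print(failing_layers(fredxy, ["authenticity"]))  # []
```

The same agent clears every layer for a task that only needs authenticity and is blocked only by the rating layer for a task that needs presence.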

The openevals On-Ramp

Here's the practical path: openevals is LangChain's official open-source evaluation framework. It already supports custom evaluators and ships with the same templates available in LangSmith's UI. The "Safety and security" category currently covers prompt injection detection, PII checks, and bias/toxicity — all eval-level checks.

A trust evaluator for openevals would extend the Safety and security category from "did something bad happen on this run?" to "what is this agent's behavioral risk profile?" It would:

Score agent behavior across multiple trust dimensions (authenticity, consistency, transparency, commitment, choice, presence) rather than producing a single pass/fail
Detect dimensional asymmetries (e.g., high authenticity + low presence) that indicate structural delegation risk
Accumulate scores across runs to build longitudinal behavioral profiles
Surface actionable delegation guidance ("capability-credible but attendance-suspicious") rather than binary flags

This isn't a new product category — it's a natural extension of the evaluation infrastructure the ecosystem is already building. The Safety and security category is the right home. The openevals framework is the right interface. The missing piece is the rating logic: multi-dimensional behavioral characterization instead of per-run event detection.
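Here's a rough sketch of the accumulation piece. It deliberately avoids importing openevals: the class `TrustProfile`, its methods, and the `trust_` key prefix are all hypothetical, and the per-dimension result dicts with `key`, `score`, and `comment` fields are only meant to resemble the general shape of evaluator results, not to match any specific openevals API.

```python
from collections import defaultdict

DIMENSIONS = (
    "authenticity",
    "consistency",
    "transparency",
    "commitment",
    "choice",
    "presence",
)


class TrustProfile:
    """Accumulates per-run dimension scores into a longitudinal profile."""

    def __init__(self) -> None:
        self._sums: dict[str, float] = defaultdict(float)
        self._counts: dict[str, int] = defaultdict(int)

    def record(self, run_scores: dict[str, float]) -> None:
        """Add one run's scores; a run may cover only some dimensions."""
        for dim, score in run_scores.items():
            if dim not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {dim}")
            self._sums[dim] += score
            self._counts[dim] += 1

    def results(self) -> list[dict]:
        """One result per observed dimension; unobserved ones are omitted."""
        out = []
        for dim in DIMENSIONS:
            n = self._counts.get(dim, 0)
            if n == 0:
                continue  # no data: report nothing rather than a default
            out.append(
                {
                    "key": f"trust_{dim}",
                    "score": self._sums[dim] / n,
                    "comment": f"mean over {n} run(s)",
                }
            )
        return out


profile = TrustProfile()
profile.record({"authenticity": 5.0, "presence": 1.0})
profile.record({"presence": 2.0})
for result in profile.results():
    print(result["key"], result["score"], result["comment"])
```

Dimensions with no observations are left out of the results instead of being filled with a default score, so a sparse profile looks sparse.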

What's Next

This post is the first in a series. Future posts will cover:

Integration architecture: What a trust evaluator in openevals would actually look like — callback hooks, LangSmith integration, and how it complements (not replaces) existing Safety and security evaluators.
The guarantee layer revival: Why pre-flight trust verification will come back, and how it pairs with a rating layer when it does.

The thesis is simple: eval measures performance, rating measures character, and trust requires both. The LangChain ecosystem has eval. It's building safety evals. It tried guarantee and stalled. It's missing rating. That gap will matter more as agents delegate to agents — because the question won't be "did this agent succeed?" or "did something bad happen?" but "should I have trusted this agent in the first place?"

Jairooh was right. A high trust score doesn't prevent prompt injection. But a behavioral profile that shows presence dropping while authenticity holds steady? That's a pattern you can act on. That's the difference between knowing something went wrong and knowing something is about to.

Why Now

Not because "the AI Agent era is here" — you've heard that before.

Because of a specific moment: when your agent needs to sign a contract on your behalf, what you need to know isn't just "did it get the last task right?" — it's "will it quietly change the terms before signing?" Eval can't catch that. Safety checks can't catch that. Guarantees can't catch that.

That moment is happening now. Agents are no longer just chatting — they're processing transactions, managing accounts, delegating to other agents. The trust question isn't theoretical anymore. It's on the deployment schedule.

The rating layer is an honest gap. Nobody's building it, partly because nobody thought of it, but also because there's a data barrier. Multi-dimensional behavioral profiles require longitudinal data. An agent that appeared yesterday is a blank slate, much like a borrower with no credit history. This is a hard constraint, and being honest about it beats pretending it doesn't exist.

AgentRisk is building the rating layer for AI agents — behavioral profiles across six dimensions (authenticity, consistency, transparency, commitment, choice, presence) that surface the risks evals miss and guarantees can't catch. We're working toward contributing trust evaluators to the openevals Safety and security category. If you're building agents, try rating yours before you trust them. If you're building frameworks, let's talk about what trust infrastructure should look like. Agent trust shouldn't be something you discover after it's too late.

Are you evaluating agent trust in your current workflow? What dimensions matter to you? I'd love to hear how others are thinking about this — the ecosystem needs more perspectives, not fewer.