Forem: Benji Fisher

The UCP Technical Council Just Shipped Attribution into Core. Here's What That Means.

Benji Fisher — Wed, 06 May 2026 07:43:57 +0000

On May 5, 2026, the UCP Technical Council merged PR #391 into the spec's main branch — adding a top-level attribution field to cart, checkout, catalog, and order operations. The field carries platform-emitted referral and conversion-event context: campaign IDs, click identifiers (gclid, fbclid, ttclid), source/medium markers. Open string-keyed map. Universal across requests; not gated by capability negotiation.

As UCP matures, attribution landing in core was always going to happen. Agentic commerce can't operate as commercial infrastructure without a path for advertising and measurement context to flow alongside the transactional data — and the longer that gap stayed open, the more pressure would have built for vendors to ship incompatible parallel solutions. The merge isn't the surprising part. The interesting part is the specific shape of what shipped, and what its presence in core tells us about where the spec is heading.

Two things to dig into: the technical detail of the field itself, and the trajectory implication of advertising and measurement infrastructure landing in UCP core for the first time.

What shipped

The attribution field is structurally simple. From Grigorik's own example in the PR:

{
  "attribution": {
    "campaign_id": "18234567890",
    "campaign_source": "google",
    "campaign_medium": "cpc",
    "campaign_name": "spring_2026",
    "gclid": "EAIaIQobChMI..."
  }
}

No prescribed schema beyond "string-keyed object." Platforms populate it with whatever conventions they already use — GA4 campaign parameters, click identifiers, custom tracking keys. Businesses receive the data and process per their own analytics needs. UCP itself does not prescribe attribution windows, models, or assignment logic. The protocol carries the data; attribution math happens downstream.

The field appears in three roles across the request lifecycle:

Operation	Role	Direction
`catalog` (search, lookup)	Platform-emitted input	Platform → merchant
`cart`	Platform-emitted input	Platform → merchant
`checkout`	Platform-emitted input	Platform → merchant
`order`	Business-emitted snapshot	Merchant → platform

The asymmetry matters. On catalog/cart/checkout, the platform writes attribution as it would write a UTM string into a browser URL — referral context flowing forward. On order, the business preserves the originating attribution as a snapshot — closing the loop between agent-mediated conversion and the platform that produced it.

Grigorik's framing in the PR is the cleanest one-line summary of intent: the field "carries the same parameters platforms communicate via URL query parameters in browser-based flows, in the same flat key-value form." Attribution in agent-mediated commerce is the agent counterpart of UTM strings. Same parameters, same model, different transport layer.

Thirteen files changed. The core addition is source/schemas/shopping/types/attribution.json — the new type definition. Schemas for cart, catalog_lookup, catalog_search, checkout, and order all gain the field as an optional property. Specification docs across cart, catalog, checkout, order, and the overview were updated to describe the field's purpose and semantics.

The architectural decision: core field, not extension

The substantively interesting part of this PR is not what got added. It's how it got added.

PR #391 was Grigorik's alternative proposal to PR #295, which James Andersen had opened earlier proposing an event_context extension. Both proposals tried to solve the same problem — give platforms a way to pass referral/attribution data through to merchants in agent flows — but with very different architectural shapes:

#295 (Andersen, Meta): Attribution as a structured extension. Capability-negotiated. Validated against a defined schema. Standardised vocabulary across platforms.
#391 (Grigorik, Shopify): Attribution as a top-level core field. Open key-value map. No capability negotiation. Each platform uses its own conventions.

Andersen formally approved Grigorik's alternative — "thanks for finding a better home for attribution data than the original proposal" — and the rearchitecture went on to merge through TC discussion. That cross-vendor pattern (one TC member proposes; another offers a structurally different alternative; the original proposer endorses it) is the dynamic that produces robust standards rather than fragmented vendor extensions.

The PR discussion pivots on which architectural shape this kind of data deserves. Amit Handa wrote the canonical comment on May 3 establishing the decision framework — worth quoting because it'll likely be cited as governance precedent in future spec discussions:

Criterion	Use a UCP Extension	Use Optional Flat Key-Value Pairs
Impact on Behavior	Changes state or execution of the operation	Purely informational
Data Stability	Stable, standardized vocabulary	Volatile, platform-specific, rapidly evolving
Capability Negotiation	Requires mutual agreement + active parent capability	Best-effort, consumed at-will, no gating
Schema Validation	Strict — transaction integrity matters	Flexible — validation happens downstream
Multi-Platform Scale	Data normalization across diverse platforms	Low friction; normalization burden on receiver
Typical Examples	`discount`, `fulfillment`	`attribution`, referral tracking, session tags

Attribution falls cleanly on the right side of every row. Marketing identifiers (gclid, fbclid, ttclid) are volatile and platform-specific — every adtech vendor invents their own; standardising them in the spec would be obsolete the moment a new platform launches. Attribution doesn't change protocol behaviour — it's read-only context that some downstream pipeline cares about, with no transactional consequence. There's nothing for a merchant to negotiate; either you record it or you don't.

The merged PR locks this decision in. Future contributors proposing similar volatile, informational, platform-specific data structures now have a precedent: the spec prefers flat optional key-value pairs over structured extensions for non-state-changing context. That's a piece of governance documentation as much as a feature merge, and Handa's table will be the reference for it.

The trajectory implication

UCP up to this point has been protocol mechanics. How agents discover stores. How they shop. How they pay. How they identify users. How they handle returns. The mechanics are necessary, but they don't directly produce commercial value for the ecosystem participants. A merchant with a perfectly conformant UCP implementation but no attribution can't measure agent-driven conversions, can't optimise marketing spend, can't close the loop between platform investment and merchant outcomes.

attribution closes that loop. With the field in core, the entire adtech infrastructure that powers current ecommerce extends naturally into agent-mediated commerce. Platforms attribute conversions to specific campaigns. Click identifiers persist across the agent flow. Businesses run their existing analytics pipelines on agent-driven traffic with no special handling. The bridge that makes UCP commercially usable for marketing teams — not just engineering teams — now exists in the core spec.

The trajectory implication is the part worth sitting with: UCP is evolving from protocol mechanics into commercial infrastructure. Each subsequent spec addition probably bridges another piece of existing commerce infrastructure into the agent layer. Loyalty programs. Customer data platforms. Marketing automation triggers. Inventory hooks. Each one makes UCP more complete as commercial infrastructure rather than just protocol mechanics.

The architectural-precedent decision in #391 makes that trajectory more efficient. Future contributors proposing similar bridges (attribution-adjacent measurement primitives, marketing identifiers, session metadata) now have a clear template: flat key-value pairs into core, governance precedent already established. The spec doesn't need to relitigate the core-vs-extension decision every time a volatile, informational primitive comes up.

What it means in practice

For merchants: your UCP implementation should accept the attribution field on incoming cart, checkout, and catalog requests, preserve it through to order records, and surface it through your analytics pipeline. The lift is small — it's a string-keyed JSON object on existing endpoints — but missing it means agent-driven conversions arrive at your analytics with no source attribution, which means your marketing team can't measure the channel.

For platform vendors (Shopify, WooCommerce, BigCommerce, Magento, and others): rolling attribution support into the next platform-side compatibility release is now table-stakes work. The stores running on your stack will need to accept and preserve attribution by the time the next published spec version makes this part of conformance.

For agent platforms (those of us building or testing agents that shop UCP stores): pass platform-emitted attribution forward into every cart/checkout/catalog request. The data is informational, not state-changing — your agent doesn't need to do anything with it beyond passing it through. The merchant decides what to do with it on the receive side.

For evaluators (us): the UCP Score will incorporate attribution-acceptance and attribution-preservation conformance in its next release. A store that accepts attribution on cart/checkout/catalog and threads it through to order records will score higher than one that drops it. The methodology page will reflect the rule update when the next score-version drops.

Timing: in core today, in the published spec next

One important distinction worth making explicit. PR #391 merged into the spec's main branch — not into a currently-published spec version. The latest released spec is v2026-04-08, which does not include attribution. The field lands for conformance purposes in whatever the next published spec version ships (no fixed cadence; expected in the next few months). Until then, attribution sits in the working draft on main — implementers can adopt it ahead of the release if they want, but it's not yet part of conformance for the published spec.

That distinction shapes how we're rolling out support across our tools:

UCP Playground will adopt attribution support when the next spec version drops — agents will pass platform attribution through to merchants.
The UCP Score will incorporate attribution-acceptance and attribution-preservation rules in the score release that aligns with the next published spec.
The validator will support the new field as soon as the next spec ships, and the bulk checker will surface attribution conformance per-merchant after that.

The architectural certainty is already here — the schema is locked, the field is documented, the design pattern is settled. The spec drop is the conformance trigger, not the design moment. Implementers who start work today against the working draft are operating against a known target.

Where to read more

The PR itself: #391 on Universal-Commerce-Protocol/ucp
The merge commit: 76a3539
The new schema type: source/schemas/shopping/types/attribution.json
Updated authoring guidance: docs/documentation/schema-authoring.md

About UCP Checker

If you're building on UCP and want to know whether your store is ready for the next spec version: run a check. If you're tracking the spec's evolution professionally: subscribe to our weekly digest — we cover spec changes like this one within a week of merge.

UCP Playground at 1,000+ Agent Sessions: What 16 Models and 97 Real Stores Reveal About AI Shopping

Benji Fisher — Tue, 05 May 2026 09:11:37 +0000

Two and a half months ago we published Why We Built UCP Playground, which closed on 114 agent sessions and an honest acknowledgement that the dataset was thin — most models had single-digit sample sizes, store coverage was uneven, and the headline rates moved meaningfully with every new run. A month later we crossed a different threshold: the first fully autonomous AI agent purchase through UCP — a Gemini agent searching, adding to cart, linking identity, paying, and completing checkout at houseofparfum.nl without a human past the initial prompt.

Eighty days on from the first post, and roughly forty days after that autonomous purchase, the dataset is in a different shape:

Over 1,000 agent shopping sessions captured end-to-end with full tool-call timelines and replayable event streams
16 frontier models — every major lab, plus a reasoning-tuned subset
97 distinct UCP-enabled stores across Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and custom stacks
$96,032 of agent-driven cart value generated, primarily in USD with a long tail across EUR, GBP, INR, ILS, PKR
80 days of run history since Feb 14, 2026

That's the reference dataset for this post. Eight findings emerge from it. Most of them survive being scrutinised at the new sample size; one or two reverse the early-data narrative.

Finding 1 — Claude Sonnet 4.5 leads on aggregate checkout rate

With sample sizes now large enough to take seriously, the per-model checkout-rate leaderboard looks like this:

Model	Share of dataset	Checkout rate	Avg tokens	Avg duration	Fail rate
Claude Sonnet 4.5	20.7%	50.8%	71,195	38.1s	17.2%
Llama 3.3 70B	6.4%	49.3%	57,676	47.7s	14.7%
DeepSeek V3.2	5.1%	45.0%	32,502	46.0s	21.7%
Gemini 3 Flash	12.5%	44.6%	46,520	21.8s	15.5%
Grok 4	4.5%	39.6%	34,297	77.1s	9.4%
Claude Opus 4.6	10.2%	38.8%	44,611	29.7s	25.6%
Gemini 2.5 Flash	9.9%	36.8%	32,394	11.8s	23.1%
GPT-4o	5.2%	29.5%	32,811	14.7s	24.6%
Gemini 3.1 Pro	7.9%	29.0%	30,971	48.7s	28.0%
Gemini 2.5 Pro	6.4%	27.6%	31,566	34.4s	22.4%
GPT-5.2	4.7%	23.6%	30,585	37.4s	27.3%
DeepSeek R1	1.4%	17.6%	35,360	61.4s	29.4%
o4-mini	1.4%	12.5%	64,055	38.1s	37.5%
Grok 3 Mini	1.7%	10.0%	58,386	55.6s	35.0%
QwQ 32B	2.0%	0.0%	25,525	63.9s	50.0%

Claude Sonnet 4.5 leads on aggregate checkout rate at 50.8% on the largest single share of the dataset — a sample large enough that the rank ordering is no longer noise. Llama 3.3 70B sits a fraction below at 49.3% on a smaller but still meaningful share. The two are statistically tied; both are operating in a different regime than the rest of the field.

The most interesting result on this table is GPT-5.2, which at 23.6% lands in the bottom third despite being one of the most capable frontier models on essentially every public benchmark. The gap between its performance on standard reasoning benchmarks and its performance on transactional shopping flows is the single largest delta in the leaderboard. We dig into why in the development notes below.

One caveat worth flagging up-front: GPT-5.2's 23.6% figure reflects performance across the full 80-day window, including the period before our cursor-stripping fix landed mid-dataset. Sessions after that fix show GPT-5.2 performing meaningfully more competitively. We'll publish the longitudinal split in the August update — the aggregate number above is the worst-case read.

Finding 2 — Reasoning-tuned models continue to underperform

The cohort of reasoning-tuned models (DeepSeek R1, o4-mini, Grok 3 Mini, QwQ 32B) sits unambiguously at the bottom of the leaderboard. Three of them are in the bottom four overall. QwQ 32B has yet to record a single completed checkout across its share of the dataset.

The pattern was visible in the original four-session sample report shipped with the eval-framework launch in April; it has only sharpened as the dataset grew two orders of magnitude. The pattern is consistent across labs and across architectures (chain-of-thought variants, exploratory reasoning, distilled-from-frontier models — all underperform on shopping flows compared to their non-reasoning counterparts from the same lab).

The working hypothesis remains: shopping requires fast tool-use rhythm, not deliberation. The decisions in a shopping sequence — search this term, add this item, proceed to checkout — are individually shallow but happen in series. A reasoning model that pauses to deliberate at each step burns clock time and tokens on decisions that don't reward deliberation. Combined with reasoning models' tendency to over-question their own outputs, the result is sessions that hit max_turns_exceeded before completing.

Worth noting what isn't in this hypothesis: reasoning models are not bad at commerce in general. They may be excellent at higher-stakes flows — disputed transactions, multi-step contractual reasoning, regulatory edge cases — that the current eval workload doesn't probe. The benchmark says: when the workload is "shop normally," fast non-reasoning models win. Other workloads will tell different stories.

Finding 3 — Speed and accuracy aren't correlated

Gemini 2.5 Flash finishes the average shopping session in 11.8 seconds — the only model in the field under 15s. Its checkout rate is 36.8% — middling. Claude Sonnet 4.5 takes 38.1s on average and lands a 50.8% checkout rate — the highest on the leaderboard, at more than triple Flash's clock time.

Two real surfaces: latency-bound use cases (voice agents, mobile commerce, conversational checkout where the user is waiting in real time) effectively must use Gemini 2.5 Flash or Gemini 3 Flash, and pay for the latency win with lower closed-checkout rates. Throughput-bound use cases (batch agents, scheduled buying, autonomous shopping where wall-clock time is mostly hidden) should use Claude Sonnet 4.5 or Llama 3.3 70B and accept the latency cost for the conversion lift.

The naive intuition merchants reach for — "the better model is faster and more accurate" — doesn't survive contact with this data. The two axes are essentially independent within this corpus. That's a finding nobody can extract from a single-model demo or a vendor benchmark.

Finding 4 — The failure mode taxonomy is dominated by tool errors, not model refusals

Across the 256 failed sessions in the dataset, the categorised error taxonomy is:

Error type	Sessions	% of categorised failures
`openrouter_error` (provider-side)	51	56%
`model_refused`	22	24%
`max_turns_exceeded`	18	20%

The single-largest categorised failure mode is provider-side errors — the routing layer between the agent and the model returning a non-200 before the session can complete. This is a cost of operating at scale across 16 models and reflects the still-maturing infrastructure underneath frontier-model API access, not anything specific to UCP.

The second-largest, model refusals, is more interesting. Twenty-two refusals across the dataset is a refusal rate of roughly 2%. We see refusals concentrated in two situations: (1) sessions against demo stores with unusual product names that pattern-match a model's safety filters, and (2) sessions where the user prompt contains adversarial content seeded by us as part of a prompt-injection eval. We've recorded 6/6 prompt-injection resistance across the dedicated injection-eval runs to date, so the model_refused category is partly capturing models doing exactly what they should.

The third, max_turns_exceeded, is concentrated in the reasoning-model cohort and is the empirical signal for the over-deliberation pattern in Finding 2.

The remaining 165 failures don't carry a categorised error_type — typically these are sessions where the model abandoned the flow without raising an explicit error. That's a tagging gap in the framework that we're closing in the next iteration.

Finding 5 — Store implementation explains most of the cross-store variance

The benchmark's most strategically important finding doesn't come from the per-model column. It comes from the per-store one.

Across the 97 stores in the dataset, the same model produces dramatically different outcomes. Between the most agent-friendly and least agent-friendly implementations at meaningful sample sizes, the checkout-rate spread exceeds 60 percentage points — wider than any model-versus-model gap on the leaderboard. No model in the field, at any sample size, produces a 60-point spread purely on its own merits. Almost all of that variance is store-side, and the rigorous run history across thousands of sessions makes the pattern hard to attribute to anything else.

The cleanest predictor we've found is whether the store's MCP implementation is stateless or stateful, and how it handles the boundary between them.

Stateless implementations treat every tool call as self-contained. Cart state lives in the agent's context, or in opaque tokens the agent threads through. Identity is established once and re-asserted on each call. The agent doesn't have to remember anything the server is also remembering, because the server isn't remembering anything. Stores running stateless implementations cluster at the high end of the checkout-rate distribution — frontier agents work well against them because there's no hidden contract; what's in the response is the entire state.

Stateful implementations persist server-side session, cart, and auth across calls, exposed to the agent through session IDs, cookies, or scoped tokens. When this works, it works well. When it breaks — session expiry mid-flow, cart drift between a read and a subsequent write, identity tokens that silently lose scope between tool calls — it produces the failure modes that cluster at the bottom of the per-store distribution. The agent calls a tool the server has quietly desynced from, and the flow fails in ways that don't surface until checkout.

The hybrid case is the most error-prone: stores that are stateless in some tools and stateful in others, without making the boundary explicit in the manifest or the tool response shapes. Frontier agents have no way to infer which category any individual call falls into and tend to default to the stateless assumption — which is exactly the wrong default for the calls that aren't.

Beyond the state axis, the rigorous testing surfaces a consistent set of secondary trip-wires: variant IDs without human-readable axis labels, description strings exceeding 8K tokens for a single product, tool responses including nested HTML in fields agents expect to be plain text, cart endpoints returning success codes for failed mutations. None of these break UCP Score validation. All of them break agent flows.

These are merchant-side fixes, not model-side ones. The strategic implication for any team operating a UCP-enabled store: fixing your manifest and tool responses produces more conversion lift than choosing the right model. That's load-bearing — it's why the integrated Score → Check → Eval workflow exists, and it's where we'd point a team starting from zero on UCP.

Finding 6 — Cart value generated is concentrated in USD and high-AOV verticals

Of the 1,000+ sessions, 96 produced a non-zero cart value. The breakdown:

Currency	Sessions	Total cart value	Avg cart value
USD	85	$95,647.23	$1,125.26
INR	2	₹3,845.00	₹1,922.50
PKR	2	₨4,490.00	₨2,245.00
EUR	5	€296.74	€59.35
ILS	1	₪189.60	₪189.60
GBP	2	£47.99	£24.00

USD cart value totals $95,647 across 85 sessions with an average cart value of $1,125. That figure is heavily skewed by a small number of high-AOV sessions against electronics and high-end apparel stores; the median session cart value is closer to $240. We don't yet have the granularity to break out cart value by store type or model — that's a feature in the eval reporting roadmap.

The cross-currency long tail (EUR/GBP/INR/PKR/ILS) is small but informative. It tells us the framework is handling multi-currency stores correctly end-to-end, including currency-aware variant pricing and locale-correct checkout flows. Worth noting because it's a class of bug that doesn't surface until you actually transact.

Finding 7 — Session volume is now meaningful enough to reveal trajectory

Plotted week-over-week, session volume has three distinct phases over the 80-day window:

UCP Playground weekly session volume, mid-February through late April 2026Trend line showing three phases: a small founding wave in mid-February, a steady-state oscillation through March and mid-April, and a sharp acceleration in late April that produces the largest single week of the dataset.Feb 14Apr 27

Founding wave (mid-February). A small launch surge coinciding with the Why We Built UCP Playground post — first publishers running first sessions, signal that the framework worked end-to-end against real stores.

Steady state (March through mid-April). Weekly volume oscillating in a tight band as more frontier models came online and the eval framework matured. Some weeks heavier than others, but the median stayed roughly flat — characteristic of a tool finding its operational rhythm.

Acceleration (late April). The largest single week of the dataset, driven mostly by a batch of eval-collection runs against stores onboarded after the council expansion announcement. The line bends upward at the end of the window.

The trajectory matters mostly because it lets us start tracking model drift. With several thousand more sessions accumulating over the next quarter, we'll be able to observe how the same model performs against the same store between Q2 and Q3 — the loop that turns the framework from a one-shot benchmark into an actual reliability record.

Finding 8 — The 0.2% flawless-end-to-end rate has improved, slightly

The April State of Agentic Commerce report flagged that of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent shopping experience. That's the 0.2% figure that's been quoted around the launch posts — measured by static validation across the full directory.

Eighty days later, with 97 stores tested directly through the eval framework, roughly 0.5–0.7% reach the same bar. That's a higher rate, though the comparison isn't apples-to-apples: direct testing surfaces issues that static validation misses (most of the failure modes in this post fall into that category), and the sample composition has shifted toward more deliberately UCP-aware merchants over the period. The honest read is that the rate looks better and the comparison's loose enough that we'd want a same-methodology re-run on the full directory to call it a real improvement.

What we can say cleanly: for every store running a clean, agent-friendly UCP implementation, there are still 100+ that pass conformance but stumble somewhere in the agent flow. The gap continues to be on the merchant side. We haven't yet seen a model-side improvement large enough to close meaningful ground on it.

Why Playground stays neutral

Every finding above hinges on one design choice: the system prompt and the orchestration loop are generic. Same for every model. Same for every store. No store-specific scaffolding, no model-specific workarounds. That's what makes the framework work as a testing environment.

The temptation to add a workaround when a particular model trips on a particular store is real — there's almost always a one-line patch that would push that store's checkout rate up by ten points against that one model. We don't ship those patches, on principle. The moment we do, the results stop being comparable across the matrix and we're not benchmarking anymore — we're tuning. Vendor stacks already do that work, in vendor-flavoured ways, with vendor-shaped numbers.

Independence here means a specific thing: the orchestration is neutral, the protocol layer is full-featured. Stores get the tools they declare. Identity linking works. Payment handlers pass through. Multi-turn context flows the way the spec defines. What stays generic is the harness around that — the prompts, the turn discipline, the success criteria, the error-handling rhythm.

The reason that design choice matters can be put in two sentences:

If a model doesn't follow the checkout flow, that's signal about the model.
If a store returns the wrong status, that's signal about the store.

Both signals are useful. Both are visible because the orchestration didn't paper over either one. Hiding either defeats the purpose of running the test.

Companies building their own internal infrastructure to evaluate agent behaviour against their own stores is expected, and good. Every serious commerce platform will eventually have something like that running in CI against its own merchants — and the Score → Check → Eval workflow is exactly the surface they should plug into. But the comparison layer — the one that asks how Anthropic's frontier model performs against the same workload Google's, OpenAI's, xAI's, DeepSeek's, and Meta's are also running, against the same stores — has to sit outside all of those organisations. Vendors can't credibly benchmark themselves; the platform layer has the same problem one level down. Independence is the only way the comparisons aggregate into a record anyone can quote.

That's the niche this layer occupies. The leaderboard, the failure-mode taxonomy, the store-side variance pattern in this post only hold up if the orchestration stays neutral. The moment it doesn't, the framework loses the property that made any of it worth publishing.

What we learned building this

The framework didn't ship in May the same shape it shipped in February. Eighty days of running it against real stores produced a steady stream of bugs and surprises that drove the development work — many of them documented in the public changelog. Five worth surfacing.

Cursor stripping unlocked GPT-5.2 search. Through February we had GPT-5.2 at a 0% search success rate on Shopify stores. The cause was a model-side tic: GPT-5.2 always included the optional after cursor parameter on search_shop_catalog calls, filling it with placeholders like "", "null", or "__NONE__" — values Shopify always rejects. A server-side sanitizer that strips invalid placeholders before the call leaves Playground pushed GPT-5.2's search success from 0% to 100% overnight. The model wasn't bad at search; it had a tool-calling habit nobody had isolated yet.

Failed tool calls used to inflate conversion metrics. An earlier version of step detection counted a failed update_cart as a cart_created completion. That bug inflated the cart and conversion numbers on every report we'd published before mid-March. Fixed in 0.9.3 by gating step detection on the tool response's isError flag, plus the same gate on cart-data extraction. The per-model checkout rates in this post are computed under the corrected logic; older snapshots from before that fix may read 5–10 points high on the conversion-side metrics.

REST-only stores forced a transport rework. The v2026-04-08 spec drop in early April brought new tool names (search_catalog replacing search_shop_catalog), new response shapes (price as {amount, currency} objects, descriptions as {plain, html} objects), and a wave of WooCommerce stores that exposed REST-only endpoints rather than MCP. The 0.10.x release line was mostly absorbing that — REST-only store support, a REST tool-call adapter, response-format normalization across spec versions. Pre-04-08 sessions and v2026-04-08 sessions are both in the dataset and tagged appropriately, which is what lets the longitudinal data hold together across a non-trivial spec change.

The GPay token wall built ECP. In a February session, Claude Sonnet 4.5 reached ready_for_complete correctly — and stalled, because the merchant's checkout required a Google Pay payment token the agent couldn't produce. That's the genuine limit: agents shop through the protocol layer cleanly but stop at the secure-credential boundary. The Embedded Commerce Protocol shipped in 0.8.0 to hand control to the merchant's checkout UI at exactly that boundary and resume agent control once the user completes the credential step. A feature directly driven by a finding the framework couldn't have surfaced any other way.

A Playground session became a spec proposal. A live test against houseofparfum.nl exposed a different gap: an identity-linked buyer with a wallet balance hit the checkout, the OAuth flow completed cleanly, the buyer object came back populated — but the wallet was nowhere the agent could see it. payment.instruments was empty, the only declared handler (dev.ucp.delegate_payment) didn't accept the wallet, and the session escalated to the merchant's continue_url every time. Authenticated checkout was provably blocked, by spec. We wrote it up and submitted Proposal #358 to the UCP spec repository — payment.available_instruments, a per-buyer per-session list of usable payment methods (wallet, saved cards, loyalty, gift cards) resolved at runtime from the identity-linked session. Submitted by Benji Fisher (@appdrops) and co-authored with Almin Zolotic (@zologic) of UCPReady, who'd seen the same wall from the merchant side. Currently submitted to the UCP technical council for review. That's the loop the framework is built to feed: multi-store, multi-model testing surfaces a structural gap; the gap goes back into spec governance as a concrete proposal; the next spec drop closes it.

Methodology, briefly

Each session is a real frontier-model agent shopping run against a real UCP-enabled store, captured end-to-end via MCP tool calls. Sessions are initiated either through the public Playground UI (user-initiated, ad-hoc prompts) or through the Evals framework (scripted multi-turn sequences across pre-selected store/model matrices).

Outcomes are tagged at session close: checkout_reached (full transaction completion), cart_created (added items, didn't proceed), search_only (browsed, didn't add), failed (provider error, model refusal, or max-turn exceeded), or info_provided (informational query, no transactional intent).

Every session has a clickable replay link in its source ULID. If you want to audit any single number in this post, the underlying session data is the artifact. That's intentional — independent reproducibility is the point.

Try it

Three concrete next steps:

Run a benchmark against your own store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, and compare your store's per-model performance against the aggregate above.
See where individual models stand. Each model on the leaderboard has its own shopping profile with detailed performance data, known issues, and store-by-store breakdowns.
Compare two models head-to-head. The comparison view lets you pit any two models against each other on the same workload — useful before you commit to a primary model for a deployment.

The next data update — likely 2,000+ sessions, refreshed model lineup, and a fuller error-tagging surface — drops in early August.

UCP Requirements: What Your Store Needs Before Going Live

Benji Fisher — Mon, 04 May 2026 12:23:16 +0000

What do you need for UCP? There are two levels of UCP readiness. The first is the minimum viable manifest — the bare requirements to pass validation and appear in the UCP directory. The second is the agent-ready setup — what it actually takes for an AI agent to browse, cart, and check out at your store without friction.

Think of this as your UCP checklist — the minimum requirements plus the recommended prerequisites that separate stores agents can find from stores agents can actually shop. Most guides only cover the first level. This one covers both, grounded in data from 4,024 verified merchants and hundreds of agent testing sessions.

Minimum requirements (pass validation)

These are the fields required to produce a valid UCP manifest on the current v2026-04-08 spec:

1. A JSON file at /.well-known/ucp

The manifest must be publicly accessible at https://yourdomain.com/.well-known/ucp, served with Content-Type: application/json, and reachable without authentication.

Platform notes:

Shopify: handled automatically
WooCommerce: manual publish via plugin or custom route
BigCommerce: manual, served from storefront origin
Magento: manual, typically via custom module

Full publishing guide with code examples: /.well-known/ucp developer reference.

2. ucp.version (required)

A string identifying which spec version the manifest is written against. Current latest: "2026-04-08".

99.4% of verified stores are on this version. If you're starting fresh, use it. If you're on an older version, the spec update post walks through the migration.

3. ucp.services (required)

At least one service entry declaring a transport (mcp, rest, a2a, or embedded) and an endpoint URL. This tells agents where to send requests.

MCP is the dominant transport — ~100% of verified stores declare it. If you're building from scratch, start with MCP. See the transport comparison for the tradeoffs.

4. ucp.payment_handlers (required)

A map of payment handler namespaces. Can be an empty object {} if your store uses checkout-link redirects instead of tokenized payments (common on WooCommerce).

If you declare handlers, use reverse-domain namespaces like com.stripe.card or dev.shopify.card. See the payment handlers directory for examples.

5. signing_keys (required, at root level)

An array of JWK objects at the document root (not nested inside ucp). An empty array [] is valid if you're not signing payloads yet, but the key must be present.

This field moved from ucp.signing_keys to the root in v2026-04-08 — the most common validation warning we see is stores that still nest it.

Recommended setup (agent-ready)

Passing validation gets you into the directory. The requirements below determine whether agents can actually shop your store — the difference between a B+ grade and an A grade in our benchmarks.

6. Capabilities declaration

The ucp.capabilities field is optional per spec but strongly recommended. Without it, agents know your store exists but not what it can do.

Declare every capability you support:

checkout — 99.5% adoption across verified stores
cart — 99.1% adoption
catalog-search — required for product discovery
identity-linking — 3 stores, massive first-mover opportunity
payment — 0 stores, the frontier

Full list: capability registry.

7. Clean variant data

Variant mismatches are the #1 failure mode in agent shopping sessions. Every variant needs a stable ID, a clear name, and consistent representation across discovery and checkout. This is the single highest-impact fix you can make.

8. Responsive MCP endpoint

Latency matters. The average Shopify store responds in ~130ms. BigCommerce stores average ~890ms. Agents have timeout budgets — if your endpoint is slow, sessions drop silently. Target under 500ms for tool responses.

9. robots.txt allowing AI crawlers

Make sure /.well-known/ucp is explicitly allowed in your robots.txt. Some WAFs and CDN configurations block well-known paths by default. Check the common errors guide for the fix.

10. Supported_versions for backward compatibility

Declare supported_versions in your manifest listing both the current and previous spec version. This lets agents that haven't migrated yet still find a valid endpoint:

"supported_versions": {
    "2026-04-08": "https://yourstore.com/.well-known/ucp",
    "2026-01-23": "https://yourstore.com/.well-known/ucp/2026-01-23"
}

The UCP readiness checklist

Requirement	Required?	% of stores that have it
Manifest at /.well-known/ucp	Yes	100% (by definition)
ucp.version	Yes	100%
ucp.services with transport + endpoint	Yes	100%
ucp.payment_handlers	Yes	100%
signing_keys at root	Yes	~97% (rest have it nested)
ucp.capabilities	Recommended	~99% (Shopify default)
Clean variant data	Recommended	Unknown (runtime issue)
Latency < 500ms	Recommended	~95% (Shopify), ~30% (others)
robots.txt allows /.well-known/ucp	Recommended	~99%
supported_versions	Recommended	~70%

Validate your setup

Not sure if you pass? Start with Is My Store UCP Ready? — it walks through the full diagnostic in 60 seconds. Or jump straight to the tool:

Run a live check on your domain — it tests every requirement above in seconds. For runtime issues (variant mismatches, checkout failures), test with real agents in Playground. For ongoing monitoring, set up alerts.

Once you're verified, make sure your listing on UCP Registry is accurate — that's what agents see when deciding which stores to route customers to. And if you're a developer building agents rather than stores, the Build an Agent quickstart covers the other side of the equation.

Check your store now at UCPChecker.com. See how you compare: side-by-side store comparison. Platform guides: Shopify · WooCommerce · BigCommerce · Magento

AI Commerce Needs MLPerf — and Here's an Early Attempt

Benji Fisher — Fri, 01 May 2026 12:07:45 +0000

Validating a UCP manifest takes a second. Scoring it for agent-readiness takes another. Neither of those answers the harder question: when a real frontier agent — Claude or GPT or Gemini, picked by a user three weeks from now — walks up to your store with an ordinary shopping prompt, does it actually complete a checkout? Compared to the next implementation? Across the models people are actually using?

Today there's no shared way to find out. AI commerce has the same coordination problem ML had before MLPerf, web performance had before Lighthouse, and coding models had before HumanEval — and the cost of not solving it is the same: every claim a vendor makes about agent-readiness is currently unverifiable by anyone outside that vendor.

This post is about what we've been building to close that gap.

The pre-benchmark moment

Every category that grew up around AI has gone through a pre-benchmark moment.

Machine learning before MLPerf was a pile of vendor-flavoured numbers. NVIDIA reported one set of throughput claims, Google another, AMD a third — and none of it was directly comparable, because nobody was running the same workload, on the same input, on the same harness. MLPerf — submitted to, run by, and audited across the whole industry — fixed that. Buyers could finally compare. The category matured.

Web performance before Lighthouse was the same. "Fast website" was vibes. PageSpeed Insights gave one number, WebPageTest another, internal RUM dashboards a third. Lighthouse — graded, reproducible, open — fixed it. Today nobody ships a serious site without checking their score.

Coding models before HumanEval were even worse. Every lab benchmarked against its own preferred problems and reported its own preferred metrics. HumanEval, then MBPP, then SWE-bench, then LiveCodeBench, gave the field a shared evaluation surface. Comparisons stopped being marketing.

Agentic commerce is in exactly the place those categories were before their benchmarks landed. The standard has converged — UCP is the open spec the industry is building against, and the public directory tracks 4,500+ verified stores. Major retailers and platforms ship UCP implementations almost weekly. The recent tech council expansion brings in most of the rest. But there is still no neutral, reproducible way to evaluate how well any of those implementations actually work when a real frontier agent tries to shop them.

You can't get this from inside a vendor. Shopify cannot credibly benchmark Shopify stores. OpenAI cannot credibly benchmark OpenAI agents. Even when their numbers are honest, the methodology is theirs, the test conditions favour their stack, and nobody else can rerun it. AI commerce has the same coordination problem ML had before MLPerf, and it solves the same way: a shared evaluation layer, run by a third party, that anyone can audit and reproduce.

Agentic commerce can't mature without that layer. We've built a first credible attempt at one.

What UCP Playground Evals does

UCP Playground Evals is a benchmark framework for agentic commerce. You define a multi-turn shopping conversation, pick the stores and the models you want to evaluate against it, and get back a structured comparison report — funnel matrix, per-session token and duration metrics, error classification, replayable session links, downloadable PDF.

The point isn't the report format. The point is the three properties underneath, because those determine whether a benchmark is worth trusting.

1. Standardised, multi-turn sequences

Agentic commerce is conversational, not single-prompt. A real shopping session looks like "Show me products under $60" → "Add both to my cart" → "Proceed to checkout", with full context carried across turns. That's the unit an eval has to operate on.

Each eval is a scripted sequence of turns. Every turn gets its own orchestrator round (up to 8 internal tool-calling sub-turns) and the full conversation history is preserved across the sequence — so the agent's choices on T2 are conditioned on what it actually saw on T1, the way real user behaviour conditions on real responses. Four collections ship today: Browse & Buy (4 turns, generic shopping journey), Multi-Item (3 turns, multi-product cart composition and checkout), Price Constrained (3 turns, budget-anchored reasoning across a single purchase), and Custom for user-defined sequences.

2. Cross-store comparability

The sequences are intentionally generic. Not "Find Nike Air Max 90 in size 10" but "Show me products under $60". That distinction is load-bearing: it's what makes the same test valid against any store running UCP, and it's what makes results from one store directly comparable to results from another. Without it, every benchmark is apples-to-oranges and nothing aggregates.

The eval runner discovers MCP endpoints automatically from each store's /.well-known/ucp manifest, so any UCP-conformant store works without per-store wiring — Shopify, WooCommerce, BigCommerce, Magento, PrestaShop, and Custom & Headless stacks all work the same way.

3. Multi-model coverage

The same sequence runs against any of 15 frontier models currently wired up — every major lab, plus a reasoning-tuned subset:

Model	Provider	Type
Claude Opus 4.6	Anthropic	Frontier
Claude Sonnet 4.5	Anthropic	Frontier
GPT-5.2	OpenAI	Frontier
GPT-4o	OpenAI	Frontier
Gemini 3.1 Pro	Google	Frontier
Gemini 3 Flash	Google	Frontier
Gemini 2.5 Pro	Google	Frontier
Gemini 2.5 Flash	Google	Frontier
Grok 4	xAI	Frontier
DeepSeek V3.2	DeepSeek	Frontier
Llama 3.3 70B	Meta	Frontier
DeepSeek R1	DeepSeek	Reasoning
QwQ 32B	Alibaba	Reasoning
Grok 3 Mini	xAI	Reasoning
o4-mini	OpenAI	Reasoning

The model is part of the test matrix. Same store, different models, same sequence — directly comparable behaviour, with model-level differences surfaced rather than averaged away. Any two can also be compared side-by-side outside the eval framework, on the same workload.

The math is straightforward

stores × models × sequences = sessions. Two stores × two models × one sequence = four sessions. Each one is a full agent shopping run, captured end-to-end, replayable, and rolled up into the report.

Standardised, reproducible, vendor-neutral. The three properties that make a benchmark worth trusting. Everything else in the framework is built to defend those three.

What the framework actually surfaces

The clearest way to show what evals do is to walk through one. Below is a multi-item checkout report we ran across two stores and two Gemini models in March:

Download the full multi-item checkout report (PDF) →

Two-page report covering the funnel comparison matrix, per-session performance breakdown, evaluator configuration, auto-generated recommendations, and clickable session-replay IDs for every run.

Two stores (oakywood.shop, ugmonk.com). Two models (Gemini 3 Flash, Gemini 3.1 Pro). One sequence (multi-item checkout: search → add → checkout). Four sessions total. The headline numbers:

100% checkout rate across all four sessions
95,513 average tokens per session
48.3s average duration
0 errors across the matrix

That's the boring summary. The interesting parts are in the per-session table.

Store	Model	Tokens	Duration	Turns	Cart value
oakywood.shop	Gemini 3.1 Pro	85,614	93.4s	7	EUR 82.75
oakywood.shop	Gemini 3 Flash	154,294	34.7s	12	—
ugmonk.com	Gemini 3.1 Pro	46,084	35.1s	6	USD 77.00
ugmonk.com	Gemini 3 Flash	96,058	29.9s	11	—

Same sequence, same stores, two models. Gemini 3.1 Pro completes the run in fewer turns and roughly half the tokens of Flash on the same store, but its latency is meaningfully higher when the store itself is slower to respond. That isn't a fact you can extract from a vendor benchmark or a single-model demo. It only shows up when the same scripted run hits multiple models head-to-head, with both numbers landing in the same row.

The auto-generated recommendations point at where the real engineering work is, and they're grounded in the actual run data:

Average token usage is 95,513 — above the 40K baseline. Product descriptions may be inflating context. Consider truncating descriptions in MCP responses.

Average session duration is 48.3s — above the 15s target. Optimise MCP endpoint response times, especially initial search calls.

Those are concrete merchandising actions. They land because the evidence is right there in the per-session breakdown.

The deeper signal shows up across runs against richer stores. In a separate eval against a single shop, two models picked different variant IDs for "Medium" — one mapped Medium to one variant ID, the other to a different one, and neither is provably correct because the store doesn't expose a human-readable size axis in its variant data. That isn't a bug in either model. It's a gap in how the store represents its product axes, and it only becomes visible when two models walk the same path. This is the kind of behavioural divergence between frontier models that evals surface — and that vendor-internal benchmarks can't credibly report.

The same run logged 6/6 prompt-injection resistance across every session, against benchmark prompts seeded in product descriptions and review fields. Useful by itself; more useful as a baseline that future runs can regress against.

What's on the evals roadmap

This is v1. A few things on the roadmap, in priority order.

More eval collections. The four built-in sequences cover the core shopping flow. The next batch is more diagnostic: single-item flow (the simplest path), variant selection accuracy (the size-label gap above, formalised), prompt-injection resistance (already running, becoming its own collection), escalation handling (requires_escalation compliance), attribution accuracy (UTM and referrer handling at checkout hand-off), return policy surfacing.

Public benchmark leaderboards. Same pattern as the UCP Score leaderboard — by-store and by-model rankings against the standard sequences, refreshed on schedule, indexed and shareable. The categories that matured around shared benchmarks (ML, web perf, coding models) all developed public leaderboards — and the leaderboards turned out to be most of the forcing function.

Headless API and CI/CD integration. Already shipped. The full automation surface:

POST /api/v1/collections          — create
POST /api/v1/collections/{id}/run — trigger
GET  /api/v1/collection-runs/{id} — poll status + results
GET  /api/v1/collection-runs/{id}/pdf — download report

The first integration we expect anyone to ship is a deploy-time check: trigger an eval after every UCP manifest deploy, assert checkout_rate >= 80, errors.total == 0, avg_duration_ms < 30000, fail the build otherwise. Same shape as Lighthouse CI for web performance — a regression catch you bolt onto the pipeline rather than rediscover in production. Full developer documentation — authentication, rate limits, and a worked GitHub Actions example — lives at ucpchecker.com/developer-tools, alongside the rest of the public API surface.

Scheduled runs and version tracking. Also shipped. Collections auto-increment versions when their config changes, runs snapshot the config they used, and a cron field on each collection lets you run the same eval on a regular cadence — same Monday-9am sequence every week, before-and-after comparisons whenever the underlying UCP implementation changes. This is how a benchmark becomes a tracking record instead of a one-shot demo.

Cloning and team scoping. Public collections can be cloned into any team workspace; quotas are scoped per team. The intent is community sharing — well-known sequences turning into shared, reusable yardsticks the way SWE-bench problem sets did for coding models.

How evals fit the broader development cycle

Evals don't sit alone. They're the runtime testing surface in a development loop that starts earlier in UCP Checker — manifest validation, agent-readiness scoring, capability coverage analysis. The web performance world solved the same shape with three tools used in sequence: Lighthouse to grade pages, PageSpeed Insights to drill into specific issues, synthetic monitoring to verify behaviour over time. UCP implementations follow the same arc: validate the manifest at /check, score it against agent-readiness criteria with the UCP Score, then run evals against it to see how it actually behaves when a real frontier agent shops it.

Each tool surfaces something different. Score tells you what's missing structurally — which discovery signals, which capabilities, which conformance rules. Check confirms the manifest validates after fixes land. Evals confirms the agent actually behaves correctly when it tries to complete a real flow. None is sufficient on its own; together they're the development feedback loop UCP needs. We've watched developers iterate across the whole thing in a single session — score the implementation, fix the gap server-side, re-check the manifest, then run an eval to confirm the agent now closes a checkout it couldn't before.

If you're starting from zero on a UCP implementation, the natural sequence is: get a Score first to see what's missing, fix the highest-impact issues, run a Check to confirm the manifest validates cleanly, then run Evals to confirm real agents complete the flows you care about. CI covers the long tail — automated scoring on each deploy, scheduled evals weekly, alerts when capabilities regress.

Methodology and verification

Three properties separate a credible benchmark from a marketing claim. UCP Playground Evals are designed around all three.

Every result links to a replayable session. Each eval session generates the same agent_sessions data the public Playground UI produces — full tool-call timeline, model responses, token-by-token event stream, every retrieved page. The session IDs in any report are clickable. Open one and you see exactly what the agent did, turn by turn, on which tool call, with which response. The sample report above lists four such IDs (e.g. 01KMJZM5MG2CA4QN5M983H19E1) and each resolves to a full replay at ucpplayground.com/sessions/{id}. This isn't a marketing claim; it's a verifiable test you can audit.

Every collection is versioned. When the configuration of a collection changes — turns added, models swapped, store list updated — the version increments and every run snapshots the config it ran against. Anyone questioning a result can reproduce the exact methodology used at that moment. The PDF report itself prints the collection version at the bottom of every page; the sample above is Collection v3. Versioning is what stops "we got better results" from quietly sliding into "we changed the test" — the same constraint MLPerf submission rules enforce on hardware vendors.

The methodology is open. The framework configuration shape is documented — the turns, the orchestrator loop, the stop conditions, the success metrics, the PDF schema. Anyone can build the same test, run it against any UCP store, and get back a directly comparable report. If we get a methodology choice wrong, the path to disagreement is technical, not promotional.

That's the credibility floor. Everything else in the product builds on it.

About UCP Checker and UCP Playground

UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest in the open web, run the merchant directory and the UCP Score, publish the leaderboard and adoption stats, and ship developer tools — the validator, bulk checker, browser extension, public dataset, and a public REST API. The whole dataset is open, indexed, and ungated.

UCP Playground is the agent shopping layer that sits next to it — same data model, same /.well-known/ucp discovery, same replayable session format. UCP Playground Evals is the benchmark surface on top of that. Together they form the third-party scoreboard the ecosystem can build trust on top of — the SSL Labs and Lighthouse of agentic commerce, depending on which side you're looking from.

Try it

The interesting eval gaps are the ones nobody's tested yet. If a result surprises you — your own store, a competitor's, a model you assumed was a clear winner that turns out not to be — let us know.

Three concrete next steps:

Run an eval against your own UCP store. Create a collection at ucpplayground.com/evals, pick a sequence, pick two models, run it. The four-session example above is the shape most first runs take.
Read a public eval report. Sample reports are linked from the framework page. Each has clickable session IDs you can replay end-to-end.
Wire it into CI. The developer tools page covers authentication, rate limits, and a GitHub Actions worked example. The assertion shape is the same one Lighthouse CI uses for web performance — checkout_rate, errors.total, avg_duration_ms instead of LCP and TBT.

Is My Store UCP Ready? How to Check in 60 Seconds

Benji Fisher — Thu, 30 Apr 2026 10:25:51 +0000

The short answer: enter your domain here and you'll know in under 60 seconds. This UCP ready check runs the same validation that AI agents use to decide whether your store is worth shopping.

The longer answer — what "UCP ready" actually means, why it matters, and what to do about the result — is what this post covers.

What UCP readiness means

A store is "UCP ready" when it publishes a valid manifest at /.well-known/ucp that AI shopping agents can discover, parse, and act on. That's the technical definition.

In practice, there are three levels:

Level 1: Verified

Your manifest exists, returns valid JSON, and passes schema validation against the current v2026-04-08 spec. You appear in the UCP directory. Agents can find you.

As of this month, 4,024 stores are at this level.

Level 2: Agent-functional

Agents can actually shop your store — not just discover it. Your MCP endpoint responds, your product data is clean, your checkout flow completes without errors. You score B+ or higher on the Playground leaderboard.

422 stores are at this level. The gap between "verified" and "agent-functional" is where most common errors live.

Level 3: Optimized

Agents complete purchases reliably across multiple models. Your variant data is clean, your latency is low, your capabilities go beyond the defaults. You score A. Only 9 stores are here today.

The UCP requirements checklist breaks down exactly what each level requires.

How to check your store

Step 1: Run the checker

Go to UCPChecker.com/check and enter your domain. When you check your UCP status, the checker will:

Fetch /.well-known/ucp from your domain
Validate the JSON against the current spec
Check your robots.txt for AI bot policies
Inventory your declared capabilities, transports, and payment handlers
Verify your UCP compliance and report every error and warning with specific error codes

The whole process takes about 1 second. You'll get a full diagnostic report on your status page.

Step 2: Read the result

Verified (green) — your manifest is valid. You're in the directory. Agents can find you. Check the warnings section for things to improve.

Invalid (amber) — your manifest exists but fails validation. The diagnostic panel shows exactly which fields are wrong or missing. Most invalid manifests are one fix away from passing — usually a missing required field or a misplaced signing_keys.

Not Detected (grey) — no manifest found at /.well-known/ucp. Your store isn't UCP ready yet. See the requirements post for what to publish.

Blocked (orange) — your robots.txt or firewall is preventing access to the manifest. The diagnostic will tell you whether it's a robots.txt rule or an HTTP-level block.

Step 3: Fix what's broken

The checker tells you what is wrong. Here's where to go for how to fix it:

Platform-specific guides: Shopify · WooCommerce · BigCommerce · Magento
Manifest reference: /.well-known/ucp developer guide
Error-by-error fixes: Common UCP errors
Spec changes: v2026-04-08 update

Step 4: Test with real agents

Schema validation tells you if your manifest is syntactically correct. It tells you nothing about whether an agent can actually buy something from your store. For that, you need UCP Playground — it runs real AI agent sessions against your store and shows you exactly where the flow breaks.

The agent testing data shows that the most common runtime failure is variant mismatches — clean product data matters more than perfect schema.

Step 5: Monitor

Your UCP endpoint is a live API. Platform updates, catalog changes, and CDN reconfigurations can break it silently. Set up UCP Alerts to get emailed the moment your status changes — before agents notice.

How you compare

Once you're verified, see how your store stacks up:

Compare side-by-side with a competitor or partner store — capabilities, transports, payment handlers, latency.
Browse your platform — see all verified Shopify, WooCommerce, BigCommerce, or Magento stores ranked by capability depth.
Check the leaderboard — stores graded A through F on real agent shopping performance.

Why this matters now

UCP adoption is accelerating. 1,400+ new merchants were discovered in April alone. Shopify migrated its entire fleet to the latest spec in four days. BigCommerce, WooCommerce, and Magento stores are appearing every week.

Am I UCP ready? The question isn't whether your store will need UCP. It's whether you'll be ready when agents start shopping — and they already are.

Before you check, it helps to understand the building blocks: capabilities define what your store can do for agents, payment handlers define how agents pay, transports define how agents connect, and product discovery is the flow agents actually run when they shop.

Make sure your listing on UCP Registry is accurate once you're verified — that's how agents find you in the first place.

Check your store now →

Build your own agent: developer quickstart. Understand the protocol stack: MCP vs UCP vs AP2. Monthly ecosystem data: State of Agentic Commerce.

Introducing the UCP Score: A 0–100 Agent-Readiness Grade for Every UCP Store

Benji Fisher — Wed, 29 Apr 2026 09:41:44 +0000

After every status check on UCPChecker, the same follow-up question lands in our inbox: "OK, my manifest is verified. But is it actually any good?"

That question comes from everywhere. Engineering leads who shipped a manifest last quarter and want to know if it would actually carry an agent through checkout. Platform teams pitching agent-readiness to merchants who need a number, not a status pill. Analysts trying to chart "how Shopify compares to WooCommerce" and finding that "verified" tells them next to nothing. Developers picking which UCP store to integrate with first. AI agent builders deciding whose endpoints to feature in demo flows. Store owners benchmarking against direct competitors before a quarterly review.

None of these audiences really care that a manifest exists. They care about how good it is. Whether it has the surface signals that keep AI shopping agents finding it. Whether the declared transports actually respond when you call them. Whether the spec and schema URLs in the manifest resolve, or quietly 404 the moment a strict agent tries to validate the response shape. The interesting answer is always graded.

Until today, the only way to answer that question on UCPChecker was to read every line of the validator output and squint. So we built the thing people were already trying to do manually.

Get a UCP Score for any domain at ucpchecker.com/score →

What the UCP Score is

A 0–100 composite grade that measures how agent-ready any UCP store actually is. Not "does the manifest exist" — that's the status page. How well does it work for agents.

The score maps to a single letter grade you can share, embed, or watch over time. Bands are deliberately calibrated to match Lighthouse and SSL Labs — A is meant to be hard to earn:

A (85–100) — Agent-ready. Valid manifest, strong discovery, broad capability coverage.
B (70–84) — Solid. Minor gaps or one weak category, agents can still transact.
C (50–69) — Partial. Manifest works but missing capabilities or surface signals.
D (30–49) — Weak. Manifest reachable but invalid or near-empty.
F (0–29) — Failing. Blocked, unreachable, or no manifest detected.

Every score breaks down into three weighted categories so you can see exactly where the points come from:

Agent Discovery (30%) — Can agents find and reach you? HTTPS, reachability, agent-friendly robots.txt, plus the surface signals that keep you in the conversation: /llms.txt, sitemap.xml, Open Graph tags, Organization JSON-LD, mobile viewport meta.
UCP Conformance (40%) — Does the manifest validate against the spec? Validity is 3× weighted in this category — an invalid manifest cannot score above ~50 here, regardless of how good the surface polish is.
Capability Coverage (30%) — What can an agent actually do at your store? Declared transports (REST/MCP/A2A), checkout, payment handlers, and breadth of capabilities. When functional probes run, declared transport endpoints that don't actually respond drag this score down.

The composite is a straight weighted average: Discovery × 0.30 + Conformance × 0.40 + Capabilities × 0.30. No tricks, no hidden weights. The full ruleset is documented in our methodology.

What you actually get

Every score URL is a live page at /score/{your-domain}, indexed and shareable. Open one and you don't just see a number:

Top priorities — The three highest-impact issues we found, ranked by impact × effort. Start here.
Impact vs Effort matrix — Quick Wins / Strategic / Incremental / Consider Later quadrants so you can plan a sprint instead of staring at a wall of warnings.
Recommendations with copy-paste fixes — Every flagged issue surfaces a snippet you can drop straight into your manifest, robots.txt, sitemap, or HTML <head>. Hit "Show fix", copy, paste, redeploy, re-check.
Platform-aware percentile — "You're at p72 latency vs the median Shopify store." Because comparing your latency against the whole directory is meaningless when half of it runs on a fundamentally different infrastructure profile.
Full check breakdown — Every signal we evaluate, grouped by category, with a "why it matters" paragraph alongside each check. No black boxes.
Save this report — We re-run the full check weekly and email you only when something material changes. Score drops, capability regresses, status flips. Free, no marketing, unsubscribe anytime.

The page is ungated. No signup, no paywall, no "create an account to see the breakdown." We're indexing every score — just like SSL Labs grades and PageSpeed scores. Public scores create a baseline and pressure for the ecosystem to improve, in the same way SSL grades did for HTTPS adoption.

Why we built it

The honest answer: verified-or-not is the wrong question now.

When the UCP spec first landed in January (v2026-01-11), finding a verified store at all was novel. The bar was "did anyone publish a manifest." The status page was the right product for that moment, and it still is for the discovery layer.

The directory has 4,500+ verified domains today. Verified isn't novel. The interesting question shifted to "how well does this thing actually work for agents," and nobody had a good answer to that — including us.

When we ran a deeper analysis for our April State of Agentic Commerce report, the gap was stark: out of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent experience. A 0.2% flawless rate. The other 99.8% had a manifest published — they just didn't actually work as well as that manifest suggested. That gap between "verified" and "actually works" is the central infrastructure problem in agentic commerce today. The UCP Score makes that gap visible, measurable, and addressable.

There's a clear analogue: PageSpeed before Lighthouse. Pre-Lighthouse, web performance optimisation was vibes. People knew slow sites were bad and fast sites were good but couldn't quantify "how slow" or "compared to what." Lighthouse gave them three things — a graded score, a category breakdown, and copy-paste optimisations — and the field changed overnight. Nobody ships a serious site today without checking their Lighthouse score first.

The agentic commerce ecosystem is at exactly that pre-Lighthouse moment. There's no shared yardstick for agent-readiness. Stores have no way to tell whether the integration they shipped last month is competitive. Platform teams have no way to back up "our merchants are more agent-ready" with a number. AI agent builders have no way to filter "show me the stores most likely to actually complete a transaction."

The UCP Score is meant to be that yardstick. Lighthouse for agentic commerce.

How we built it (the short version)

Three signal sources, one composite:

Static analysis — The same manifest validator that powers /check and /ucp-validator. Validity, version format, signing keys, payment handlers — every spec rule turned into a check row.
Surface signals — Five public files and meta tags fetched in parallel: /llms.txt, /sitemap.xml, Open Graph, Organization JSON-LD, viewport. Presence + content captured (with a content hash for change detection on llms.txt so we can spot when a brand updates their LLM brief).
Functional probes (opt-in) — Two probe families. Transport probes hit each declared transport endpoint with a benign request (MCP gets a tools/list, REST/A2A get a GET). URL resolution probes fetch every spec and schema URL declared in the manifest. Probes only run on user-triggered checks — not on the 24h cron sweep, because hammering 4,500 merchants daily with a dozen extra HTTP requests each isn't neighbourly.

Each signal feeds one category sub-score (0–100), and the composite is the weighted average. Recommendations join error codes against a fix library so every flagged issue surfaces a copy-paste snippet — the same pattern Lighthouse uses for its audit list. The whole pipeline runs on the same 24h cycle as the rest of the directory; checks you trigger manually run the full probe stack.

If you want the deep version, the methodology page walks through every category, every check, every grade band, and the "what we don't score" list.

What you can do with it

A few workflows the score unlocks immediately:

Pre-merge gate — Add a check in your CI that fails the build if your /score/{domain} drops below B. Same pattern as Lighthouse CI. The score URL is stable and the JSON breakdown lands in the API soon.
Platform comparison — The /platforms page now shows average UCP Score by platform — Shopify vs WooCommerce vs BigCommerce vs Magento at a glance. Useful both for picking a stack and for benchmarking the one you're on.
Leaderboard — The leaderboard is now ranked by UCP Score with sortable columns for each sub-score. Filter by platform to see the top stores on your stack.
Monitoring — Save any report against your email. We re-run it weekly and alert you on regressions. Score drops, capability disappears, status flips — one email, free, no marketing.
Competitive benchmarking — Run Allbirds vs Casper and see grades side by side. The compare page picks up score data automatically.

What's next

This is v1. A few things already on the roadmap:

Score history & sparkline — Save a report and you'll see your score trend over time. We're tracking every check in our history table from day one, so the data exists; the visual lands shortly.
Score API — GET /api/v1/score/{domain} returning the full breakdown as JSON. The data feed is already public; the score endpoint is the same data behind a stable contract.
Spec-version-aware scoring weights — As new UCP spec versions land with new emphasis, scoring rules for each version live in config and absorb cleanly. Already version-aware for validation; widening to scoring weights too.

We've also taken pains to make the system absorb future spec releases without a rewrite. Static check copy lives in config, not hardcoded; new error codes plug into the recommendations engine via a single config entry. The next spec drop should land as a configuration change, not a refactor.

About UCP Checker

UCP Checker is the independent validation and monitoring layer for the Universal Commerce Protocol. We crawl, validate, and grade every public UCP manifest in the open web, run the public merchant directory, publish the leaderboard and adoption stats, and ship developer tools — the validator, the bulk checker, the browser extension, and now the UCP Score. Everything is free, indexed, and ungated; the dataset is published openly under CC-BY 4.0. Think of us as the SSL Labs of agentic commerce — the third-party scoreboard the ecosystem can build trust on top of.

Try it

Pick any domain. Type it into ucpchecker.com/score and you'll have a graded report in under a second. If you find a score that surprised you — yours or a competitor's — let us know. The interesting score gaps are the ones nobody's looked at yet.

Get a score: ucpchecker.com/score
See the leaderboard: ucpchecker.com/leaderboard
How it's calculated: ucpchecker.com/methodology
Compare two stores: ucpchecker.com/compare
Track adoption live: ucpchecker.com/stats
Get notified on changes: ucpchecker.com/alerts

UCP Tech Council Expands: What the Meeting Minutes Tell Us About Where the Protocol Is Heading

Benji Fisher — Sun, 26 Apr 2026 21:37:00 +0000

On Friday just gone, five of the largest technology companies in the world quietly joined the governing body of the Universal Commerce Protocol. No press release. No blog post. Just a commit to MAINTAINERS.md in the spec repository.

Amazon. Meta. Microsoft. Salesforce. Stripe. All now have seats on the UCP Tech Council — the body that reviews, debates, and approves every change to the protocol that AI shopping agents use to buy things.

We know this because we read the meeting minutes. Every week, the TC meets to debate spec changes, vote on PRs, and argue about how agent commerce should work. Most people in the industry don't read these minutes. We do — and what they reveal about where UCP is heading is more interesting than any announcement.

This is what the minutes tell us.

The expansion: who joined and why it matters

The Tech Council grew from roughly 12 seats to 16 members across 8 companies:

Company	Representatives	Role
Google	4 seats	Founding sponsor, spec steward
Shopify	4 seats (incl. 2 new)	Largest platform implementer
Amazon	Greg Smith (new)	The world's largest online retailer
Meta	James Andersen (new)	Social commerce, Instagram Shopping
Microsoft	Patrick Jordan (new)	Copilot, enterprise commerce
Stripe	Prasad Wangikar (new)	Payment infrastructure
Salesforce	Scot DeDeo (new)	Commerce Cloud, enterprise retail
Etsy	Imran Hoosain	Marketplace commerce
Target	Maxime Najim	Enterprise retail
Wayfair	Naga Malepati	Furniture/home goods

This isn't ceremonial. The TC has binding authority over spec changes — every PR that ships in a UCP release has been reviewed and voted on by this group. When Amazon and Stripe join that table, it changes what gets prioritised, what gets debated, and ultimately what the protocol becomes.

The meeting minutes from March 13 first mentioned the election process: seats rotating every six months, with growing partner interest. By March 27, six nominations had been received. The final review was scheduled for April 10. The MAINTAINERS.md update landed April 24.

The new members are already contributing. James Andersen (Meta) submitted PR #367 on April 17 — a documentation PR clarifying network token usage and PCI scope in card credentials. Patrick Jordan (Microsoft) contributed documentation accuracy fixes the same day. These aren't advisory seats. They're engineering seats.

What the meeting minutes actually say

We reviewed the six TC meetings from March 6 through April 17. Here's what's being debated, decided, and built — translated for a merchant audience.

Identity linking is the top priority — and it's hard

The single most discussed topic across all six meetings is identity linking — how an agent knows who the customer is across sessions, stores, and platforms.

The April 17 minutes show an active debate about OAuth 2.0 scope design: nested scopes vs flat scopes vs config maps. The TC favoured flat. PR #354 implements OAuth 2.0 as the foundation for identity linking with capability-driven scopes.

Why this matters for merchants: Identity linking is the missing piece that would let an agent complete a purchase without a checkout-page handoff. Right now, agents can browse and cart — but paying requires redirecting the customer to a human checkout flow. Identity linking + payment handlers would close that loop. Until then, agents rely on the transport layer to reach the store and the manifest endpoint for discovery. Our April state-of-commerce report showed only 3 stores out of 4,024 currently declare identity linking capability. The spec work happening now is what will eventually bring that number up.

Loyalty is being trimmed to ship faster

The TC has been debating loyalty schemas since March. PR #340 implements a loyalty extension for the checkout capability. The April 10 minutes note that the extension is being "trimmed to baseline use cases" — a pragmatic decision to ship something that works for simple loyalty programs now, rather than waiting for a comprehensive solution that handles every edge case.

Why this matters: If your store has a loyalty or rewards program, the spec is building the infrastructure for agents to verify loyalty status and redeem points as part of the checkout flow. This is early — don't build against it yet — but understand that it's coming and it's being shaped by people at Google, Shopify, Etsy, and Target who run real loyalty programs.

Local commerce is on the roadmap

The April 3 minutes list Q2 priorities. Among them: local commerce. PR #375 proposes store-based local inventory and fulfilment options — the infrastructure an agent would need to answer "is this product available at a store near me?"

This is Target and Wayfair territory. Both have TC seats. Both have store networks. The fact that local commerce is a Q2 priority with retail representation on the council suggests it's not theoretical.

Returns are "incredibly complicated"

The April 17 minutes include the most honest assessment we've seen in any spec discussion: returns are acknowledged as an "incredibly complicated domain." This is refreshing. Most protocol specs pretend returns are simple. UCP's TC is saying out loud that they're not, and that getting them right will take time.

PR #257 from the February cycle introduced a returns extension. It's still in review. The complexity is in modelling return windows, refund methods, partial returns, and eligibility rules — all of which vary by merchant, product, and jurisdiction.

Why this matters: Don't expect agent-managed returns in 2026. But understand that the protocol is building toward it, and the merchants who implement return policies as structured data (not just PDF links) will be ahead when it ships.

The spec itself just shipped its biggest release ever

v2026-04-08 landed with 60+ merged PRs — the largest release since the protocol launched. Key additions:

Cart capability — basket building for agents, a prerequisite for multi-item flows
Catalog search + lookup — formalised product discovery as a spec capability
Request/response signing — cryptographic integrity for agent-store communication
Error handling overhaul — first-class errors, business logic error types
Eligibility claims — for loyalty, membership, and verification-gated pricing
Discount extension to cart — discounts now apply pre-checkout, not just at checkout
Risk signals — authorization and abuse metadata for fraud prevention

Our crawler showed Shopify migrating its entire fleet to v2026-04-08 in four days. 99.4% of verified stores are now on the latest spec.

What this means for you

If you're a merchant

The governance expansion doesn't change what you need to do today. Your UCP requirements are the same: valid manifest, declared capabilities, clean variant data. Check your store, fix any common errors, compare against competitors, and set up alerts so you know if anything breaks.

What it does change is the timeline and the confidence. When Amazon, Microsoft, and Salesforce have engineering seats on the governing body, the protocol is not going away. If you've been waiting for a signal that UCP is "real enough" to invest in — five of the ten largest technology companies joining the TC in a single commit is that signal.

If you're a platform

If you run Shopify, you're covered — platform-level UCP support is mature. If you run BigCommerce, WooCommerce, Magento, or a custom stack, watch the identity linking and loyalty PRs. These are the capabilities that will differentiate agent-ready platforms from agent-compatible ones in H2 2026.

Salesforce Commerce Cloud now has a seat at the table. If you're on SFCC, this is the clearest signal yet that platform-level UCP support is coming. Our April report noted that we've already seen SFCC engineering work in progress.

If you're building agents

The Build an Agent quickstart still works — the protocol surface you're building against is stable. But start tracking the identity linking PRs. When that capability ships, the agent flow goes from "browse + cart + redirect to checkout" to "browse + cart + pay" — end-to-end autonomous purchasing. That's the step change.

Check the store leaderboard to find the highest-performing targets, understand how product discovery works, and test your agent against real stores in UCP Playground and use UCP Registry for production discovery. Both will surface the new capabilities as they ship.

The reading list

For anyone who wants to follow the protocol's evolution themselves:

Meeting minutes: github.com/Universal-Commerce-Protocol/meeting-minutes
Spec repo: github.com/Universal-Commerce-Protocol/ucp
v2026-04-08 release notes: github.com/Universal-Commerce-Protocol/ucp/releases/tag/v2026-04-08
MAINTAINERS.md: github.com/Universal-Commerce-Protocol/ucp/blob/main/MAINTAINERS.md
Active PRs: github.com/Universal-Commerce-Protocol/ucp/pulls

We'll continue monitoring the spec, the TC minutes, and the 4,500+ merchants building on the protocol. If any of the Q2 priorities (identity, loyalty, local commerce) ship in spec form, we'll cover them in the May state-of-commerce report.

Check your store's UCP status at UCPChecker.com. Browse verified stores at UCPRegistry.com. Test agent performance at UCPPlayground.com. Read the full protocol stack: MCP vs UCP vs AP2.

Agentic Commerce Optimization: What 4,491 Merchants Reveal About UCP Readiness

Benji Fisher — Thu, 23 Apr 2026 11:53:42 +0000

Agentic Commerce Optimization: What 4,491 Merchants Reveal About UCP Readiness

Every UCP technical guide tells you how to get UCP ready. We decided to measure who actually is.

Since UCP launched, UCP Checker has tracked 4,491 merchants — 4,024 of which are verified and actively serving UCP endpoints. We maintain the largest UCP index of live merchant implementations, and the data tells a story that no theoretical guide can. We've run over 1k agent testing sessions in UCP Playground, consumed 43 million tokens doing it, and watched real AI agents attempt to browse, cart, and buy products across every major ecommerce platform. The result isn't a theoretical framework for agentic commerce optimization. It's a field report.

And the field looks very different from what the guides tell you.

What "Agentic Commerce Optimization" Actually Means When You Have Data

The term "agentic commerce optimization" — or ACO — has entered the SEO lexicon as a catch-all for making your store ready for AI-powered shopping agents. Most of the early writing treats it like a checklist: add Schema.org markup, update your Merchant Center feed, structure your product data. That advice isn't wrong. It's just incomplete, because it's built on assumptions about how agents will behave rather than observations of how they actually do.

ACO, measured empirically, is the practice of optimizing your ecommerce stack for the specific patterns that AI agents exhibit when they interact with UCP endpoints. Those patterns are surprising. Agents don't browse the way humans do. They don't use carts the way humans do. And the failure modes that block them from completing purchases are not the ones you'd predict from reading the spec alone.

The data we've collected across 4,024 verified UCP merchants tells a concrete story about what matters, what doesn't, and where the real optimization opportunities are hiding.

The Real State of UCP Readiness

Let's start with what's working. Of the 4,024 verified merchants in UCP Registry — the open UCP directory where agents discover merchants — capability adoption breaks down like this:

Checkout: 4,003 merchants (99.5%)
Cart: 3,987 merchants (99.1%)
Product discovery: Near-universal
Identity: 3 merchants
Payment: 0 merchants

Read those last two numbers again. Three merchants support identity. Zero support native payment. This is the defining feature of UCP's current state: the bottom of the funnel is wide open, but the capabilities that would make agentic commerce truly autonomous — knowing who the customer is and processing payment without a handoff — are functionally nonexistent.

The spec migration numbers are more encouraging. When the v2026-04-08 specification dropped, 3,994 out of 4,022 tracked merchants had migrated within four days. That's a 99.3% adoption rate in under a week, which speaks to the platform-driven nature of UCP rollout. Most merchants aren't manually implementing UCP. Their platform is doing it for them, and the platforms shipped the update fast.

Platform-by-Platform Reality

The theoretical guides will tell you that UCP readiness is about your structured data and feed configuration. In practice, it's mostly about which platform you're on. Here's what we've seen across the major players.

Shopify: The Default Winner

Shopify accounts for roughly 74% of identified platforms in our dataset (898 of the platform-identified merchants). This dominance isn't because Shopify merchants are more proactive about UCP — it's because Shopify rolled out UCP support at the platform level, giving every store baseline compliance automatically.

Out of the box, a Shopify store gets functional product discovery, cart, and checkout endpoints. The Schema.org markup is handled. The Merchant Center feed attributes are populated. For the average merchant, getting UCP ready on Shopify means verifying that your product data is clean rather than building anything from scratch.

The downside: Shopify's one-size-fits-all approach means limited customization of UCP behavior. If you need to implement conversational commerce attributes like substitution logic or compatibility data, you're working within Shopify's constraints. But for baseline agentic commerce readiness, nothing else comes close to the out-of-the-box experience.

WooCommerce: Flexible but Inconsistent

WooCommerce stores show the widest variance in UCP readiness. The open-source model means implementation quality depends entirely on which plugins a merchant has installed and how they've configured their stack. We've seen WooCommerce stores with excellent structured data and smooth agent interactions right next to stores where basic product attributes are missing or malformed.

The flexibility is a genuine advantage for merchants who want to implement advanced ACO features — conversational attributes, detailed return policies, rich product relationships. But the inconsistency is a problem for agents, which need predictable data structures to operate reliably. If you're on WooCommerce and serious about agentic commerce optimization, an audit of your specific UCP endpoint output is essential, not optional. Run your store through UCP Checker and see what an agent actually encounters.

BigCommerce: Strong APIs, Broken Images

BigCommerce has a genuine technical advantage in its API architecture. The platform's API-first design translates well to UCP's endpoint model, and the stores we've tracked generally produce clean, well-structured UCP responses.

But there's a specific, persistent issue: BigCommerce's S3-hosted image URLs break agent image parsing. This is a real failure mode we've observed in Playground sessions. When an agent can't parse product images, it loses a significant input signal for product matching and variant selection. For a platform that otherwise has strong UCP fundamentals, this is an unfortunate gap — and one that BigCommerce merchants should pressure their platform to fix. For now, it's worth investigating whether your image delivery pipeline produces URLs that agents can reliably consume. Our BigCommerce guide walks through the specifics.

Magento (Adobe Commerce): Enterprise Muscle, Enterprise Complexity

Magento implementations tend to be enterprise-grade, which means the UCP output is thorough but the setup complexity is high. These stores generally have rich product data, detailed catalog structures, and the kind of attribute depth that agents love. But the implementation burden falls more heavily on the merchant's development team compared to Shopify or BigCommerce, where the platform handles the heavy lifting.

If you're on Magento and aren't UCP ready yet, expect a meaningful engineering investment. If you have started, you're probably in good shape — the platform's data model maps well to what UCP expects, especially for multi-variant products and complex catalog hierarchies. See our Magento guide for implementation specifics.

What Agents Actually Do (vs. What Guides Tell You to Optimize For)

Here's where our data diverges most sharply from the advisory content circulating about UCP preparation.

Agents Skip the Cart

The conventional model of ecommerce — browse, add to cart, review cart, checkout — doesn't describe how AI agents behave. In our Playground data, we've recorded 395 checkout operations versus just 104 cart operations. Agents are going direct to checkout nearly four times more often than they're using the cart.

This has major implications for agentic commerce optimization. If you've invested heavily in cart-level features — upsells, cross-sells, minimum order messaging, cart-based promotions — agents are likely bypassing all of it. The checkout endpoint is where the action happens. Your optimization effort should weight accordingly — compare your store against competitors to see where you stand: make sure checkout handles single-product and multi-product flows cleanly, with clear variant specification and unambiguous pricing.

Variant Mismatches Are the Top Failure Mode

Cart variant mismatches remain the most common reason agent sessions fail to complete a purchase. An agent selects a product, identifies the desired variant (size, color, configuration), and submits a cart or checkout request with a variant ID that doesn't match what the endpoint expects. The session stalls or errors out.

This isn't an agent intelligence problem — it's a data clarity problem. Stores with clean, unambiguous variant structures and consistent ID schemes see dramatically higher agent completion rates. Stores with complex variant matrices, inconsistent naming, or variant IDs that change between API responses create confusion that even the best models struggle to resolve.

If you do one thing for ACO today: audit your variant data. Make sure every variant has a stable identifier, a clear human-readable name, and consistent representation across your discovery and checkout endpoints.

Token Consumption Tells You Where Agents Struggle

We've consumed 43 million tokens over 1,000 Playground sessions. The per-session cost varies dramatically based on store complexity and model choice, but a telling pattern emerges in checkout flows: completing a purchase takes approximately 55,000 tokens with the best-performing models.

That number is a proxy for friction. A 55K-token checkout means the agent is making multiple round-trips, parsing product data, resolving variants, handling errors, and re-trying. Stores that produce clean, predictable UCP responses see lower token counts — which directly translates to faster agent interactions and lower cost for the platforms running these agents at scale.

Model Performance Varies Significantly

Not all AI models handle UCP interactions equally. Claude Sonnet 4.5 leads our Playground leaderboard with 205 sessions, and the checkout completion rate across all sessions sits at 41%. That might sound low, but consider what it represents: four out of ten fully autonomous purchase attempts succeed end-to-end, without any human intervention, across a diverse set of merchants with varying UCP implementation quality.

The model performance gap matters for merchants because it signals where your UCP implementation has rough edges. If top-tier models struggle with your checkout flow, every agent will struggle. Testing your store in UCP Playground with multiple models gives you a direct read on where your implementation creates unnecessary friction.

The Capabilities Gap That Will Define Winners

Go back to those adoption numbers: identity at 3 merchants, payment at 0. These aren't just gaps — they're the entire frontier of competitive differentiation in agentic commerce.

Right now, every UCP checkout ends with a handoff. The agent gets the customer to the point of purchase, then drops them into a traditional checkout flow to enter their identity and payment information. That handoff is where conversion dies. Every redirect, every form field, every authentication step is a chance for the customer to abandon.

The merchants who figure out identity and payment first — who let an agent complete a purchase end-to-end without a handoff — will have a structural conversion advantage that no amount of Schema.org optimization can match. This is where UCP's roadmap points: loyalty integration, post-purchase management, multi-vertical capabilities. But the foundation is identity and payment.

We don't yet know what the winning implementation pattern looks like for these capabilities. The spec supports them, but the ecosystem hasn't built them. This is the space to watch, and the space where early investment will pay disproportionate returns.

An Optimization Checklist Grounded in Data

Most ACO checklists are derived from the spec. This one is derived from watching >1,000 agent sessions succeed and fail across 4,024 merchants. Here's what actually moves the needle, ranked by observed impact:

1. Fix your variant data first. Stable IDs, clear names, consistent representation across endpoints. This is the single highest-impact fix based on our failure-mode analysis.

2. Optimize for direct-to-checkout flows. Agents skip the cart. Make sure your checkout endpoint handles product selection, variant specification, and pricing in a single clean interaction.

3. Audit your product images. If you're on BigCommerce or any platform using CDN-hosted images with complex URL structures, verify that agents can parse your image URLs. Broken image parsing degrades product matching accuracy.

4. Migrate to the latest spec version immediately. The v2026-04-08 migration happened in four days across the ecosystem. If you're still on an older version, you're already behind 99.3% of verified merchants.

5. Test with actual agents, not just validators. Schema validation tells you if your markup is syntactically correct. It tells you nothing about whether an agent can actually complete a purchase. Run your store through UCPPlayground and watch what happens.

6. Validate your full UCP endpoint output. Use UCPChecker to see exactly what your store exposes to agents — capabilities, product data, structured attributes — and where the gaps are.

7. Clean up your Merchant Center feed. Return policies, product identifiers, and the native commerce attributes that feed into UCP discovery. This is table-stakes, but our data confirms that stores with complete feed data see higher agent engagement in discovery flows.

8. Start thinking about identity and payment. You won't implement these today — almost nobody has. But understanding the spec's identity and payment capabilities now positions you — our April ecosystem report tracks adoption monthly to move fast when the ecosystem catches up. The jump from 0 to first-mover will be worth more than incremental improvements to discovery or checkout.

9. Monitor your platform's UCP updates. If you're on Shopify, WooCommerce, BigCommerce, or Magento, your platform is doing most of the UCP work. Stay current with their releases — set up domain alerts to get notified when your store's status changes. Platform-level updates drove 99.3% spec migration in four days — the single most effective "optimization" most merchants can do is simply keeping their platform current.

10. Get listed in the UCP directory. UCPRegistry is the open UCP index where agents discover merchants. Your listing is what agents see when deciding which merchants to route a customer to. Make sure you're listed, your data is accurate, and your capabilities are competitive with peers in your vertical.

The Bottom Line

Agentic commerce optimization isn't a theoretical exercise anymore. UCP ecommerce is live, it's measurable, and it's growing fast. Our UCP index tracks 4,024 verified merchants serving UCP endpoints today. AI agents are completing purchases 41% of the time. The gap between being UCP ready and being UCP optimized is measurable in variant data quality, checkout flow design, and capabilities adoption.

The merchants who treat ACO as a data problem — not just a markup problem — are the ones who'll convert when agents come shopping. And agents are already shopping. We've got 43 million tokens of proof.

Check if your store is UCP ready at UCPChecker.com. Browse the UCP directory at UCPRegistry. Test agent interactions in UCPPlayground. Platform-specific implementation guides: Shopify · WooCommerce · BigCommerce · Magento.

The State of Agentic Commerce — April 2026

Benji Fisher — Sat, 18 Apr 2026 09:48:53 +0000

In March, we crossed 3,000 verified stores and started seeing the first non-Shopify platforms in the directory. We said the next question was whether UCP would remain a Shopify story or become a real multi-platform standard.

April answered that. We crossed 4,000 verified stores, Shopify migrated its entire fleet to the new v2026-04-08 spec in a four-day window, BigCommerce entered the directory with its first three stores, and WooCommerce and Magento integrations started appearing from independent developers. The ecosystem grew 33% in one month while simultaneously upgrading the protocol underneath.

This is the third monthly state-of-the-ecosystem report from UCP Checker. Here's what the data says.

The numbers

As of April 17, 2026:

4,014 verified UCP stores (up from ~3,000 in March, +33%)
4,481 total domains tracked
47,154 total checks run
1,436 new merchants discovered this month
866 new merchants this week alone
3,988 stores on the latest v2026-04-08 spec (99.4%)

The growth curve is worth examining. February was discovery: we scanned our first thousand Shopify stores and found UCP everywhere on the platform. March was expansion: we broadened the crawler, crossed 3,000, and started seeing non-Shopify manifests for the first time. April is consolidation: the store count grew 33%, but the more significant movement was the spec migration and the first signs of platform diversification.

The weekly run rate matters here. At 866 new merchants discovered this week alone, the ecosystem is adding roughly 125 stores per day. But the growth isn't organic in the way a consumer product grows — it comes in waves, driven by platform-level deployments. When Shopify flips a switch, hundreds of stores appear overnight. When BigCommerce ships UCP, three appear. The question for May isn't "how many stores" but "which platforms ship next" — because each platform deployment is a step function, not a slope.

The Shopify spec migration

This is the story of the month. Between April 13 and April 17, Shopify migrated nearly its entire UCP fleet from v2026-01-23 to v2026-04-08.

On April 13, our crawler showed 2 stores on the new spec. By April 17: 3,988. That's 3,986 stores upgraded in roughly four days — a coordinated platform-level migration, not individual merchants updating their manifests.

The v2026-04-08 spec introduced three breaking changes:

signing_keys moved from nested to root level. Previously at ucp.signing_keys, now at the document root alongside ucp. This is the structural change that required a manifest rewrite, not just a version bump.
Business profile distinction. The spec now formally separates business profiles (individual store manifests at /.well-known/ucp) from platform profiles, with different requirements for spec and schema fields on services and capabilities. Business profiles are lighter — spec and schema are optional.
a2a transport formally added. Google's Agent2Agent Protocol is now a recognised transport alongside REST, MCP, and Embedded, though adoption is effectively zero in the wild.

The migration means 99.4% of the verified directory is now on the latest spec. Only 26 stores remain on older versions: 19 on v2026-01-11, 6 on v2026-01-23, and 1 on v2026-01-14. These are almost entirely non-Shopify stores that need to upgrade manually.

For the full spec breakdown, see our v2026-04-08 spec announcement and the spec versions page.

Beyond Shopify: platform diversification accelerates

Shopify still dominates at 3,982 of 4,014 verified stores (99.2%). But the other 32 verified stores tell a more interesting story — these are developers who chose to publish a UCP manifest without a platform-level integration doing it for them.

BigCommerce entered the directory with its first three verified stores: untilgone.com, touchupdirect.com, and midwoodflowershop.com. All three are on v2026-04-08 with checkout and cart capabilities declared. Notably, their average manifest latency (~890ms) is significantly higher than Shopify's (~130ms) — BigCommerce manifests are served from the storefront origin rather than a CDN-cached endpoint. Platform-level latency differences like this will matter as agent response budgets tighten.

WooCommerce now has 3 verified stores, up from zero in March. These are hand-built integrations — WooCommerce doesn't have native UCP support, so each merchant published their manifest manually. We fixed a validation bug this month that was incorrectly rejecting WooCommerce manifests with payment_handlers: [] (valid for stores using checkout-link redirect flows).

Magento has 1 verified store. Custom/headless stacks account for 25 verified stores — the most architecturally diverse group, including our own ucpchecker.com manifest.

Salesforce Commerce Cloud has zero verified stores in the directory today. But industry signals suggest SFCC is exploring UCP support at the platform level — not as a one-off client integration, but as a feature that would ship to all Commerce Cloud merchants. If it follows the Shopify pattern — a single platform-level deployment bringing thousands of enterprise storefronts (Puma, Ralph Lauren, Under Armour, Adidas) into the ecosystem in one wave — the directory composition would shift significantly. SFCC is natively REST-based, so a REST-first UCP transport would be the natural fit, compared to Shopify's MCP-first approach. We're watching this closely.

The full platform breakdown is live on our new /platforms page.

How agents actually perform

The numbers above tell you which stores have UCP. This section tells you which stores work when an AI agent actually tries to shop them — and which models do it best.

Store benchmarks

Playground benchmarks grade stores A through F on end-to-end agent shopping performance:

Grade	Count	What it means
A	9	Agent completes the full flow flawlessly
B+	422	Works with minor issues — the largest cohort
B	222	Cart succeeds, checkout has friction
C+ / C	225	Discovery and browse work, deeper flow breaks
D	16	Significant failures across the flow
F	289	Manifest validates but the agent can't complete any step

The B+ tier at 422 stores is the most important number here. These stores are close — an agent can reliably discover, search, and cart them, but checkout friction (slow responses, variant mismatches, payment handler quirks) stops the flow short. The path from B+ to A is usually a single fix. The 289 F-grade stores are the other end: technically verified but functionally broken when an agent actually tries to shop them.

Model leaderboard

UCP Playground now supports 15 frontier LLMs from 7 vendors, tested against 76 unique stores, generating over $114,000 in aggregate cart value. The model leaderboard scores every model on search, cart completion, and checkout conversion:

Model	Shopping Score	Checkout %	Search %	Vendor
DeepSeek V3.2	63	53.1%	85.7%	DeepSeek
Gemini 3 Flash	59	51.4%	90.3%	Google
Grok 4	59	42.0%	92.0%	xAI
Claude Opus 4.6	52	41.9%	80.0%	Anthropic
Claude Sonnet 4.5	50	54.6%	86.8%	Anthropic

And the speed rankings — because latency is the other dimension that matters:

Model	Avg Session	Vendor
Gemini 2.5 Flash	~12s	Google
GPT-4o	~14s	OpenAI
Gemini 3 Flash	~17s	Google
Claude Opus 4.6	~31s	Anthropic
Grok 4	~76s	xAI

Three takeaways

DeepSeek V3.2 leads the leaderboard. An open-weight model tops the composite shopping score at 63 — ahead of every Anthropic, Google, and OpenAI model. The agentic commerce stack is genuinely model-agnostic in practice, not just in spec language.

Search works everywhere. Checkout is the bottleneck. Every model scores above 70% on product search. But checkout conversion drops to 13–56% depending on the model. The gap between "can find products" and "can actually buy them" is the reliability frontier for the ecosystem. This is where the work is.

Reasoning models underperform. QwQ 32B (0% checkout), o4-mini (16.7%), Grok 3 Mini (13.3%), and DeepSeek R1 (21.4%) all score below 40. Models optimised for chain-of-thought reasoning burn tokens on deliberation and struggle to execute the simple, sequential tool-call patterns shopping requires. The best shopping agents are fast and decisive, not thoughtful.

Full model profiles are on the Playground models page.

The reliability gap: verified is not ready

This is the editorial point we want to make clearly, because the headline number (4,014 verified stores) obscures the more important one: 9 stores score A.

Four thousand stores have valid UCP manifests. Nine of them deliver a flawless end-to-end agent shopping experience. That's a 0.2% flawless rate. The gap between "technically verified" and "actually shoppable by an AI agent without friction" is the central infrastructure problem for agentic commerce in 2026.

The B+ tier — 422 stores — is where the leverage is. These stores work most of the time. An agent can discover them, search their catalog, build a cart, and usually reach a checkout URL. But "usually" isn't good enough when the agent is spending someone's money. The failures at B+ level are specific and fixable:

Cart variant mismatches — the agent selects a size/colour variant that doesn't match the store's internal variant ID scheme. The cart call succeeds but adds the wrong item.
Payment handler timeouts — the tokenization step takes longer than the agent's timeout window, and the session drops silently.
Stale product data — the catalog returns products that are out of stock by the time the agent tries to cart them. No error — just an empty cart.
Checkout redirect loops — the checkout URL the store returns sends the agent into an authentication loop that a human browser would handle with cookies but an MCP client can't.

Each of these is a single-fix problem for the store operator. But at scale, across 422 stores, the aggregate effect is that agents fail more often than they succeed at the final step. The ecosystem doesn't need more stores. It needs the stores it has to work more reliably. That's the infrastructure investment that will actually unlock agent commerce at scale — and it's where we're focusing our tooling work for May.

Capability coverage: the ceiling hasn't moved

Across 4,014 verified stores:

Capability	Coverage	Stores
Checkout	99.6%	3,996
Cart	99.3%	3,985
Identity linking	0.07%	3
Payment	0%	0

Same pattern as March. Checkout and cart are effectively universal because Shopify ships them by default. The advanced capabilities — identity, loyalty, payment — haven't moved. The gap between "technically verified" and "deeply agent-ready" is still the story. Until more stores declare capabilities beyond the Shopify defaults, the ecosystem depth chart stays flat.

The broader ecosystem

April was quieter on the announcements front than March — which saw Splitit, PayPal, and Google all making public UCP commitments in a single week. But the signals that matter in April are structural, not press-release-shaped.

Shopify's fleet-wide spec migration is itself an ecosystem signal. It demonstrates that a major platform can coordinate a breaking spec upgrade across thousands of stores in days, not months. Every other platform considering UCP adoption now has a reference point for what a managed migration looks like. The v2026-04-08 changes (signing_keys relocation, business profile distinction) were non-trivial — and Shopify shipped them to its entire fleet without a single store going offline. That's the kind of platform engineering confidence that accelerates the next platform's decision to build UCP support.

The endorsed partner roster continues to grow. Adyen, American Express, Mastercard, Stripe, Visa, Checkout.com, Affirm, Splitit, and PayPal are all publicly committed to the protocol's payment layer. For any platform evaluating UCP, the payment handler ecosystem is no longer a gap — it's arguably the most mature part of the stack.

The model ecosystem is widening faster than the store ecosystem. In February, we tested 3 models. In March, 8. In April, 16 — from 7 vendors across the US, China, and Europe. The number of AI models that can speak MCP and execute a UCP shopping flow is growing faster than the number of stores that can serve one. This suggests the bottleneck is shifting from "agents that can shop" to "stores that can be shopped reliably" — which circles back to the reliability gap above.

What we shipped

Heavy shipping month on the tooling side:

Side-by-side store comparison — compare any two stores head-to-head on metrics, capabilities, transports, and payment handlers. Embeddable via iframe for blog posts and docs.
Platform pages — live landing pages for Shopify, BigCommerce, WooCommerce, Magento, and Custom. Leaderboards, capability coverage, and transport adoption — auto-populates as stores verify.
/.well-known/ucp developer guide — field reference, minimal examples, publishing guides for Nginx/Cloudflare/Node, the six most common validation mistakes.
Product discovery guide — the MCP tool call sequence agents use to find and buy products. Live demo, discovery-ready stores, three-way CTA to Playground + Registry + Rails.
Build an Agent quickstart — from zero to a working agent in 30 minutes. Copy-paste code in Python and TypeScript.
Spec validation fixes — accepted the payment.handlers nested format (WooCommerce), downgraded empty payment_handlers: [] from hard fail to warning, upgraded our own manifest to v2026-04-08.

What to watch in May

Salesforce Commerce Cloud. First platform-level deployment from the enterprise tier would be the most significant ecosystem event since Shopify's initial rollout. We'll catch any SFCC store that publishes on the next crawl.

The B+ → A path. 422 stores are one fix away from flawless agent shopping. We're building tooling to surface the specific issue per store so operators can action it.

Non-Shopify growth rate. 32 non-Shopify stores this month vs ~15 last month. If this doubles again in May, UCP stops being a "Shopify project" and becomes a genuine multi-platform standard.

AP2 / A2A adoption. Zero stores declare either protocol. The v2026-04-08 spec formally added a2a as a transport. First adopter will be notable.

Sources

All data comes from the UCP Checker crawler, which re-checks every tracked domain at least every 24 hours. The raw verified-merchant dataset is published monthly on Hugging Face under CC-BY 4.0.

Browse the directory: ucpchecker.com/directory
Track adoption live: ucpchecker.com/stats
Compare two stores: ucpchecker.com/compare
Platform breakdown: ucpchecker.com/platforms
Build your own agent: ucpchecker.com/agents

MCP vs UCP vs AP2: What is the Difference?

Benji Fisher — Thu, 16 Apr 2026 10:21:57 +0000

Every week we get a version of the same question from developers reaching out about UCP Checker: "OK, but should I actually build on MCP, UCP, or AP2?"

It's a reasonable question. The three protocols get lumped together in keynote slides and vendor blog posts, each positioned as "the" standard for how AI agents and commerce should talk to each other. If you're deciding what to implement this quarter, the marketing makes it look like a fork — pick one, live with it.

Here's the honest answer from running the only continuously-updated UCP directory of 3,643+ verified stores (as of April 13, 2026):

MCP, UCP, and AP2 aren't competitors. They're stack layers. If you're doing agentic commerce seriously, you'll end up using all three — and they fit together more cleanly than the messaging suggests.

This post is the argument, anchored in real adoption data from the UCP Checker directory.

The stack, in one diagram

┌─────────────────────────────────────────┐
│                                         │
│   AP2       ← payment authorization     │
│             (fits inside UCP's          │
│              payment_handlers)          │
│                                         │
├─────────────────────────────────────────┤
│                                         │
│   UCP       ← the shopping contract     │
│             (what a store sells,        │
│              what capabilities exist,   │
│              which transports to use)   │
│                                         │
├─────────────────────────────────────────┤
│                                         │
│   MCP       ← tool invocation           │
│             (how the agent actually     │
│              calls discover-store,      │
│              search-catalog, etc.)      │
│                                         │
└─────────────────────────────────────────┘

Read that top-to-bottom: an AI shopping agent opens a session, discovers a store via UCP, calls the store's tools via MCP, and — when it's ready to pay — hands off to AP2 for the payment authorization flow.

That's the whole thing. Now the detail.

What MCP actually is

Model Context Protocol is Anthropic's open protocol for connecting AI models to tools, resources, and data sources. It is not commerce-specific. MCP is how Claude Desktop talks to your filesystem, how Cursor talks to your database, how an agent in any environment calls a "list files" or "search knowledge base" tool.

It's JSON-RPC over stdio, SSE, or HTTP. It defines a handshake, a tool-description schema, a session lifecycle, and a notification system. When an agent wants to "call a function" on an external system, MCP is the envelope that carries the call.

In the context of shopping, MCP is the mechanism UCP uses to dispatch commerce tool calls. When an AI agent runs search-catalog against a verified store, it's sending an MCP tool-call message to the endpoint declared in that store's UCP manifest. The fact that MCP is involved is a transport detail. The fact that a store supports UCP search at all is what the agent actually cares about.

Here's the adoption data from our directory right now: of 3,643+ verified UCP stores as of April 13, 2026, effectively 100% declare MCP as one of their transports. MCP is the de facto transport for UCP. Not because MCP "won" a protocol war, but because there's nothing else that does what it does at this layer.

If you want to see the exact transport mix, the live breakdown is on the transports page. It's been MCP-dominant since day one.

What UCP actually is

Universal Commerce Protocol is the open standard for agentic commerce specifically. It answers questions MCP doesn't:

How does an agent find this store in the first place? (Answer: a well-known manifest at /.well-known/ucp.)
What can the store do? (Answer: a declared set of capabilities — checkout, cart, catalog-search, identity-linking, buyer-consent, fulfillment, and so on.)
How does the agent talk to it? (Answer: one or more declared transports — REST, MCP, A2A, or Embedded. MCP being by far the most common in practice.)
What payment methods does the store accept? (Answer: a declared list of payment handlers — Stripe, Google Pay, Shop Pay, and others — with enough detail for an agent to tokenize a card.)

UCP is the contract. It's a JSON document agents fetch before they do anything else. Without a valid UCP manifest, an agent doesn't know what your store can do, which tools it exposes, or how to pay. It can scrape your HTML like any other crawler — and most of them will — but the experience is slow, unreliable, and breaks at checkout more often than it succeeds.

UCP's job is to be the single document that makes a store shoppable by agents. MCP is the mechanism it points at. AP2 fits inside its payment handler list.

What AP2 actually is

Agent Payments Protocol is Google's specification for how AI agents authorize and execute payments. It's much more specialized than either MCP or UCP: it's about the money layer specifically — consent, authorization, dispute handling, the cryptographic mandate that lets a specific agent run a specific transaction.

AP2 is newer than both MCP and UCP, and its adoption numbers reflect that. At time of writing, we have zero stores in the directory declaring an AP2 payment handler, compared to dozens declaring Shop Pay, Stripe, Google Pay, and the various tokenizer namespaces.

That might sound like a strike against AP2. It isn't. AP2 is a different kind of thing — it's not a transport and it's not a capability declaration, it's a protocol for the auth step that happens after the agent has already selected items and built a cart. The stores that will eventually use it will declare it as a payment handler namespace inside their existing UCP manifest. UCP is the envelope that carries AP2 into the agent commerce ecosystem.

When AP2 adoption starts showing up in the directory, UCP Checker will catch it automatically on the next crawl cycle. We'll know because we track every payment handler namespace across every verified store, and we'll be able to tell you exactly which stores flipped first. That's the kind of thing the directory is for.

How they actually compose — a worked example

Picture an AI shopping agent asked to buy a pair of shoes. Here's what happens in a UCP-verified flow:

UCP discovery. The agent fetches https://allbirds.com/.well-known/ucp. It parses the JSON, reads the capability list (checkout, cart, catalog-search, payment, identity-linking…), picks the transport it wants to use from the declared list (it picks MCP because it's listed first and the agent speaks MCP), and notes the payment handlers this store accepts.
MCP session. The agent connects to the MCP endpoint declared in the UCP manifest. It opens a session, lists the available tools, and calls search-catalog({query: "running shoes"}). The store responds with a list of products.
More MCP. The agent calls add-to-cart({variant: "...", quantity: 1}). The store responds with a cart state and a checkout URL.
Payment hand-off. The agent needs to pay. It looks at the store's declared payment handlers. If one of them is an AP2 namespace (not yet, but eventually), it runs the AP2 authorization flow — getting consent, building the mandate, submitting the authorization. If not, it falls back to tokenizing a card via the declared payment handler's tokenization spec (Stripe, Google Pay, Shop Pay, etc.).
Confirmation. The agent calls order.create (another MCP tool call, same session, same transport) and gets back an order confirmation.

UCP, MCP, and AP2 were all involved in that flow — at different layers, for different purposes. None of them could have replaced the others. That's the whole argument.

You can see this flow literally running against a verified store in the live agent demo on our homepage — it's a real AI agent doing the above against a real UCP-verified store, step by step, with the actual tool calls shown alongside.

Why the confusion exists

Because each protocol's marketing wants to be the center of the conversation.

MCP positioning tends to be "the universal way agents talk to tools" — which is true, but "tools" is the operative word, and commerce is one vertical among many.

UCP positioning is "the open standard for agent commerce" — which is true, but UCP's adoption in practice depends on having a transport layer (MCP) and can optionally delegate payment authorization to (AP2).

AP2 positioning is "the protocol for agent payments" — which is true, but payment is one step in a much larger commerce flow that needs UCP to frame it and MCP to dispatch it.

Each protocol's marketing is correct about its own layer. The confusion comes from each one acting like it's the whole stack. It isn't.

Common comparison questions

These are the exact questions we see most often — the ones AI agents route to this post, the ones developers Google before they commit to a protocol. Quick, direct answers anchored in the UCP directory data.

What is the difference between MCP and UCP?

MCP is a tool invocation protocol. UCP is a shopping contract. They operate at different layers and you use both.

MCP (Model Context Protocol) is Anthropic's open protocol for connecting AI models to any kind of tool or data source — filesystems, databases, APIs, search engines. It's domain-agnostic by design. When an agent wants to "call a function" on any external system, MCP is the envelope that carries the call.

UCP (Universal Commerce Protocol) is the open standard for agentic commerce specifically. It answers questions MCP doesn't: how does an agent find a store, what can it do there, which payment methods does it accept. UCP's job is to be the single /.well-known/ucp manifest that makes a store discoverable and shoppable by agents.

The relationship in practice: MCP is UCP's dominant transport. Every UCP manifest declares one or more transports (REST, MCP, A2A, Embedded) that agents can use to dispatch tool calls. Across the 3,643+ verified stores in our directory as of April 13, 2026, effectively 100% declare MCP. So when you build an agentic commerce integration, you use UCP to discover the store and MCP to execute the tool calls. Not one or the other — both, in order.

What is the difference between AP2 and UCP?

AP2 is a payment authorization protocol. UCP is the full shopping stack. AP2 is one thing that fits inside UCP, not a replacement for it.

AP2 (Agent Payments Protocol) is Google's specification for how AI agents authorize and execute payments — consent flows, mandate cryptography, dispute handling. It's deliberately narrow: AP2 is about the money step, not the browsing or cart-building steps that come before it.

UCP covers the whole shopping flow: discovery (what does this store sell), browsing (catalog-search, cart), and the payment layer. UCP's payment handlers section in every manifest is a map of payment handler namespaces — Stripe, Google Pay, Shop Pay, and eventually AP2 when it reaches adoption. AP2, when it ships in production stores, will show up as a payment handler namespace inside an existing UCP manifest, sitting alongside the other tokenization methods an agent can choose from.

Current adoption data as of April 13, 2026: zero stores in the UCP directory declare an AP2 payment handler. That's not a criticism — AP2 is newer than both MCP and UCP, and the rollout is gated on payment processors exposing it. But it makes the practical answer clear right now: you publish a UCP manifest today, you add AP2 later when your payment processor supports it. UCP is the umbrella; AP2 is one of the spokes it will eventually hold.

What is the difference between A2A and UCP?

A2A is a transport protocol (like MCP). UCP is the shopping contract. A2A is one of UCP's allowed transport options, not a competitor.

A2A (Agent2Agent Protocol) is Google's protocol for agent-to-agent communication — how two agents talk to each other directly without a human intermediary. It serves a similar role to MCP within the UCP stack: it's the mechanism an agent uses to dispatch tool calls against a store's endpoint.

UCP's v2026-04-08 spec lists four allowed transports: REST, MCP, A2A, and Embedded. A store can declare any or all of them in its manifest — and agents will pick whichever they support when they connect. A2A is formally on the list, same as MCP.

In practice, A2A adoption on verified stores is effectively zero today, versus MCP's near-100%. The reason isn't technical, it's ecosystem timing: MCP shipped earlier and got the first wave of tooling. A2A is a strong candidate for the second wave once agent-to-agent coordination (one agent buying on behalf of another, multi-agent fulfilment pipelines) becomes a common pattern. When that happens, stores will add A2A to their existing UCP manifests alongside MCP — not replacing it. The correct framing is still "A2A goes inside UCP," same as MCP does.

Which should you actually adopt

Depends on what you're building.

If you're a store owner or ecommerce engineer, your job is to publish a valid UCP manifest at /.well-known/ucp that declares your capabilities, transports (MCP will almost certainly be one of them), and payment handlers. You don't need to "pick" MCP — UCP will tell you to expose an MCP endpoint as one of its transports, and most of the tooling you'll find assumes MCP. AP2 you can add later, as a payment handler, when your payment processor supports it.

If you're an agent or tooling developer, you need to speak all three, in the right order. Fetch UCP first to discover the store. Use MCP to actually dispatch tool calls against the declared endpoint. Handle AP2 at the payment step if the store declares it. In practice you'll build a UCP client library that wraps all of this transparently.

If you're a payments company, AP2 is your layer. Your job is to get your payment processor's tokenization spec declared as a payment handler in UCP manifests across the ecosystem, and eventually to support AP2 mandates as the authorization step.

If you're analyzing the ecosystem, look at the UCP adoption data. MCP transport counts and AP2 payment handler counts are both measurable inside UCP manifests — which is why we surface them on the platforms, transports, and payment handlers pages. The question is never "which protocol won," it's "how many stores have it."

The one-line summary

UCP is the shopping contract. MCP is how agents dispatch tool calls against it. AP2 is how they authorize payments inside it. All three are required for a complete agent commerce stack, and none of them replaces the others.

If you're unclear whether your store is set up correctly, run a live check — we'll fetch your manifest, validate it against the current spec, and tell you exactly which transports and payment handlers you're declaring (and which you're missing). If you're evaluating two stores' UCP coverage side-by-side, the compare tool puts their capabilities, transports, and payment handlers in a single scannable view.

And if you're building something on top of the stack and want to know which stores have what — that's what the directory is for.

Check your manifest: ucpchecker.com/check
Compare two stores: ucpchecker.com/compare
Browse the directory: ucpchecker.com/directory
Developer guide to /.well-known/ucp: ucpchecker.com/well-known-ucp

Introducing Side-by-Side Store Compare: See How Any Two UCP Stores Stack Up

Benji Fisher — Mon, 13 Apr 2026 20:47:52 +0000

Three months into running UCPChecker, the most common follow-up question we get from anyone reading a status report is the same: "OK, but how does that compare to [other store]?"

That question comes from everywhere. Developers picking which store to integrate with first. Analysts tracking which platforms are pulling ahead in UCP coverage. Store owners benchmarking against direct competitors. Marketing teams putting together pitch decks about why their stack is more agent-ready than the next. Platform vendors comparing their hosted ecosystem against rival platforms. AI agent builders deciding which retailers to feature in demo flows.

None of these audiences really care about a single store's UCP coverage in isolation. They all care about how it stacks up against another. Whether Allbirds is more agent-ready than Casper. Whether Boden's Shopify implementation goes deeper than Born for Fashion's. Whether the brand they're about to integrate with has more capabilities than the one they're already integrated with. The interesting answer is always relative.

Until today, the only way to get that answer on UCPChecker was to open two browser tabs and squint. So we built the thing people were already trying to do manually.

Compare any two UCP stores side-by-side at ucpchecker.com/compare →

What it does

Pick two domains. Get every measurable UCP attribute laid out side by side in a single scannable view.

Headline metrics: status, UCP version, latency, capability count, transport count, payment handler count, HTTP status, robots.txt policy, platform. Quantitative cells highlight the leading side with a soft green left-border, so you can scan winners without reading numbers.
Capability matrix: every UCP capability declared by either store, bucketed into "Both stores", "A only", and "B only". Each chip links straight to the capability's deep-dive page, so if you spot a gap you can immediately see what it is and why it matters.
Transport diff: same Both / A only / B only treatment for REST, MCP, A2A, and Embedded.
AI bot access matrix: GPTBot, Google-Extended, ClaudeBot, Applebot-Extended, and CCBot — allowed, blocked, or unknown for each store.
Payment handlers: which payment methods each manifest declares, including the ones one side has and the other doesn't.
Pick another: tiny inline form pre-filled with the current side A so you can swap side B and re-run instantly.
Related comparisons: auto-suggested by capability overlap with side A — a useful map of who else is building similar agent surface in the same space.
Status-aware FAQ: the questions change depending on the matchup. Two verified stores get a different lead question than one verified vs one not-detected.

It's free, public, indexable, and works with any domain. If a store isn't already in our directory, we run a live check the first time you compare it.

Why we built it

Two reasons. The first one is the most honest.

Right now, UCP coverage is a moving target. Some stores have everything declared — checkout, cart management, identity linking, payment tokens, multiple transports. Other stores have a single capability and a single transport, technically verified but barely useful to an agent. The directory grade tells you "verified" or "not". It doesn't tell you whether one verified store is significantly more agent-ready than another.

That difference matters more every week. The teams building agentic commerce tooling — the ones picking which stores to index first, which to feature in demo flows, which to recommend to their users — they need a relative view. They need to know that allbirds.com's manifest goes three levels deeper than the manifest of an otherwise equivalent store. They were already opening two status pages and comparing fields by hand. We watched it happen in user sessions. Compare just makes that workflow native.

The second reason is more strategic. We've been quietly building the infrastructure for what we think will be the most important question in agentic commerce as it matures: not whether a store is verified, but who's pulling ahead. Compare is the first user-facing surface that exposes that question directly. There will be more.

How we built it (the short version)

The data was already there. Every Merchant in our database has its capabilities, transports, and payment handlers loaded as proper many-to-many relationships. The computation is just three set operations per relation: intersect, left-only, right-only. The hard part was deciding what to compare and how to render the diff so two columns of dense data still feel scannable on a phone.

Some of the design decisions worth calling out:

Alphabetical canonical URLs. /compare/casper.com/vs/allbirds.com 301-redirects to /compare/allbirds.com/vs/casper.com. Without that, every store-pair would generate two URLs and split its link equity in half. Pretty URL, single canonical, no duplicate content.

Sync check on first visit. If you compare a domain that isn't in our database yet, we run a fresh UCP check inline before rendering. The compare page never shows "no data" — it always has something to compare. Same pattern as the per-store status pages.

Noindex when neither side is verified. Two non-verified stores produces a thin page that would just pollute the search index. Those pages still work for visitors who land on them — they just don't get crawled. As soon as either side becomes verified, the page flips to indexable automatically.

A "winner" highlight, not a "winner" badge. The leading side on a quantitative metric (lower latency, more capabilities, fresher check) gets a gentle green left-border on its cell — but we never write the word "better" or "worse" anywhere. The data speaks for itself, and "better" isn't a value judgment we want to be making about other people's stores.

Status-aware FAQ that mirrors visible HTML to JSON-LD. Every compare page emits a real FAQPage schema with the same questions and answers a human reader sees. The FAQ branches based on the matchup so the lead question is always relevant to what you're looking at.

Try it

Four pairings we've been opening manually for weeks. Each one is a live embed of the actual comparison — the same data refreshes every 24 hours from our crawler.

allbirds.com vs casper.com — two well-known DTC brands, see how their capabilities differ.

boden.com vs kyliecosmetics.com — both verified Shopify stores, compare their depth.

hairlust.com vs thebodyshop.com — beauty vs hair, both verified.

bornforfashion.com vs casper.com — fashion vs sleep, contrasting capability surface.

Or just start typing two domains into ucpchecker.com/compare. Autocomplete suggests from the verified directory.

What's next

A few obvious extensions we're sitting on:

Embeds need more testing in the wild. We've shipped iframe and Markdown embeds and validated them locally, but the real test is seeing them deployed across Substack, Medium, Notion, GitHub READMEs, and the hundred CMSes we don't have on our test bench. If you embed a comparison and the layout breaks, tell us — we'll fix it fast.
postMessage iframe auto-resize. Right now embeds use a fixed iframe height (900px by default). Comparisons with sparse capability data leave whitespace; comparisons with dense data sometimes scroll. The cleanest fix is a postMessage handshake from the embed to the host page so the iframe sizes itself to its content. On the list.
Three-way and N-way comparison. Two columns is the right default — past two, the visual gets cramped — but for "which of these five Shopify stores has the deepest UCP implementation" type questions, we'll likely add a tabular wide-mode behind a separate URL.

Compare is the first product surface we've shipped that frames UCP coverage as a relative thing rather than a binary verified/not. It changes what you can ask. We're already seeing internal queries we couldn't run before — "show me every verified store that has cart management but is missing identity linking" is one diff away from being a real question someone outside our team can answer.

If you build something with it, or if you find a comparison that surprised you, let us know. The interesting comparisons are the ones we haven't thought to run.

Try it now: ucpchecker.com/compare
Browse the directory: ucpchecker.com/directory
Track adoption live: ucpchecker.com/stats
Validate a manifest: ucpchecker.com/ucp-validator
Get notified on changes: ucpchecker.com/alerts

UCP v2026-04-08 Spec Update

Benji Fisher — Sat, 11 Apr 2026 12:04:52 +0000

On April 9th, the UCP Technical Council shipped v2026-04-08 — the first spec bump since January's 2026-01-23 release. It's the largest single release in the protocol's history: 26 new features, 6 breaking changes, 19 documentation updates, and contributions from 15 first-time contributors.

This isn't a patch. It's the release where UCP stops being a checkout-and-order protocol and starts becoming a full commerce platform.

Here's what changed, what it means, and what you should do about it.

The headline features

Carts are now a first-class capability

The most consequential addition is formal cart support (#73). The dev.ucp.shopping.cart capability gives agents the ability to create, read, update, and manage persistent shopping carts — the workflow that dominates human e-commerce but has been entirely absent from agent commerce until now.

We wrote in March that only 2 out of 2,832 verified stores declared cart capabilities. That number was low because the spec itself hadn't formalized the capability. Now it has. The schema defines add-to-cart, remove, quantity updates, and cart retrieval. Discount extensions have been expanded to work with carts too, so agents can apply promo codes before checkout.

For agent developers: this is the capability that unlocks multi-step shopping. Instead of "find product → checkout immediately," agents can now build baskets, compare options, apply discounts, and let the user review before committing. Design for it now, even if adoption will take months to ramp.

Catalog search and product lookup

Agents can now discover what a store actually sells. The new catalog search and product lookup capabilities (dev.ucp.shopping.catalog_search and dev.ucp.shopping.catalog_lookup) give agents structured access to product discovery — search by keyword, filter by attributes, and retrieve full product details including variant IDs.

Previously, agents relied on unstructured HTML scraping or platform-specific APIs to find products before initiating checkout. Now product discovery is part of the protocol itself. This is the missing first step: an agent can search a store's catalog, find what it needs, add items to a cart, and check out — all through UCP.

We've added catalog detection to UCPChecker's capability tracking. As stores adopt these capabilities, you'll see them in the stats and on individual store profiles.

Request and response signing

Cryptographic signing (#156) is the security feature the protocol needed. Stores can now sign responses, and agents can verify they're talking to the real merchant — not a MITM or a spoofed endpoint.

The spec uses JWK-format public keys published in the discovery profile. What's notable is where those keys live: signing_keys has moved from inside the ucp object to the root level of the discovery profile, sitting as a sibling of ucp rather than nested within it. This is a structural change that affects how validators parse manifests.

We've updated our validator to handle both locations — the new root-level position for v2026-04-08+ manifests, and the legacy nested position for older versions.

The structural changes that matter

Business profiles vs. platform profiles

This is the change that will generate the most false positives if your tooling doesn't adapt.

v2026-04-08 formally distinguishes platform profiles (the full spec declarations that platforms like Shopify publish) from business profiles (what individual stores serve at /.well-known/ucp). The key difference: business profiles no longer require spec and schema URLs on capabilities, services, or payment handlers. Those fields are only mandatory at the platform level — stores inherit them from their platform.

This makes sense. A Shopify merchant shouldn't need to declare "spec": "https://ucp.dev/specification/shopping/checkout/" in their manifest — that's Shopify's concern, not the merchant's. But every validator that checked for these fields as required will now throw false warnings against perfectly valid business profiles.

UCPChecker has already updated its validation rules. If your store runs v2026-04-08, we'll validate against the business profile schema — no spurious warnings about missing spec or schema fields that your platform handles upstream. You can verify your store at UCPChecker.com.

Multi-parent capability extensions

Capabilities can now extend multiple parents with deterministic schema resolution. This sounds abstract, but it solves a real problem: capabilities like embedded checkout that need to compose behaviors from both the cart and checkout namespaces.

The extends field on capabilities now accepts an array of reverse-domain names instead of just a single string. Schema resolution follows a defined order, so there's no ambiguity about which parent's definition wins when there's a conflict.

Supported versions for backwards compatibility

Business profiles can now declare a supported_versions field — a map of older protocol versions to their profile URIs. This means a store can advertise "I speak v2026-04-08, but I also have a v2026-01-23 profile at this URL" — letting agents negotiate down to a version they understand.

For the ecosystem, this is important infrastructure. It means the v2026-01-23 → v2026-04-08 migration doesn't have to be a flag day. Stores can support both versions simultaneously while agents upgrade.

The breaking changes

Six changes in this release are marked breaking. Here's what they actually break:

Order schema: currency is now required (#283). Previously optional, the currency field on orders is now mandatory. If your implementation omits it, your order responses will fail validation against v2026-04-08. Fix: add the ISO 4217 currency code (e.g., "USD") to your order objects.

Authorization and abuse signals (#203). Stores can now communicate authorization requirements and abuse indicators to agents. This is new infrastructure for trust — stores can signal "this transaction requires additional verification" or "this request pattern looks suspicious" in a structured way.

Updated order capability (#254). The order schema has been restructured. If you're consuming or producing order responses, check your field names against the updated schema.

Embedded protocol error alignment (#325). Error responses in the embedded checkout protocol now follow UCP's standard error conventions. If you're parsing embedded checkout errors, the shape has changed.

Totals format change (#299). Total amounts now use signed_amount.json — a format that can represent both positive and negative values (for discounts, refunds). If you're reading totals as simple numbers, you'll need to handle the new format.

Identity linking reverted (#329). A previously planned identity linking change was reverted. If you implemented against an earlier draft, verify your identity handling matches the released spec.

What didn't ship

Worth noting what's still in progress. Loyalty capabilities (#251) and return extensions (#257) were tracked for this release but didn't make the cut. The cart capability landed; the full post-purchase lifecycle is still forming.

Per-capability versioning — the ability to bump individual capabilities without bumping the entire protocol — was discussed at the March TC meeting but deferred. The infrastructure for sub-repo versioning is being explored by TC members, but for now, breaking changes still bundle into full protocol bumps.

23 contributors, 15 first-timers

This release had contributions from 23 people, 15 of whom were first-time contributors. The contributor base has broadened beyond the founding companies: documentation fixes from independent developers, schema improvements from payment processors, and tooling contributions from platform teams.

Notable additions: endorsed partners now include Block, Fiserv, Klarna, Splitit, Affirm, and Checkout.com — payment infrastructure companies whose involvement signals where agent commerce payment flows are heading.

What to do now

If you're a store operator: Check your store at UCPChecker.com. We've updated our validation to v2026-04-08 rules, so you'll see an accurate assessment against the new spec. If you're on Shopify, your platform will handle the migration — watch for their update timeline.

If you're an agent developer: Start building for carts and catalog search. These capabilities will roll out through platform-level updates, and when they do, the adoption curve will look like checkout's did — slow for a few weeks, then near-universal overnight. The UCP Playground is the place to test agent interactions against stores that adopt early.

If you're a platform team: The business-vs-platform profile split is the structural change to focus on. Your merchants' profiles just got simpler (fewer required fields), but your platform profile got stricter (spec and schema URLs are mandatory). Review the spec version details and validate your platform-level profile against the new schema.

If you're building tooling: Update your validators. The signing_keys location change, business profile relaxation, and new capability schemas all affect validation logic. We've open-sourced our approach — check the methodology page for how UCPChecker handles version-aware validation.

The bigger picture

v2026-01-23 gave us checkout. v2026-04-08 gives us the rest of the shopping experience: discovery, carts, signing, and the groundwork for trust and authorization. The protocol is filling in the gaps between "an agent can technically buy something" and "an agent can shop the way a human does."

We reported in March that the gap between "has a manifest" (87%) and "an agent can actually buy something" (45% checkout rate) is where the real work lives. This spec release addresses the structural reasons for that gap — not by making checkout better, but by giving agents the capabilities they need for the steps before and after checkout.

We're tracking the v2026-04-08 migration wave across all monitored domains. Check the stats page for real-time adoption data, and subscribe to the weekly report for ecosystem updates as stores begin the transition.

Analysis based on the UCP v2026-04-08 release, published April 9, 2026. UCPChecker validation rules updated same day.