<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Phil Stafford</title>
    <description>The latest articles on Forem by Phil Stafford (@phil_stafford).</description>
    <link>https://forem.com/phil_stafford</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790269%2F2553c05e-e006-4630-a36b-3f381ccb9c2a.PNG</url>
      <title>Forem: Phil Stafford</title>
      <link>https://forem.com/phil_stafford</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/phil_stafford"/>
    <language>en</language>
    <item>
      <title>Anthropic Just Published a Kill Chain for AI Model Theft. Let's Break It Down.</title>
      <dc:creator>Phil Stafford</dc:creator>
      <pubDate>Wed, 25 Feb 2026 13:15:35 +0000</pubDate>
      <link>https://forem.com/phil_stafford/anthropic-just-published-a-kill-chain-for-ai-model-theft-lets-break-it-down-42nm</link>
      <guid>https://forem.com/phil_stafford/anthropic-just-published-a-kill-chain-for-ai-model-theft-lets-break-it-down-42nm</guid>
      <description>&lt;p&gt;&lt;em&gt;Attack patterns, detection challenges, and defensive gaps from the industrial-scale distillation campaigns against Claude.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On February 24, Anthropic dropped a detailed report attributing industrial-scale distillation campaigns against Claude to three Chinese AI labs: DeepSeek, Moonshot AI, and MiniMax. The numbers: 24,000 fraudulent accounts, 16+ million exchanges, targeting reasoning, agentic tool use, coding, and computer vision.&lt;/p&gt;

&lt;p&gt;The geopolitical framing is getting all the coverage. This piece is about the technical content — because what Anthropic actually published is a kill chain analysis for AI model capability extraction, and there are concrete takeaways for anyone building or defending systems that expose model capabilities through APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Attack Surface: Your Model's Output IS the Exfiltration Channel
&lt;/h2&gt;

&lt;p&gt;Traditional data exfiltration moves data out through network channels, side channels, or compromised endpoints. Distillation flips this: the exfiltration channel is the product's intended interface. Every API response is a potential training sample. The model's designed behavior &lt;em&gt;is&lt;/em&gt; the thing being stolen.&lt;/p&gt;

&lt;p&gt;This means conventional API security — rate limiting, authentication, payload inspection, WAF rules — addresses the wrong layer of the problem. A distillation query is syntactically and semantically identical to a legitimate query. The signal isn't in individual requests. It's in the aggregate pattern across thousands of accounts and millions of interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Tiers of Extraction
&lt;/h2&gt;

&lt;p&gt;Anthropic's report describes increasingly sophisticated extraction techniques that map to different training objectives. Each tier extracts a different kind of value and needs a different detection approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Supervised Fine-Tuning Data&lt;/strong&gt;&lt;br&gt;
The baseline approach. Generate diverse prompts, collect high-quality responses, use the (input, output) pairs as training data. This is what the bulk of MiniMax's 13 million exchanges likely comprised — volume-oriented harvesting of agentic coding and tool-use responses. Detection signal: high volume, narrow capability focus, repetitive structural patterns across distributed accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2: Chain-of-Thought Extraction&lt;/strong&gt;&lt;br&gt;
More targeted. Anthropic specifically called out DeepSeek prompts that asked Claude to "imagine and articulate the internal reasoning behind a completed response and write it out step by step." This isn't harvesting outputs — it's harvesting the reasoning process. The resulting data is more valuable because it captures intermediate reasoning steps, not just final answers. If you've followed the lineage from Chain-of-Thought Prompting through to process reward models, you know why this matters. Detection signal: prompts that consistently request step-by-step reasoning, explanation of decision processes, or verbalization of internal logic — especially at scale across coordinated accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3: Reward Model Construction&lt;/strong&gt;&lt;br&gt;
The most sophisticated tier. DeepSeek used Claude for "rubric-based grading tasks" — using the target model as a reward model for reinforcement learning. They weren't extracting Claude's outputs for training data. They were extracting Claude's &lt;em&gt;evaluative judgments&lt;/em&gt; as a training signal. This is efficient as hell: you need far fewer reward model samples than supervised training samples to meaningfully improve a model via RL. Detection signal: evaluation-style prompts, scoring rubrics, comparison tasks, and preference judgments at scale.&lt;/p&gt;

&lt;p&gt;Each tier gets you more value per query. A well-designed campaign uses all three in sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure: Hydra Clusters and Traffic Mixing
&lt;/h2&gt;

&lt;p&gt;Anthropic describes the proxy infrastructure as "hydra clusters" — networks managing 20,000+ fraudulent accounts simultaneously across their API and third-party cloud platforms. Here's how they operate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No single points of failure.&lt;/strong&gt; Account bans are immediately backfilled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic mixing.&lt;/strong&gt; Distillation queries are blended with legitimate customer traffic from the same proxy network, making behavioral isolation harder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-pathway access.&lt;/strong&gt; Campaigns spanned multiple account types (educational, research, startup programs) to diversify their access surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive targeting.&lt;/strong&gt; When Anthropic released a new model mid-campaign, MiniMax pivoted within 24 hours — redirecting roughly half their traffic to the updated system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've worked botnet detection or large-scale scraping defense, this architecture is familiar. The novelty is the target, not the tactics. But detection is harder here because individual request payloads aren't anomalous — there's no SQLi signature, no malformed header, no obvious abuse pattern at the request level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Detection Engineering Problem
&lt;/h2&gt;

&lt;p&gt;This is the most valuable part of the disclosure for practitioners. Anthropic describes building "classifiers and behavioral fingerprinting systems" for detection. Here's what that actually takes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature engineering at the account-behavior level, not the request level.&lt;/strong&gt; You need to aggregate across accounts and time windows to identify: topic concentration (is this account only hitting one narrow capability area?), structural repetition (are prompt templates being reused with variation?), and temporal coordination (are accounts exhibiting synchronized behavior?).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-account correlation.&lt;/strong&gt; The hydra cluster architecture means you need entity resolution across accounts that may share no obvious identifiers. Shared payment methods, timing patterns, prompt structural similarity, and infrastructure indicators (IP ranges, client fingerprints) become your linkage signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distinguishing distillation from power users.&lt;/strong&gt; A legitimate developer building an AI-powered product might generate high-volume, focused traffic that superficially resembles distillation. Your classifier needs features that capture the &lt;em&gt;training data generation&lt;/em&gt; intent — prompt variation patterns that suggest systematic coverage of a capability space rather than production workload patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chain-of-thought elicitation detection.&lt;/strong&gt; Anthropic mentions this specifically. Prompts that consistently request externalization of reasoning processes, especially when the structure suggests the output is being collected for training rather than being consumed by an end user.&lt;/p&gt;

&lt;p&gt;The false positive problem is real. Legitimate evaluation and benchmarking, red-teaming, and research use can all look like distillation at certain scales. Any detection system here needs careful tuning to avoid punishing your heaviest legitimate users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defensive Countermeasures and Their Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Anthropic mentions "model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers." They don't get specific, but here's what that likely means:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output perturbation.&lt;/strong&gt; Injecting subtle noise into outputs that degrades their utility as training data without being noticeable to humans. Tradeoff: any perturbation that hurts training utility can also hurt downstream applications that depend on deterministic or consistent model behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watermarking.&lt;/strong&gt; Embedding statistical signatures in model outputs that can be detected in models trained on those outputs. Kirchenbauer et al. and subsequent work showed promise, but also demonstrated that watermarks can be removed or diluted through post-processing. Works against casual distillation. Probably not enough against actors at this level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selective capability gating.&lt;/strong&gt; Restricting access to the model's most valuable capabilities (extended reasoning, tool use, agentic behaviors) based on account trust level. Zero-trust applied to model capabilities — you earn access to higher-value outputs through demonstrated legitimate use. Tradeoff: friction on legitimate onboarding, which is exactly the pathway these attackers exploited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning trace obfuscation.&lt;/strong&gt; If chain-of-thought extraction is a primary vector, you can modify how the model exposes its reasoning — summarizing instead of showing step-by-step traces, or varying the structure of reasoning outputs to reduce their consistency as training data. Tradeoff: reasoning transparency is a feature, not a bug. A lot of legitimate users are paying for exactly this.&lt;/p&gt;

&lt;p&gt;None of these are silver bullets. The core problem: the same properties that make model outputs valuable to legitimate users — quality, consistency, reasoning depth — make them valuable as training data. Any defense that degrades training utility is going to degrade product utility too. That's the tradeoff nobody's solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means If You're Building
&lt;/h2&gt;

&lt;p&gt;If you're exposing any model capability through an API — whether you're a frontier lab or a company running fine-tuned models for your own domain — this is now a documented threat pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI vendor risk assessment needs a provenance question.&lt;/strong&gt; If you're consuming AI capabilities from third-party providers, understanding how their models were trained is a security question now. A model built through illicit distillation may have had safety alignment degraded in the process. This isn't theoretical — Anthropic's report says directly that safety guardrails are unlikely to transfer faithfully through distillation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP and agent ecosystems expand the extraction surface.&lt;/strong&gt; As AI systems get more agentic — calling tools, executing code, orchestrating multi-step workflows — the capability surface available for distillation grows. Moonshot and MiniMax specifically targeted agentic reasoning and tool use. Any trust framework for agent-to-agent or agent-to-service communication (like MCP) needs to account for the possibility that one endpoint in the chain is conducting capability extraction rather than legitimate interaction. This is the supply chain trust problem applied to model intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting is necessary but not sufficient.&lt;/strong&gt; Per-account rate limits are trivially defeated by hydra cluster architecture. Behavioral rate limiting — throttling based on detected extraction patterns rather than raw volume — is closer to what's needed, but that requires the detection engineering investment described above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an arms race.&lt;/strong&gt; MiniMax pivoting to a new model release within 24 hours tells you these campaigns adapt in real time. Static defenses will get outpaced. This needs the same continuous detection and response investment we'd apply to any sophisticated threat actor. Treat it like one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic Didn't Say
&lt;/h2&gt;

&lt;p&gt;Worth flagging a few gaps.&lt;/p&gt;

&lt;p&gt;The report doesn't address whether distillation was detected in real-time or through retrospective analysis. The MiniMax campaign was caught "while it was still active," but the DeepSeek and Moonshot timelines are less clear. That distinction matters a lot: real-time detection enables intervention. Retrospective analysis gives you attribution but the horse has already left.&lt;/p&gt;

&lt;p&gt;There's no discussion of whether extracted capabilities were actually confirmed in the resulting models. Anthropic draws the connection between distillation campaigns and the labs' product roadmaps, but proving that specific capabilities in DeepSeek V4 or Kimi originated from Claude distillation is a different problem entirely — you'd need model output comparison, behavioral fingerprinting of deployed models, or watermark detection. That's the smoking gun they don't have yet, at least not publicly.&lt;/p&gt;

&lt;p&gt;And the report is silent on distillation from non-Chinese actors. This is almost certainly happening — distillation is a technique, not a nationality — but only campaigns attributed to Chinese labs made the cut. Understandable given the policy context and the export control debate, but incomplete as a threat picture. If you're building defenses based on this report, don't scope them to one country of origin.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Phil Stafford is an AI security researcher and Principal Consultant at Singularity Systems. He builds tools for securing AI agent ecosystems, including ThinkTank (multi-agent structured dissent for security analysis) and Credence (cryptographic trust registry for MCP server validation). He writes about AI security on &lt;a href="https://medium.com/@pe.stafford" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; and speaks on adversarial AI and agent security at industry conferences.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cybersecurity</category>
      <category>llm</category>
      <category>news</category>
    </item>
    <item>
      <title>Someone Cloned an Oura Ring MCP Server and Poisoned the Supply Chain. We Can Fix This.</title>
      <dc:creator>Phil Stafford</dc:creator>
      <pubDate>Tue, 24 Feb 2026 20:48:31 +0000</pubDate>
      <link>https://forem.com/phil_stafford/someone-cloned-an-oura-ring-mcp-server-and-poisoned-the-supply-chain-we-can-fix-this-2fcc</link>
      <guid>https://forem.com/phil_stafford/someone-cloned-an-oura-ring-mcp-server-and-poisoned-the-supply-chain-we-can-fix-this-2fcc</guid>
      <description>&lt;h2&gt;
  
  
  The attack didn’t exploit a vulnerability. It exploited the fact that nobody’s checking who actually wrote the tools we’re installing.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;by Phil Stafford&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Note: This is a reprinting of an article I published in &lt;a href="https://medium.com/@pe.stafford/someone-cloned-an-oura-ring-mcp-server-and-poisoned-the-supply-chain-we-can-fix-this-931acfaac8e3" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; on Feb. 18, 2026.&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;On February 5th, Straiker’s STAR Labs team dropped research that made me sit up straight. A supply chain attack against the MCP ecosystem. Not a smash-and-grab. This one was patient. Months of setup, completely invisible until Straiker caught it.&lt;/p&gt;

&lt;p&gt;Not a zero-day. Not some new class of exploit. Something much older and much dumber: fake it till you make it, applied to malware distribution. A threat actor cloned a legitimate MCP server, built a fake GitHub ecosystem around it, and got it listed on MCP Market. A developer searching for an Oura Ring integration would have found it, seen the forks, seen the contributors, and installed it without thinking twice.&lt;/p&gt;

&lt;p&gt;And it would have stolen everything on their machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknmv31gizv4a890k75tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknmv31gizv4a890k75tw.png" alt="A raccoon stealing files while a dev watches his installation succeed excitedly" width="800" height="436"&gt;&lt;/a&gt;The download looked legit. The server works perfectly. The raccoon was very polite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Playbook
&lt;/h2&gt;

&lt;p&gt;The target was Tomek Korbak’s Oura Ring MCP server. Connects your AI assistant to your health data, sleep scores, readiness metrics. Korbak works at OpenAI. Legit project. Exactly the kind of thing a developer who tracks their HRV and sleep stages would install before breakfast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmartLoader&lt;/strong&gt; — a malware operation that used to distribute infostealers through pirated software — saw an opportunity. Developer workstations are treasure chests. API keys, cloud credentials, SSH keys, crypto wallets, production access. Why bother phishing when you can get developers to install your code voluntarily?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w1ytti6ldqbsvw5p42k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w1ytti6ldqbsvw5p42k.png" alt="A lineup of robots wearing disguises" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Straiker’s research (credit where it’s due, they did the detective work here) documents the whole operation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Target selection.&lt;/strong&gt; Pick a server that appeals to developers specifically. Health optimization tools. Sleep tracking. The Oura Ring crowd. These people have AWS keys and crypto wallets sitting on the same machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Build the ecosystem.&lt;/strong&gt; A primary account, YuzeHao2023, creates a clean fork. Four more accounts fork from that. Instant appearance of organic community interest. The accounts are obviously fake if you know what to look for: recent creation dates, similar activity patterns, commits clustered in the same timeframes. But who looks? They also forked other projects from YuzeHao2023, creating a web of cross-references so each account looks more established. This took months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3 — Deploy the payload.&lt;/strong&gt; A new account, SiddhiBagul, creates the trojanized version. Source code matches the original. Documentation is complete. Contributor list includes the fake accounts. And they did not include Tomek Korbak, the actual author.&lt;/p&gt;

&lt;p&gt;Straiker called this the smoking gun. A legitimate fork would credit the original creator. The deliberate exclusion confirms a single threat actor running the whole show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4 — Registry poisoning.&lt;/strong&gt; Submit the trojanized server to MCP Market. That’s it. That’s the whole barrier to entry. It gets listed alongside legitimate tools and nobody asks who actually wrote it.&lt;/p&gt;

&lt;p&gt;The payload was a resource.txt file containing heavily obfuscated LuaJIT that deployed StealC. Browser passwords. Discord tokens. Crypto wallets. Cloud session tokens. SSH keys. The works.&lt;/p&gt;

&lt;p&gt;The persistence mechanism was a nice touch, too: scheduled tasks masquerading as Realtek audio drivers. Every SOC analyst on earth is trained to ignore Realtek processes. That’s not even hacking at that point. That’s just knowing how tired your adversary is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tk0o3vk11pmhgpm6drt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tk0o3vk11pmhgpm6drt.png" alt="A blob walking through a security checkpoint wearing a Realtek uniform" width="800" height="436"&gt;&lt;/a&gt;“Hey Bob, come on in.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tooling Gap
&lt;/h2&gt;

&lt;p&gt;Here’s what gets me about this attack: the MCP ecosystem doesn’t have the tooling to catch it. Not “didn’t have.” Doesn’t have. Present tense.&lt;/p&gt;

&lt;p&gt;Think about what a developer actually sees when they’re evaluating this server. Code works. Documentation looks fine. Forks exist. Contributors exist. Source matches the original. Every signal we tell developers to check (stars, forks, contributor count, documentation quality) was fake. Every single one.&lt;/p&gt;

&lt;p&gt;Stars can be bought for pocket change. Forks are free. And the MCP ecosystem is still in its “HTTP before TLS” phase, with the protocol growing way faster than its security story. The spec itself says tool descriptions “should be considered untrusted, unless obtained from a trusted server.” Great. So how does a developer know if a server is trusted? Right now? They don’t.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The MCP ecosystem lacks the security infrastructure that has developed around traditional package managers. There is no equivalent to npm audit, Dependabot, or Snyk for MCP servers. — Straiker report&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The MCP Registry is a metadata catalog, and that’s appropriate for this stage of the ecosystem. Community moderation catches obvious malware. But SmartLoader didn’t deploy obvious malware. They deployed a perfectly functional Oura Ring integration that also stole your credentials. That’s a harder problem, and it requires different tooling.&lt;/p&gt;

&lt;p&gt;Straiker’s recommendation? “Verify provenance deeply” and “check account creation dates.” Sure. That’s good advice if you have the time and discipline to do it for every server you install. Nobody does.&lt;/p&gt;

&lt;p&gt;We’ve solved this before. npm did it. Docker did it. Sigstore, SBOMs, provenance attestations. The supply chain security stack exists. It just doesn’t reach AI tools yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq4nzp34fg1nwn2e4gza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq4nzp34fg1nwn2e4gza.png" alt="castle with moat and defenses for Docker and npm, a meadow with a lemonade stand with a " width="800" height="436"&gt;&lt;/a&gt;“The free lemonade is a nice touch.”&lt;/p&gt;

&lt;h2&gt;
  
  
  How we fix this.
&lt;/h2&gt;

&lt;p&gt;I’ve been building something for this. &lt;a href="//credence.securingthesingularity.com"&gt;Credence&lt;/a&gt; is a cryptographic attestation system for AI tools: MCP servers, OpenClaw skills, Claude Desktop extensions (soon), and whatever comes next as the ecosystem evolves. I wrote about this class of attack in “Poisoned Pipelines” on my Medium page a few weeks ago. The SmartLoader/Oura incident is basically the proof of concept I was hoping wouldn’t show up this fast.&lt;/p&gt;

&lt;p&gt;I want to be specific about how Credence addresses this, because vague claims about “trust” aren’t useful when the attack chain is this concrete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code fingerprinting.&lt;/strong&gt; Credence hashes every file in the codebase with SHA-256 and rolls those hashes into a single deterministic fingerprint, pinned to the exact git commit. That fingerprint becomes part of a signed attestation covering the score, verdict, and authorship data. You want to install a server? Hash the code yourself and compare. If they don’t match, the code changed since we analyzed it. Walk away. In the SmartLoader case, the trojanized version with resource.txt added would produce a completely different hash. Instant red flag.&lt;/p&gt;

&lt;p&gt;SmartLoader’s source code actually matched the original for the most part, though. The payload was in the release archive, not the repo source. So source hashing alone isn’t enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author identity binding.&lt;/strong&gt; Credence pulls the claimed author from package.json, pyproject.toml, the git remote, and the GitHub API. For forks, it cross-checks the repo owner against the package metadata author and checks whether the original author was kept in the contributor list.&lt;/p&gt;

&lt;p&gt;SiddhiBagul/MCP-oura: repo owner is SiddhiBagul. Package author is Tomek Korbak. Mismatch on a fork. Credence records it. Does the original author appear in the fork’s contributors? No. Because SmartLoader deliberately cut Korbak out.&lt;/p&gt;

&lt;p&gt;That combination (fork, original author excluded, recently created account) is not ambiguous. That’s a supply chain attack profile. Credence would light up like a Christmas tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial AI analysis.&lt;/strong&gt; Credence doesn’t just run scanners and hand you the output. Five AI agents with different security mandates argue about what the findings actually mean. A skeptic agent trained to look for supply chain attack patterns looks at those provenance flags and constructs the worst-case scenario. See my &lt;a href="//medium.com/@pe.stafford"&gt;previous articles&lt;/a&gt; and my &lt;a href="https://home.mlops.community/home/videos/when-ai-agents-argue-structured-dissent-patterns-for-production-reliability-phil-stafford-2025-11-27" rel="noopener noreferrer"&gt;presentation&lt;/a&gt; at MLOps’ Agents in Production conference.&lt;/p&gt;

&lt;p&gt;Most SAST tools don’t have rules for “obfuscated Lua bytecode loaded from a text file.” That payload would sail right through Semgrep and Bandit. But the provenance signals alone (identity mismatch, excluded original author, brand-new account, fork with a mystery payload file) would be enough for the skeptic agent to argue rejection. That’s what the debate gives you that static tools can’t: the ability to look at a stack of individually-iffy signals and say “no, taken together, this is an attack.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The registry.&lt;/strong&gt; With Credence in the picture, a registry like MCP Market would have had actual data to work with. Not just “is this server listed” but “who wrote it, does that check out, what’s the trust score, and can you verify any of this cryptographically?”&lt;/p&gt;

&lt;p&gt;SiddhiBagul/MCP-oura either wouldn’t have a Credence attestation at all (which is itself a signal) or it’d have one with a low trust score and a pile of provenance warnings. Either way, the developer has &lt;em&gt;information&lt;/em&gt; instead of vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beyond MCP servers.&lt;/strong&gt; The AI tool ecosystem is growing fast and in multiple directions. OpenClaw skills, Claude Desktop extensions, and whatever comes after them all share the same supply chain trust problem. Credence already covers OpenClaw skills in the registry, using the same scanning pipeline and attestation model, and we’re adding new tool types as they emerge. The attack surface isn’t limited to MCP servers, and the verification layer shouldn’t be either.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn’t Solve
&lt;/h2&gt;

&lt;p&gt;I’d rather you hear the limitations from me than figure them out in a postmortem.&lt;/p&gt;

&lt;p&gt;Credence is install-time only. It tells you whether to trust a server before you run it. Once you install it, you’re on your own. Credence doesn’t monitor runtime behavior. If a legitimate server gets compromised six months after attestation, Credence won’t catch that. Indirect prompt injection, cross-server orchestration attacks. Different problems, different tools.&lt;/p&gt;

&lt;p&gt;Runtime enforcement is its own problem and other people are working on it: Docker’s MCP Catalog, ToolHive, Solo.io’s Agent Mesh, Acuvity’s runtime guardrails. Credence is complementary. We tell you what to trust before install. They keep an eye on it after.&lt;/p&gt;

&lt;p&gt;And yeah, a determined attacker could submit their trojanized server to Credence itself for analysis. I can’t stop that. But the attestation would carry their identity, not the original author’s. The provenance flags would still fire. The deliberation would still flag it. You can’t game the system without leaving fingerprints, and Credence is specifically designed to look for fingerprints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8th01avbr5pj2b7na5fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8th01avbr5pj2b7na5fn.png" alt="Before: Magic 8-ball to determine security. After Credence, a certified attestation." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Accounts and Some Patience
&lt;/h2&gt;

&lt;p&gt;MCP prioritized capability and adoption first, and that was the right call. You have to ship before you can secure. But right now, the moment you decide to install an MCP server is basically your entire security boundary. Stdio transport has no authentication — that’s by design, not a bug. So the install decision is it. And we’re making that decision based on GitHub stars and README quality.&lt;/p&gt;

&lt;p&gt;SmartLoader proved those signals can be manufactured in three months with five fake accounts.&lt;/p&gt;

&lt;p&gt;That’s the current cost of breaching the MCP supply chain. Five accounts and some patience.&lt;/p&gt;

&lt;p&gt;Straiker caught this one. Their STAR Labs team did excellent work tracing the infrastructure, attributing the campaign, documenting the kill chain. But their own report says it plainly: “The MCP ecosystem lacks the security infrastructure that has developed around traditional package managers. There is no equivalent to npm audit, Dependabot, or Snyk for MCP servers.”&lt;/p&gt;

&lt;p&gt;That’s the gap. Credence is built to fill it. Not with more social signals that can be manufactured, but with cryptographic attestation: source fingerprints, verified authorship, adversarial analysis that actually argues about what the findings mean.&lt;/p&gt;

&lt;p&gt;The next SmartLoader won’t target a sleep tracker. It’ll go after a database connector, or a deployment tool, or something that touches your CI pipeline. And the playbook is public now. Next time it won’t take three months.&lt;/p&gt;

&lt;p&gt;We need the verification layer before that happens.&lt;/p&gt;

&lt;p&gt;I’m building it. It’s called Credence. The registry, scanning pipeline, and client tools are open source: credence.securingthesingularity.com&lt;/p&gt;

&lt;p&gt;Running MCP servers? Check your setup. Building one? Submit for a scan.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Phil Stafford is a cybersecurity consultant at &lt;a href="//securingthesingularity.com"&gt;Singularity Systems&lt;/a&gt; in the San Francisco Bay Area. He’s currently building Credence, a cryptographic trust registry for AI tools. When he’s not yelling about supply chain security, he’s a musician and artist making art in a post-AI world.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Straiker’s full research report on the SmartLoader/Oura Ring attack is available at &lt;a href="https://www.straiker.ai/blog/smartloader-clones-oura-ring-mcp-to-deploy-supply-chain-attack" rel="noopener noreferrer"&gt;straiker.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;© 2026 Phil Stafford&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>cybersecurity</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
