Forem: Sahil Singh

Why urlparse() isn't a guard

Sahil Singh — Thu, 16 Apr 2026 10:00:10 +0000

Why `urlparse()` isn't a guard

A lot of code looks like this:

def fetch_tool(url: str) -> str:
    parsed = urlparse(url)
    return httpx.get(url).text

The author parsed the URL, so the URL is validated. Right?

No. urlparse() is a parser. It tells you what the pieces of a URL are. It does not tell you whether you should fetch it. If url is http://169.254.169.254/latest/meta-data/, urlparse() returns a perfectly valid ParseResult and httpx.get() cheerfully fetches AWS metadata credentials from inside your VPC.

This is the SSRF class of bug. It's boring. It's also the thing that keeps showing up in MCP servers — tools that accept a URL, fetch it server-side, return the body to the model. The model decides what URL to fetch based on untrusted input (a prompt, a doc, a tool response). So the URL is attacker-controlled by construction.

When we wrote the SSRF check for mcp-scan (MCPA-060), the hard part wasn't finding httpx.get(url). The hard part was deciding what counts as a guard. I want to walk through that decision, because the answer is narrower than most people expect and it changes how you write the fix.

What the check actually flags

The check triggers on HTTP fetch calls (httpx.get, requests.post, urllib.request.urlopen, etc.) where:

The URL argument is a variable, not a string literal.
The enclosing function has no recognized host validation tied to that variable.

A string literal like httpx.get("https://api.github.com/user") is fine — the developer hardcoded the host. A variable URL with no guard is not fine. The interesting question is the second condition: what is a "recognized guard"?

Accepted: hostname membership against a trusted collection

The primary pattern the check accepts:

def fetch_tool(url: str) -> str:
    parsed = urlparse(url)
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError("host not allowed")
    return httpx.get(url).text

Three things have to be true for this to count:

The URL variable flows into urlparse() or urlsplit() and the result is bound to a name.
result.hostname or result.netloc appears in a Compare node with in or not in.
The other side of the comparison is a trusted collection.

That last bullet is where most of the logic lives.

What counts as a trusted collection

Three things are accepted as the container side of the membership test:

A literal. parsed.hostname in {"api.example.com", "api.stripe.com"}. The allowlist is right there in the source. Nothing ambiguous.

A local name whose every assignment is a literal collection. If a function does ALLOWED = {"host1"}; if debug: ALLOWED = {"host1", "host2"}, both branches assign literals, so the name is trusted. If any branch assigns from a non-literal (ALLOWED = load_from_request(request)), the name is rejected — fail closed.

A bare name that is not assigned locally and is not a parameter. This is the module-scope case: ALLOWED_HOSTS = {...} at the top of the file, referenced from inside the function. The check trusts this because module-scope names are almost always developer-controlled constants. It's trust-based. ALLOWED_HOSTS = load_policy_from_env() at module scope would false-clean. Fixing that honestly would require whole-file analysis, which is out of scope for a check that runs in seconds.

What doesn't count

This is where it gets interesting, because the rejections are the part that most linters and security tools get wrong.

Function parameters are rejected. If someone writes:

def fetch_tool(url: str, allowed_hosts: set[str]) -> str:
    parsed = urlparse(url)
    if parsed.hostname not in allowed_hosts:
        raise ValueError("no")
    return httpx.get(url).text

The check fires. Why? Because allowed_hosts is attacker-controlled by construction — the caller passes it in. In an MCP server, the caller is usually the model, and the model is reading attacker input. A "guard" that reads its allowlist from the same context that chose the URL is not a guard. The check explicitly collects every parameter (positional, keyword-only, *args, **kwargs) and refuses to trust any of them as a container.

Equality is rejected. parsed.hostname == "api.example.com" is not accepted, only in / not in. Equality against a single literal is technically safe, but it collapses into a pattern that's hard to distinguish from garbage like parsed.scheme == "https" (which isn't a host guard at all). Narrowing the check to membership against a collection makes the accept rule cleanly describable. If you have a one-host allowlist, write in {"api.example.com"}. It reads better anyway.

Attribute chains are rejected. parsed.hostname in request.headers["X-Allowed"] gets flagged. The container lives in request state, which is attacker-controllable or at least not statically verifiable.

DNS resolution alone is rejected. Calling socket.gethostbyname(host) without inspecting the result proves nothing. An attacker can DNS-rebind or point at an internal IP. The check doesn't treat "we looked up the name" as validation — only "we compared the result to a trusted set" counts.

Two secondary patterns

Two other patterns the check accepts:

ipaddress family checks on a URL-derived attribute:

parsed = urlparse(url)
if ipaddress.ip_address(parsed.hostname).is_private:
    raise ValueError("no private")

Specifically, the check looks for a call to a method named is_private, is_loopback, or is_reserved where the argument is parsed.hostname or parsed.netloc. This is narrower than it could be — ipaddress.ip_address(parsed.hostname).is_private requires tracking the intermediate object, which is multi-hop dataflow. We don't do that. If you write it as checker.is_private(parsed.hostname) with the hostname passed directly, we catch it. If you chain it through an intermediate object, we miss the guard and false-positive. That's a documented limitation.

Helper-name guards:

if not validate_url(url):
    raise ValueError("no")
return httpx.get(url).text

Calls to functions named validate_url, check_url, allowed_host, or is_allowed with the URL variable as an argument are trusted. This is the most generous of the three patterns — the check has no idea what validate_url actually does. It could be return True. But false-positives on URL handling code with custom validators were painful enough in testing that we accept the heuristic and document it.

The four honest limitations

If you read the check's docstring in source_code.py, you'll see four limitations called out explicitly:

Single-hop dataflow. We trace url → urlparse(url) → parsed.hostname. We don't trace through intermediate variables beyond that. host = parsed.hostname; if host in ALLOWED would miss.
Helper-name trust. validate_url(url) is accepted without looking inside the helper. A badly-named no-op would false-clean.
Module-scope trust. A module-level name is assumed to be a developer-controlled constant. Dynamic globals break this.
No DNS resolution as guard. We don't accept name resolution as a stand-in for policy enforcement. (This is actually correct — but it means tools that claim to guard via DNS are flagged.)

These are in the check's description string. They ship with every finding. That matters because "here's a false positive" is a different conversation than "here's an undocumented gap in the tool."

What this changes about how you write the fix

If you were going to patch your MCP server's SSRF exposure, the version that passes mcp-scan looks like:

ALLOWED_HOSTS: set[str] = {"api.example.com", "api.stripe.com"}

def fetch_tool(url: str) -> str:
    parsed = urlparse(url)
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError("host not allowed")
    return httpx.get(url).text

Module-scope literal set. Membership test. hostname attribute of a urlparse result. Every piece maps to a rule the check understands.

The version that looks like a fix but doesn't pass:

def fetch_tool(url: str, allowed: set[str]) -> str:
    parsed = urlparse(url)
    if parsed.hostname not in allowed:
        raise ValueError("host not allowed")
    return httpx.get(url).text

Same shape. Parameter instead of module constant. Check fires — because the tool is right. allowed is whatever the caller passed. In an agent context, the caller is the model, and the model reads attacker input.

The meta-point

Security checks that report "URL fetch without validation" don't give you a remediation. They give you a vibe. A developer who reads the finding and adds urlparse() has done nothing and the tool has no way to tell them.

The useful version of the check has to commit to a position on what counts. That commitment is the hard part. You'll be wrong sometimes — a valid guard using an AST shape you didn't anticipate, or a module-scope name that turns out to be dynamic. You'll false-positive real code and false-clean bad code. The discipline is documenting the shape you accept, documenting the shape you reject, and letting a developer read the check and understand why their code was flagged.

urlparse() isn't a guard. Neither is validate(). Neither is if host:. The guard is a membership test against a collection whose contents you control.

mcp-scan is an open-source AST-level security scanner for MCP servers. The SSRF check discussed here is MCPA-060. If you run MCP tools in production and want the check to run on your source, pip install mcp-scan and point it at your repo.

We built an open-source security scanner for MCP servers

Sahil Singh — Mon, 13 Apr 2026 03:39:42 +0000

MCP servers are the new attack surface. Every agent that mounts a GitHub MCP server, a filesystem MCP server, or a custom tool server is trusting that server's tool descriptions, input schemas, and handler code. Most of that trust is misplaced.

We built mcp-scan — an open-source CLI that connects to an MCP server via stdio, introspects its tool manifest, and runs deterministic security checks against both the protocol metadata and the Python source code.

Install: pip install velox-mcp-scan
Repo: github.com/veloxlabsio/mcp-scan
Demo page: veloxlabs.dev/mcp-scan

What it catches

6 checks ship today. Two work against any MCP server (no source access needed). Four require pointing at the server's Python source with --source.

Protocol-level:

MCPA-001 (Critical) — Prompt-injection markers in tool descriptions. Imperative verbs, <system> tags, exfiltration phrases. This is the Trail of Bits "line jumping" attack — the payload fires when the agent connects, before any tool call.
MCPA-002 (High) — ANSI escapes, C0 control chars, zero-width characters hiding payloads in descriptions. The terminal renders them invisible; the LLM reads them as instructions.

Source-code AST:

MCPA-010 (Critical) — Path traversal in file handlers. Flags open() / read_text() on user-derived paths without is_relative_to() containment. resolve() alone is not sufficient — that's the EscapeRoute bypass (CVE-2025-53109/53110).
MCPA-012 (Critical) — Shell injection. subprocess with shell=True and dynamic command strings. Catches the pattern from CVE-2025-68144 (Anthropic Git MCP).
MCPA-060 (High) — SSRF sinks. HTTP client calls (httpx, requests, urllib) with variable URLs. The guard detection does lightweight dataflow tracking — urlparse(url) alone doesn't count as validation. It traces the URL variable through parse results to hostname membership checks against trusted collections.
MCPA-070 (High) — Hardcoded secrets. Known prefixes (sk-, ghp_, AKIA, xoxb-) and high-entropy strings in secret-named variables.

19 more checks are planned for v0.1 — see docs/checks.md for the full roadmap.

Design decisions that matter

Fail-closed. If introspection fails — server hangs, tools/list errors, timeout — the scanner produces a CRITICAL finding, not an empty clean report. A security tool that silently passes on errors is worse than no tool.

No LLM required. Every check is deterministic AST analysis and pattern matching. No API keys, no cloud calls, no probabilistic scoring. Runs fully offline in ~2 seconds.

Capability-aware introspection. If a server only advertises tools (not resources or prompts), a failed resources/list call is informational, not critical. The scanner respects what the server actually claims to support.

Dataflow-tracked SSRF guards. This was the hardest check to get right. Early versions counted urlparse() anywhere in the function as "guarded." Six review rounds later, the detection traces url → urlparse(url) → parsed → parsed.hostname in ALLOWED_HOSTS and validates that the membership target is a trusted collection (literal set, all-literal local variable, or module-scope constant). Equality comparisons (==) are rejected — only in / not in count. Local aliases of function parameters or attributes are rejected. Any non-literal assignment in any branch poisons a local name.

Four limitations are honestly documented: single-hop tracking, trust-based helper names (validate_url(url)), module-scope over-trust, and no DNS resolution as a guard pattern.

Try it in 30 seconds

The repo ships with vulnerable-mcp — a deliberately broken MCP server with 5 planted vulnerabilities. The scanner catches all 5 (7 findings total, zero false positives).

pip install velox-mcp-scan

# Protocol-level only (2 findings)
mcp-scan scan --stdio "python3 -m vulnerable_mcp.server"

# Protocol + source (7 findings, all 5 vulns caught)
mcp-scan scan --stdio "python3 -m vulnerable_mcp.server" --source ./vulnerable_mcp

Output formats: terminal (default), JSON (-f json), Markdown (-f markdown). Non-zero exit on findings — drop it into CI as a gate.

What's next

MCPA-020 — Curated MCP dependency CVE matching
MCPA-061 — Markdown image / auto-link exfiltration vector detection
HTTP/SSE transport — --url for remote servers
OAuth 2.1 DCR flow auditing

Why we built this

Security tools that ship without a reference vulnerable target are hard to evaluate. We wanted something you could install, point at a deliberately broken server, and see exactly what it catches — on your own laptop, in 30 seconds, without connecting to anything real.

MCP is early. The default configs are copy-pasted from examples. There's no equivalent of SELinux for tool-use permissions yet. The window to harden these before they become the 2027 supply-chain story is open right now.

Built by Velox Labs — AI Security & Platform Engineering.

Links: