<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Guy</title>
    <description>The latest articles on Forem by Guy (@guyernest).</description>
    <link>https://forem.com/guyernest</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F253942%2F5456be42-99ec-4805-bbf3-99d9ad61feba.jpeg</url>
      <title>Forem: Guy</title>
      <link>https://forem.com/guyernest</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/guyernest"/>
    <language>en</language>
    <item>
      <title>Code Mode for MCP: The Long-Tail Escape Hatch, Not the Front Door</title>
      <dc:creator>Guy</dc:creator>
      <pubDate>Mon, 27 Apr 2026 20:25:45 +0000</pubDate>
      <link>https://forem.com/aws-heroes/code-mode-for-mcp-the-long-tail-escape-hatch-not-the-front-door-40ga</link>
      <guid>https://forem.com/aws-heroes/code-mode-for-mcp-the-long-tail-escape-hatch-not-the-front-door-40ga</guid>
      <description>&lt;p&gt;Your MCP server has 12 good tools. Business users are getting value. Then someone asks for a request you did not package: "Find the top 20 customers whose spend fell for three consecutive months, compare them to support ticket volume, and give me only the accounts that have not been contacted in the last 14 days."&lt;/p&gt;

&lt;p&gt;That question hides a join, a window function, a filter, and a ranking — and your backend is a database that already knows how to run exactly that. The LLM could try to chain five tools, burn tokens on intermediate results, and still get the arithmetic wrong. Or you could let it write the single query the database is designed for, and let the database do the work.&lt;/p&gt;

&lt;p&gt;The wrong direction is to add five more tools, then ten more, then eventually wrap the whole database or API. That is how teams end up with bloated MCP surfaces, low task completion caused by LLM confusion, and a security boundary they do not really understand.&lt;/p&gt;

&lt;p&gt;The right direction is &lt;strong&gt;code mode&lt;/strong&gt;, but only as a controlled extension of the design work you already did in the earlier articles. &lt;strong&gt;Code mode is a long-tail escape hatch for data-system interfaces, not a replacement for curated tools, prompts, and resources.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/paiml/rust-mcp-sdk/blob/main/crates/pmcp-code-mode/README.md" rel="noopener noreferrer"&gt;PMCP SDK&lt;/a&gt;, that boundary is intentionally narrow: two tools (&lt;code&gt;validate_code&lt;/code&gt; and &lt;code&gt;execute_code&lt;/code&gt;), policy evaluation, approval tokens, and optional human approval. The narrowness is the point. It expands coverage without turning the server into an ungoverned remote shell for your database or API estate.&lt;/p&gt;

&lt;p&gt;The recent enthusiasm around code mode is well earned. Cloudflare argues that &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;"LLMs are better at writing code to call MCP"&lt;/a&gt;, and Anthropic shows how code execution can reduce token usage from 150,000 to 2,000 tokens in a Google Drive to Salesforce workflow while improving composition efficiency &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;source&lt;/a&gt;. Anthropic also notes that the benefits &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;"should be weighed against these implementation costs"&lt;/a&gt;, and that weighing is exactly the gap this article addresses. Those posts make the upside clear; this article focuses on the balancing framework they mostly leave implicit: why code mode belongs on top of curated tools, why sandboxing is not the primary security boundary for data systems, how to give the LLM enough power to handle long-tail requests quickly, and who inside the organization actually owns the levers that keep all of that safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MCP? (The 30-Second Version)
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP, &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" rel="noopener noreferrer"&gt;spec 2025-11-25&lt;/a&gt;) defines three primitives for connecting AI models to external services: tools, prompts, and resources. The &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;first article&lt;/a&gt; covered tools, the &lt;a href="https://dev.to/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1"&gt;second article&lt;/a&gt; covered prompts and resources, and the &lt;a href="https://dev.to/aws-heroes/testing-mcp-servers-the-five-gates-between-demo-and-production-2inf"&gt;third article&lt;/a&gt; covered how to test a server that uses all three. This article covers code mode, the controlled execution pattern that extends those three primitives when a request falls outside the scope of any curated tool.&lt;/p&gt;

&lt;p&gt;The enterprise mental model is the same one from the earlier articles: MCP is for AI what HTTP-based applications are for humans. An MCP server is the AI-facing interface to your organization's internal systems, just as a web server or mobile app is the human-facing interface to those same systems. In practice, those internal systems are usually data systems: a SQL database, a REST API described by an OpenAPI schema, or a GraphQL service, and the MCP server is a thin, remote, mostly stateless interface layer in front of them.&lt;/p&gt;

&lt;p&gt;That framing matters here because code mode raises the stakes. If MCP is your AI-facing interface, then code mode is the part of that interface where the client is asking to send a program, query, or execution plan rather than a fixed tool call. You should therefore treat it like any other security-sensitive interface surface: explicit contracts, least privilege, strong policy enforcement, auditability, and safe defaults.&lt;/p&gt;

&lt;p&gt;The operating model from the previous articles still applies: &lt;strong&gt;domain-led, engineering-implemented, platform-governed&lt;/strong&gt;. Code mode, as we will see, is where the "platform-governed" part of that model gets more teeth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Mode Is A Data-System Problem
&lt;/h2&gt;

&lt;p&gt;Most production MCP servers sit in front of one of the following kinds of backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;relational database&lt;/strong&gt; — typically MySQL, PostgreSQL, Oracle, Snowflake, BigQuery, or a cloud warehouse.&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;REST API described by an OpenAPI spec&lt;/strong&gt; — a system of record, a ticketing system, a CRM, a cost and billing service.&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;GraphQL API&lt;/strong&gt; — invented by Facebook to simplify API access for mobile clients, and later grown into a consistent query language for wrapping different APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other backends, such as MongoDB and Elasticsearch, can support a similar code-mode layer; they are not covered in this article.&lt;/p&gt;

&lt;p&gt;These backends already have expressive query languages that solve exactly the problems the LLM is weakest at: joins, aggregations, filters, sorting, windowing, selective field projection, and batched calls. When a business user asks a long-tail analytical question, the right execution engine is almost always the backend itself, not the LLM choreographing a chain of ordinary tool calls.&lt;/p&gt;

&lt;p&gt;That is why PMCP's code mode is organized around three interpreter modes that line up with these three backend shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL code mode&lt;/strong&gt; pushes joins and window functions into the database, where they are cheap and correct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAPI code mode (JavaScript)&lt;/strong&gt; lets the LLM orchestrate several REST calls server-side, with one approval step instead of many tool rounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphQL code mode&lt;/strong&gt; lets the LLM request precisely the fields and edges it needs in one round trip instead of a deep chain of field-by-field reads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all three cases, the MCP server stays thin. It does not become the database or the API. It becomes a validated gateway into them, and code mode is the narrow slot through which a long-tail request passes on its way to the right execution engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Capability Pentagon: A New Corner For Code Mode
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;first article&lt;/a&gt; introduced the &lt;strong&gt;Capability Square&lt;/strong&gt;: four parties around every MCP tool — the &lt;strong&gt;Business Analyst&lt;/strong&gt; who designs the server, the &lt;strong&gt;Business User&lt;/strong&gt; who invokes it, the &lt;strong&gt;LLM&lt;/strong&gt; inside the MCP client that interprets intent, and the &lt;strong&gt;MCP Server&lt;/strong&gt; that executes deterministically. The &lt;a href="https://dev.to/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1"&gt;second article&lt;/a&gt; showed how prompts sit across the two human corners of that square. For code mode, the Square is not quite enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u41gcet3tcxebb56chj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u41gcet3tcxebb56chj.png" alt="The Capability Pentagon" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code mode is not just another primitive. It is an exposed execution surface that grants the LLM more power over your data system than any single tool does. Tools are fixed contracts. Code mode is a narrow programming interface whose allowed behavior depends on continuously maintained policies rather than code baked in at design time. That shift in shape requires a party that was implicit in the earlier articles to become explicit: the &lt;strong&gt;IT Administrator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Pentagon has the four corners you already know, plus one new corner that becomes load-bearing the moment code mode is enabled:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Business Analyst (design-time domain expert).&lt;/strong&gt; Decides which operations are named and exposed, which risk levels can auto-approve, which output fields are blocked, and which roles should see which slice of the data system. Same corner as before; now with code mode–specific choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business User (runtime domain expert).&lt;/strong&gt; Brings the specific question inside the specific business context. Can now ask questions that go beyond the curated tool set, but only within the policy set by the IT Administrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM / MCP Client (runtime intent and code generation).&lt;/strong&gt; Translates the user's request into a bounded SQL statement, JavaScript execution plan, or GraphQL operation. Works from the schema and instructions the server publishes, not from guesses about the backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server (runtime execution boundary).&lt;/strong&gt; Validates the code, classifies the action, computes a risk level, signs an approval token, and executes only the exact validated code against least-privilege backend credentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IT Administrator (continuous governance).&lt;/strong&gt; Owns the policy surface: the Cedar (or AVP, or custom) policies that decide whether a particular user in a particular role may perform a particular action against a particular server configuration. Also owns the signing secret, token lifetime, execution limits, allow and block lists, and the approval workflow. This corner is where security posture is &lt;em&gt;tuned over time&lt;/em&gt; rather than only specified up front.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pentagon Corner&lt;/th&gt;
&lt;th&gt;When They Act&lt;/th&gt;
&lt;th&gt;What They Bring To Code Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Business Analyst&lt;/td&gt;
&lt;td&gt;Design-time&lt;/td&gt;
&lt;td&gt;Which operations are named; which risk levels auto-approve; which fields are blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business User&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;The long-tail request that goes beyond curated tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM / MCP Client&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;The generated SQL, JavaScript, or GraphQL code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Validation, action classification, signing, bounded execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IT Administrator&lt;/td&gt;
&lt;td&gt;Continuous governance&lt;/td&gt;
&lt;td&gt;Cedar/AVP policies, signing secrets, role scoping, approval workflow, audit review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tools never really needed an explicit IT Administrator corner because the allowed behaviors were baked into the tool schema at design time. Code mode is different. Remove the IT Administrator from the picture, and you either lock code mode down so hard that nobody uses it, or you leave it open enough that a single bad policy change becomes a production incident.&lt;/p&gt;

&lt;p&gt;A useful way to read the Pentagon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two &lt;strong&gt;human corners&lt;/strong&gt; (Analyst, User) provide the domain&lt;/li&gt;
&lt;li&gt;The two &lt;strong&gt;machine corners&lt;/strong&gt; (LLM, Server) provide the execution&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;IT Administrator corner&lt;/strong&gt; provides the continuous governance that makes code mode safe to leave on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any corner is weak, code mode fails in a characteristic way. Weak Analyst: the exposed operation surface does not match the real requests. Weak User: nobody triggers anything, and code mode is unused. Weak LLM: generated code drifts outside the schema. Weak Server: validation is a rubber stamp. Weak IT Administrator: the policy is wrong, stale, or the same for every role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Mode Is Additive, Not Foundational
&lt;/h2&gt;

&lt;p&gt;The easiest way to misuse code mode is to treat it as the foundation of the server rather than the long-tail extension. That usually sounds like one of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Why not just expose the whole OpenAPI spec and let the model script against it?"&lt;/li&gt;
&lt;li&gt;"Why not give it direct SQL access and let it figure things out?"&lt;/li&gt;
&lt;li&gt;"Why bother designing tools if code mode can do anything?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the whole point of the earlier design work was to remove unnecessary choice and unnecessary risk. For the common tasks, dedicated tools still win:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They are easier for the LLM to select correctly.&lt;/li&gt;
&lt;li&gt;They are easier for the business analyst to describe in domain language.&lt;/li&gt;
&lt;li&gt;They are easier to test, benchmark, and audit.&lt;/li&gt;
&lt;li&gt;They are easier to secure because the allowed behavior is explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompts still win for repeatable workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They encode recurring business processes once.&lt;/li&gt;
&lt;li&gt;They move deterministic orchestration to the server side.&lt;/li&gt;
&lt;li&gt;They reduce model failure modes on multi-step work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Resources still win for the governed context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They let you publish instructions, schemas, templates, and policy summaries.&lt;/li&gt;
&lt;li&gt;They reduce guessing before the model writes code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code mode exists for what remains after you have done those three things well. That is the design stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Curated tools for the common high-frequency tasks.&lt;/li&gt;
&lt;li&gt;Prompts and resources for known workflows and governed context.&lt;/li&gt;
&lt;li&gt;Code mode for the long tail.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you invert that stack and start with code mode, you are pushing core design responsibility into the runtime behavior of an LLM that does not understand your business, your controls, or your risk tolerance. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is much easier for the LLM to choose a well-defined tool and use it correctly than to write perfect code to achieve the same goal. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Real Threat Model: Assume A Hostile Client
&lt;/h2&gt;

&lt;p&gt;Most discussions of naive code mode assume a cooperative client. That is not a serious production threat model. You should assume the MCP client can be compromised, misconfigured, prompt-injected, or simply wrong. The server must therefore treat every &lt;code&gt;validate_code&lt;/code&gt; request as untrusted input and every &lt;code&gt;execute_code&lt;/code&gt; request as an attempted privileged action.&lt;/p&gt;

&lt;p&gt;This has a practical consequence: &lt;strong&gt;sandboxing is not the primary security boundary for code mode.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A runtime sandbox can be useful as defense in depth for the interpreter itself. But if the code is allowed to run a dangerous database statement or call a destructive API operation, the damage is already authorized at the business-system boundary. No sandbox around the interpreter fixes that. For real data systems, such as databases, OpenAPI-backed services, and GraphQL APIs, the primary controls have to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restricted schema exposure, as designed by the business analyst.&lt;/li&gt;
&lt;li&gt;explicit operation categorization, which can be deduced from the data system's schema syntax (GET vs. PUT, SELECT vs. UPDATE, and so on) and overridden by the business analyst.&lt;/li&gt;
&lt;li&gt;policy evaluation against business actions.&lt;/li&gt;
&lt;li&gt;cryptographic binding between what was validated and what is executed.&lt;/li&gt;
&lt;li&gt;human approval for higher-risk actions, which remains optional because a compromised MCP client can ignore the risk level and call &lt;code&gt;execute_code&lt;/code&gt; without it, so it cannot be the only line of defense.&lt;/li&gt;
&lt;li&gt;least-privilege credentials on the downstream systems, scoped to the business user's OAuth access token permissions, as discussed in the security article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the PMCP design point. Code mode is not "run arbitrary code safely in a sandbox." It is "validate a bounded program against a bounded policy surface, then execute only the exact approved code against bounded downstream permissions." Most of those controls are owned by the IT Administrator's corner of the Pentagon, which is why that corner becomes load-bearing the moment code mode is turned on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PMCP Security Envelope
&lt;/h2&gt;

&lt;p&gt;The PMCP SDK keeps the code mode interface intentionally small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;validate_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;execute_code&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else happens behind those two tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM reads the bounded schema and policy context
        |
        v
validate_code(code)
        |
        +--&amp;gt; parse and classify the code
        +--&amp;gt; analyze data access, complexity, and action type
        +--&amp;gt; evaluate policy (Cedar, AVP, or custom)
        +--&amp;gt; generate business-language explanation
        +--&amp;gt; issue HMAC-signed approval token
        |
        v
optional human approval
        |
        v
execute_code(code, approval_token)
        |
        +--&amp;gt; verify signature
        +--&amp;gt; verify code hash
        +--&amp;gt; verify expiry
        +--&amp;gt; verify user/session/context binding
        +--&amp;gt; execute through the server-side execution layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 1: Two-step execution.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The client cannot jump straight to execution. The PMCP handler exposes &lt;code&gt;validate_code&lt;/code&gt; and &lt;code&gt;execute_code&lt;/code&gt; separately, and the generated tool descriptions explicitly tell the model that &lt;code&gt;validate_code&lt;/code&gt; must come first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Policy evaluation during validation.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The validation pipeline parses the code, classifies the action, analyzes access patterns, and calls the server's authorization backend before any approval token is issued.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Approval tokens bind code to context.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
PMCP uses HMAC-signed approval tokens that bind the validated code hash to user ID, session ID, server ID, context hash, risk level, and expiry time. If the code changes after validation, execution is rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Optional human approval.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Low-risk read operations can be auto-approved. Higher-risk operations can require explicit approval before &lt;code&gt;execute_code&lt;/code&gt; is called.&lt;/p&gt;

&lt;p&gt;Each layer compensates for a different failure mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;policy evaluation answers "should this kind of operation be allowed?"&lt;/li&gt;
&lt;li&gt;token binding answers "is this the exact code that was validated?"&lt;/li&gt;
&lt;li&gt;human approval answers "does this high-risk action have accountable oversight?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these layers is sufficient on its own. Together, they form a credible boundary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why The Token Matters
&lt;/h2&gt;

&lt;p&gt;The approval token turns validation into a verifiable contract. Without token binding, an attacker could submit a harmless, read-only script to &lt;code&gt;validate_code&lt;/code&gt;, obtain a positive validation result for that exact code, then alter the script and call &lt;code&gt;execute_code&lt;/code&gt; with the modified version.&lt;/p&gt;

&lt;p&gt;PMCP prevents that by hashing the canonicalized code during validation and embedding that hash in the token. During execution, the code is hashed again and compared to the token's hash. The token also expires quickly by default and is tied to the user, session, and validation context.&lt;/p&gt;

&lt;p&gt;In the SDK, the token binds at least these elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code hash&lt;/li&gt;
&lt;li&gt;user ID&lt;/li&gt;
&lt;li&gt;session ID&lt;/li&gt;
&lt;li&gt;server ID&lt;/li&gt;
&lt;li&gt;schema and permissions context hash&lt;/li&gt;
&lt;li&gt;risk level&lt;/li&gt;
&lt;li&gt;creation time and expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why code mode in PMCP is safer than a single "run_query" or "run_script" tool. The server does not trust the client to faithfully carry validation state forward.&lt;/p&gt;
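<p>To make that binding concrete, here is a minimal Rust sketch of HMAC-based approval-token signing and verification, using the <code>hmac</code> and <code>sha2</code> crates. The payload layout and function names are assumptions for illustration, not the PMCP SDK's actual token format:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative sketch, not the PMCP SDK's actual token format.
// Requires the `hmac` and `sha2` crates.
use hmac::{Hmac, Mac};
use sha2::{Digest, Sha256};

type HmacSha256 = Hmac&lt;Sha256&gt;;

// Hypothetical payload: the code hash plus contextual bindings.
fn token_payload(code: &amp;str, user: &amp;str, session: &amp;str, expiry: u64) -&gt; String {
    let code_hash: String = Sha256::digest(code.as_bytes())
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect();
    format!("{code_hash}|{user}|{session}|{expiry}")
}

// Validation time: sign the payload with the server-held secret.
fn sign(secret: &amp;[u8], payload: &amp;str) -&gt; Vec&lt;u8&gt; {
    let mut mac = HmacSha256::new_from_slice(secret).expect("HMAC accepts any key length");
    mac.update(payload.as_bytes());
    mac.finalize().into_bytes().to_vec()
}

// Execution time: rebuild the payload from the *submitted* code and
// context, then verify. Any change to the code changes the hash, and
// the signature check fails.
fn verify(secret: &amp;[u8], payload: &amp;str, tag: &amp;[u8]) -&gt; bool {
    let mut mac = HmacSha256::new_from_slice(secret).expect("HMAC accepts any key length");
    mac.update(payload.as_bytes());
    mac.verify_slice(tag).is_ok()
}

fn main() {
    let secret = b"server-held signing secret";
    let payload = token_payload("SELECT 1", "user-42", "sess-7", 1_700_000_300);
    let tag = sign(secret, &amp;payload);
    assert!(verify(secret, &amp;payload, &amp;tag));
    // A tampered script produces a different payload, so the old tag fails.
    let tampered = token_payload("DELETE FROM accounts", "user-42", "sess-7", 1_700_000_300);
    assert!(!verify(secret, &amp;tampered, &amp;tag));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

<p>The important property is that the server, not the client, recomputes the code hash at execution time; the client only carries an opaque token it cannot usefully modify.</p>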
&lt;h2&gt;
  
  
  Policies Must Be About Business Actions, Not Just Syntax
&lt;/h2&gt;

&lt;p&gt;One of the strongest parts of the PMCP design is the move away from language-specific permission models and toward a unified action model: &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Delete&lt;/code&gt;, and &lt;code&gt;Admin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is a better abstraction than "GraphQL query vs mutation," "GET vs POST," or "SELECT vs UPDATE" because it maps directly to how the IT Administrator actually thinks about risk. A Cedar policy written in these four terms is something an administrator can read and defend. A policy written in HTTP verbs, SQL keywords, and GraphQL operation types quickly becomes something only the original author understands.&lt;/p&gt;

&lt;p&gt;The PMCP SDK infers these actions from the underlying code: GraphQL queries map to reads; mutations map to writes or deletes; HTTP methods map to reads, writes, or deletes; and SQL statements map to reads, writes, deletes, or admin operations. Whichever data system sits behind the server, the IT Administrator writes policies using the same four verbs.&lt;/p&gt;
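<p>As a concrete illustration of that inference, here is a minimal Rust sketch of a syntax-driven classifier. The names and keyword lists are assumptions for the example, not the PMCP SDK's actual implementation:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative only: map backend-specific syntax onto the unified
// Read/Write/Delete/Admin action model that policies are written in.
#[derive(Debug, PartialEq)]
enum Action { Read, Write, Delete, Admin }

fn classify_sql(statement: &amp;str) -&gt; Action {
    let first = statement
        .trim_start()
        .split_whitespace()
        .next()
        .unwrap_or("")
        .to_ascii_uppercase();
    match first.as_str() {
        "SELECT" | "SHOW" | "EXPLAIN" =&gt; Action::Read,
        "INSERT" | "UPDATE" =&gt; Action::Write,
        "DELETE" | "TRUNCATE" =&gt; Action::Delete,
        // DDL and permission changes are admin-grade.
        _ =&gt; Action::Admin,
    }
}

fn classify_http(method: &amp;str) -&gt; Action {
    match method.to_ascii_uppercase().as_str() {
        "GET" | "HEAD" | "OPTIONS" =&gt; Action::Read,
        "POST" | "PUT" | "PATCH" =&gt; Action::Write,
        "DELETE" =&gt; Action::Delete,
        _ =&gt; Action::Admin,
    }
}

fn main() {
    assert_eq!(classify_sql("select * from accounts"), Action::Read);
    assert_eq!(classify_http("DELETE"), Action::Delete);
    // One policy vocabulary now covers both backends.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;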

&lt;p&gt;This gives you a portable permission model across server types and supports a staged rollout strategy that fits real organizations. It also maps cleanly to role-based access inside the organization. Standard users might be limited to read-only code mode. Power users might get access to a narrow write allowlist for the workflows they own. Administrators might be allowed to run broader maintenance or approval-oriented actions. The important point is that code mode does not have to be one global policy for every user. The same MCP server can apply different policy decisions based on who is asking and what role they hold — and the IT Administrator is the corner of the Pentagon that, in practice, decides where those lines sit.&lt;/p&gt;

&lt;p&gt;In practice, a rollout often looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allow reads for everyone.&lt;/li&gt;
&lt;li&gt;Deny writes, deletes, and admin by default.&lt;/li&gt;
&lt;li&gt;Observe how business users actually use code mode.&lt;/li&gt;
&lt;li&gt;Add a small write allowlist for specific roles where the value is high and the risk is acceptable.&lt;/li&gt;
&lt;li&gt;Keep deletes and admin heavily restricted or fully denied unless there is a compelling reason and a clearly accountable role that should hold them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is much better than starting from full access and trying to claw it back later. It is also the shape of a rollout that the IT Administrator can actually defend to an auditor: writes, deletes, and admin were denied by default and opened only against documented business need.&lt;/p&gt;

&lt;p&gt;In practice, the useful question for an MCP server author is not "which Rust types do I instantiate?" but "what code mode surface do I expose?" A good pattern used in our &lt;code&gt;cost-coach&lt;/code&gt; server is to define that surface explicitly in the server config: enable code mode, set execution limits, and declare the operations that scripts are allowed to call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[code_mode]&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;server_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cost-coach"&lt;/span&gt;
&lt;span class="py"&gt;openapi_allow_writes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;token_ttl_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="py"&gt;max_api_calls&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;max_loop_iterations&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="py"&gt;execution_timeout_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;span class="py"&gt;auto_approve_levels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[[code_mode.operations]]&lt;/span&gt;
&lt;span class="py"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getCostAndUsage"&lt;/span&gt;
&lt;span class="py"&gt;category&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"read"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Historical cost and usage data by service, region, tag, or account"&lt;/span&gt;
&lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/getCostAndUsage"&lt;/span&gt;

&lt;span class="nn"&gt;[[code_mode.operations]]&lt;/span&gt;
&lt;span class="py"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getCostForecast"&lt;/span&gt;
&lt;span class="py"&gt;category&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"read"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Forecast future costs with confidence intervals"&lt;/span&gt;
&lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/getCostForecast"&lt;/span&gt;

&lt;span class="nn"&gt;[[code_mode.operations]]&lt;/span&gt;
&lt;span class="py"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getCostAnomalies"&lt;/span&gt;
&lt;span class="py"&gt;category&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"read"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Detect unusual spending patterns via Cost Anomaly Detection"&lt;/span&gt;
&lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/getCostAnomalies"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the level where most server authors should think. Which actions are enabled? Which operations are named and exposed? What limits apply to a script? Which risk levels can be auto-approved? The PMCP SDK then turns those declarations into the validation and policy boundary.&lt;/p&gt;

&lt;p&gt;In the same &lt;code&gt;cost-coach&lt;/code&gt; server, code mode is also paired with a &lt;code&gt;start_code_mode&lt;/code&gt; prompt that loads the schema into context before the model writes JavaScript. That is an important design point: even with code mode, you still want to give the client bounded instructions and a bounded schema rather than expecting it to infer the server's code surface from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqvh8bnlrpwxkuki66p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqvh8bnlrpwxkuki66p0.png" alt="Cost Coach Start Code Mode Prompt" width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the governance story you want in production: start narrow, expand deliberately, and keep policy changes observable.&lt;/p&gt;

&lt;p&gt;If you are using Cedar with PMCP today, the key design choice is who serves as the Cedar &lt;strong&gt;principal&lt;/strong&gt;. In classic authorization terms, the principal is the actor: the entity whose permissions are being evaluated. For code mode, that actor is the &lt;strong&gt;authenticated user&lt;/strong&gt; (or the group they belong to), not the generated script. The script is something the user &lt;em&gt;produced&lt;/em&gt;; it is not itself the party being authorized.&lt;/p&gt;

&lt;p&gt;The cleanest model for code mode Cedar policies, therefore, looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;principal&lt;/strong&gt; = the authenticated user, with group memberships like &lt;code&gt;Admins&lt;/code&gt;, &lt;code&gt;PowerUsers&lt;/code&gt;, or &lt;code&gt;StandardUsers&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;resource&lt;/strong&gt; = the &lt;code&gt;CodeMode::Server&lt;/code&gt; configuration being accessed (&lt;code&gt;cost-coach&lt;/code&gt;, &lt;code&gt;cost-coach:power-user&lt;/code&gt;, &lt;code&gt;cost-coach:administrator&lt;/code&gt;, and so on)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;action&lt;/strong&gt; = the unified business action (&lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Delete&lt;/code&gt;, &lt;code&gt;Admin&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context&lt;/strong&gt; = the structural facts about the generated code (called operations, accessed fields, statement type, estimated cost, output fields)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With that shape, role-based policies read naturally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Anyone in the StandardUsers group can read
permit (
    principal in CodeMode::Group::"StandardUsers",
    action == CodeMode::Action::"Read",
    resource
);

// Power users can perform narrow writes against named operations only
permit (
    principal in CodeMode::Group::"PowerUsers",
    action == CodeMode::Action::"Write",
    resource
)
when {
    resource.serverId == "cost-coach" &amp;amp;&amp;amp;
    resource.allowWrite == true &amp;amp;&amp;amp;
    context has script &amp;amp;&amp;amp;
    context.script.calledOperations.contains("update_budget_note") &amp;amp;&amp;amp;
    resource.allowedOperations.contains("update_budget_note")
};

// Only members of the Admins group may run admin-grade code mode actions
permit (
    principal in CodeMode::Group::"Admins",
    action == CodeMode::Action::"Admin",
    resource
)
when {
    resource.serverId == "cost-coach" &amp;amp;&amp;amp;
    resource.allowAdmin == true
};

// Regardless of group, never let a script leak blocked output fields
forbid (
    principal,
    action,
    resource
)
when {
    context has script &amp;amp;&amp;amp;
    context.script.outputFields.containsAny(resource.outputBlockedFields)
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That reads the way an IT Administrator thinks about authorization: &lt;em&gt;"members of the Admins group can perform Admin actions against this server, as long as the server allows it."&lt;/em&gt; The code-artifact details (called operations, accessed fields, output fields) live in &lt;code&gt;context&lt;/code&gt; because they describe the request, not the actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cedar Is A Good Fit Because The Problem Is Authorization
&lt;/h2&gt;

&lt;p&gt;PMCP's use of &lt;a href="https://docs.cedarpolicy.com/" rel="noopener noreferrer"&gt;Cedar&lt;/a&gt; is not incidental. Cedar is a policy language designed for authorization, and &lt;a href="https://docs.aws.amazon.com/verifiedpermissions/latest/userguide/terminology.html" rel="noopener noreferrer"&gt;Amazon Verified Permissions&lt;/a&gt; uses Cedar as its policy foundation. That matters for code mode because the real question is not "can I parse this SQL?" or "can I scan this JavaScript?" The real question is "should this principal be allowed to perform this action on this resource in this context?"&lt;/p&gt;

&lt;p&gt;The PMCP SDK supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local Cedar evaluation through its built-in Cedar integration&lt;/li&gt;
&lt;li&gt;cloud-backed policy evaluation through AVP integrations&lt;/li&gt;
&lt;li&gt;custom policy backends for teams that need a different authorization layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you two important properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;auditability&lt;/strong&gt;: policy changes become explicit artifacts rather than hidden branches in application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;portability&lt;/strong&gt;: the same read/write/delete/admin model can be evaluated locally in Rust or through a managed authorization system when deployed on AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also a pragmatic Rust point here. PMCP is a Rust SDK, and Cedar is also implemented as a Rust library. That improves integration quality, performance, and operational simplicity compared to bolting an unrelated policy engine onto the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  What The IT Administrator Actually Controls
&lt;/h2&gt;

&lt;p&gt;At this point, it is worth naming, in one place, what the IT Administrator corner of the Pentagon actually touches on a day-to-day basis. If you are an administrator adopting a PMCP code mode server, these are the levers you will own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Policy set.&lt;/strong&gt; The Cedar policies (or AVP policy store) that decide which actions are permitted for which roles against which server configurations. This is where the role-specific rules — standard user, power user, administrator — are codified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-to-server routing.&lt;/strong&gt; Which &lt;code&gt;Server&lt;/code&gt; configuration and policy store each authenticated role is evaluated against. This is how you express "power users use &lt;code&gt;cost-coach:power-user&lt;/code&gt;, administrators use &lt;code&gt;cost-coach:administrator&lt;/code&gt;" without duplicating logic in application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operation allowlist.&lt;/strong&gt; The named operations that scripts are allowed to call at all. Adding an operation is a deliberate change, not a side effect of adding a tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocked output fields.&lt;/strong&gt; The fields that must never appear in a script's response, regardless of which operation produced them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution limits.&lt;/strong&gt; &lt;code&gt;max_api_calls&lt;/code&gt;, &lt;code&gt;max_loop_iterations&lt;/code&gt;, &lt;code&gt;execution_timeout_seconds&lt;/code&gt;, &lt;code&gt;token_ttl_seconds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-approval thresholds.&lt;/strong&gt; The risk levels that skip human approval and those that require it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signing secret.&lt;/strong&gt; The HMAC secret used to bind approval tokens. Managed like any other production credential, rotated on a schedule, and never shared with the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail.&lt;/strong&gt; The logs of what was validated, what was approved, what was executed, and by whom. This is what turns code mode from a trust statement into an auditable one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not "developer knobs left over after the server ships." They are the continuously administered controls that determine whether code mode stays within the security/administration/power balance described above. The Business Analyst chooses the shape of the surface; the IT Administrator chooses how tight that surface stays over time.&lt;/p&gt;
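<p>To show how mechanical some of these levers are, here is a minimal Rust sketch of an execution-limits guard. The structure and names are assumptions for illustration, not the PMCP SDK's API:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative only: enforce config limits such as max_api_calls and
// execution_timeout_seconds around a validated script's execution.
use std::time::{Duration, Instant};

struct Limits {
    max_api_calls: u32,
    execution_timeout: Duration,
}

struct ExecutionGuard {
    limits: Limits,
    api_calls: u32,
    started: Instant,
}

impl ExecutionGuard {
    fn new(limits: Limits) -&gt; Self {
        Self { limits, api_calls: 0, started: Instant::now() }
    }

    // Called before each downstream call a validated script makes.
    fn check_api_call(&amp;mut self) -&gt; Result&lt;(), String&gt; {
        if self.started.elapsed() &gt; self.limits.execution_timeout {
            return Err("execution_timeout_seconds exceeded".into());
        }
        self.api_calls += 1;
        if self.api_calls &gt; self.limits.max_api_calls {
            return Err("max_api_calls exceeded".into());
        }
        Ok(())
    }
}

fn main() {
    let mut guard = ExecutionGuard::new(Limits {
        max_api_calls: 10,
        execution_timeout: Duration::from_secs(30),
    });
    // The 11th call is rejected, mirroring `max_api_calls = 10`.
    for _ in 0..10 {
        assert!(guard.check_api_call().is_ok());
    }
    assert!(guard.check_api_call().is_err());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;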

&lt;h2&gt;
  
  
  Field Privacy Control
&lt;/h2&gt;

&lt;p&gt;We briefly mentioned &lt;em&gt;blocked output fields&lt;/em&gt; above. Free-form queries and scripts generated on the fly by a potentially compromised LLM are one of the main privacy risks of code mode. But the problem is not only preventing the code from &lt;em&gt;using&lt;/em&gt; sensitive fields such as SSNs, credit card numbers, or other privacy-enforced values. Internal and external IDs (such as SSNs) are often exactly what makes JOINs between relational tables possible, so we do not want to block the LLM from using them in its queries and scripts; we want to block them from appearing in the query results.&lt;/p&gt;
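<p>A minimal Rust sketch of that output-side enforcement, using <code>serde_json</code>. The function name is an assumption for illustration, not the PMCP implementation: the generated query is free to JOIN on <code>ssn</code>, but the field is stripped from the result before it reaches the client:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative only: recursively remove blocked fields from a
// script's JSON result before returning it to the MCP client.
use serde_json::Value;

fn strip_blocked(value: &amp;mut Value, blocked: &amp;[&amp;str]) {
    match value {
        Value::Object(map) =&gt; {
            for field in blocked {
                let _ = map.remove(*field);
            }
            for (_, v) in map.iter_mut() {
                strip_blocked(v, blocked);
            }
        }
        Value::Array(items) =&gt; {
            for item in items.iter_mut() {
                strip_blocked(item, blocked);
            }
        }
        _ =&gt; {}
    }
}

fn main() {
    let mut result = serde_json::json!([
        { "name": "Dana", "ssn": "123-45-6789",
          "orders": [ { "total": 410.0, "ssn": "123-45-6789" } ] }
    ]);
    // Same list as `openapi_output_blocked_fields` in the config below.
    strip_blocked(&amp;mut result, &amp;["ssn", "password", "salary"]);
    assert_eq!(result[0].get("ssn"), None);
    assert_eq!(result[0]["orders"][0].get("ssn"), None);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;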

&lt;h2&gt;
  
  
  Adding Code Mode To A PMCP Server
&lt;/h2&gt;

&lt;p&gt;The PMCP SDK keeps the integration path small on purpose, but for a server author, the important question is not how the SDK wires itself internally. The important question is what you need to define in your server so code mode has a safe, understandable shape that both the Business Analyst and the IT Administrator can work with.&lt;/p&gt;

&lt;p&gt;In practice, that usually means four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define the code mode surface in config.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The Business Analyst decides which operations exist, which categories they belong to, what execution limits apply, and which risk levels can be auto-approved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Give the model a bounded starting point.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Expose a prompt like &lt;code&gt;start_code_mode&lt;/code&gt; that loads the code mode schema and instructions into context before the model writes a script or query. This is the same prompt/resource pattern from the &lt;a href="https://dev.to/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1"&gt;previous article&lt;/a&gt;, applied to code mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bind validation to real identity.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The approval path should be scoped to the authenticated user, the current session, and the current permission state. Do not treat code mode as anonymous. This is how the IT Administrator's role-based policies actually take effect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a real policy backend and a real signing secret.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
In production, code mode should sit behind Cedar, AVP, or another real authorization layer, and the approval tokens should be signed with a real secret managed like any other production credential. Both are IT Administrator concerns, not developer defaults.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is exactly the pattern used in &lt;code&gt;cost-coach&lt;/code&gt;: the config defines the allowed operations, a &lt;code&gt;start_code_mode&lt;/code&gt; prompt loads the schema into the context, the validation flow is bound to the authenticated request, and the server exposes only the two code-mode tools rather than a sprawling, ad hoc scripting surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Read-Only First
&lt;/h2&gt;

&lt;p&gt;The best first production rollout of code mode is usually read-only.&lt;/p&gt;

&lt;p&gt;For an OpenAPI-backed server, that can be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[code_mode]&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;openapi_reads_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;openapi_allow_writes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;openapi_allow_deletes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;token_ttl_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;

&lt;span class="c"&gt;# Fields that must never appear in script output&lt;/span&gt;
&lt;span class="py"&gt;openapi_output_blocked_fields&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ssn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That configuration is not the whole security story, but it is the right starting posture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads enabled&lt;/li&gt;
&lt;li&gt;writes disabled&lt;/li&gt;
&lt;li&gt;deletes disabled&lt;/li&gt;
&lt;li&gt;sensitive output fields blocked&lt;/li&gt;
&lt;li&gt;short token lifetime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can then move to targeted allowlists once you understand actual usage and risk. Start with the smallest useful surface and expand only where the observed value justifies the additional risk. In the Pentagon diagram, this is where the IT Administrator and the Business Analyst iterate together: the analyst sees which real requests fall through the cracks, and the administrator decides which of those can be opened without breaking the security/administration/power balance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Mode Still Needs Resources
&lt;/h2&gt;

&lt;p&gt;One of the easiest mistakes is to think code mode makes prompts and resources irrelevant. It does not. If the model is going to write code, give it a bounded context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a derived code mode schema rather than the full backend schema.&lt;/li&gt;
&lt;li&gt;instructions describing the expected style of code.&lt;/li&gt;
&lt;li&gt;a summary of policy boundaries and prohibited operations.&lt;/li&gt;
&lt;li&gt;examples of safe query patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The PMCP code mode crate includes standard resource URIs for this purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;code-mode://instructions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;code-mode://policies&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ties directly back to the previous article on prompts and resources. Code mode works better when the LLM is not guessing about the available operations or the policy boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Is The Other Reason To Add Code Mode
&lt;/h2&gt;

&lt;p&gt;So far, we have focused on security and governance. But code mode also exists because it can be materially faster and cheaper than forcing the client model to orchestrate a long sequence of ordinary tool calls. The gains come from three places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer round-trips between client and server.&lt;/li&gt;
&lt;li&gt;fewer intermediate results sent back into the model context.&lt;/li&gt;
&lt;li&gt;less orchestration burden on the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because multi-step tool chaining compounds cost and error probability. Every step adds latency, tokens, and another chance for the model to get lost. Code mode can collapse that choreography into one server-side program.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Example: One Statement Instead Of Five Calls
&lt;/h3&gt;

&lt;p&gt;Suppose the user asks:&lt;/p&gt;

&lt;p&gt;"Show me the five sales reps whose accounts had the steepest revenue drop this quarter, but only if their customers opened more than three support tickets in the same period."&lt;/p&gt;

&lt;p&gt;With ordinary tools, the model might need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a revenue query&lt;/li&gt;
&lt;li&gt;a support ticket query&lt;/li&gt;
&lt;li&gt;a customer JOIN step&lt;/li&gt;
&lt;li&gt;a ranking step&lt;/li&gt;
&lt;li&gt;a formatting step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With SQL code mode, the model can ask the database directly for the final shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;rep_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quarterly_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quarterly_revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;previous_quarter_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ticket_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quarter&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;account_quarterly_metrics&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-Q1'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;previous_quarter_revenue&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ticket_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quarterly_revenue&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;previous_quarter_revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point is not that SQL is magical. The point is that the database is the right execution engine for joins, ranking, filtering, and window functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAPI Example: Bind, Fan Out, And Shape Server-Side
&lt;/h3&gt;

&lt;p&gt;The same logic applies to OpenAPI-backed servers that support JavaScript-based execution plans.&lt;/p&gt;

&lt;p&gt;Suppose the user asks:&lt;/p&gt;

&lt;p&gt;"For the top ten budgets that are forecast to exceed target, fetch the owner details and return only name, team, amount, and forecast delta."&lt;/p&gt;

&lt;p&gt;Without code mode, the model may need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;fetch budgets&lt;/li&gt;
&lt;li&gt;select the top ten&lt;/li&gt;
&lt;li&gt;loop over them&lt;/li&gt;
&lt;li&gt;fetch each owner&lt;/li&gt;
&lt;li&gt;combine the responses&lt;/li&gt;
&lt;li&gt;shape the output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With JavaScript code mode, that becomes one plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;budgets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/budgets/listForecasts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;month&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;budgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;owners&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;top&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/users/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ownerId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;owners&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;owners&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;forecast_delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly the sort of long-tail request that is awkward to capture as one dedicated tool but inefficient to force through a half-dozen round-trip requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  GraphQL Example: One Operation Instead Of A Field-by-Field Walk
&lt;/h3&gt;

&lt;p&gt;The same pattern applies to GraphQL-backed servers, where the LLM can describe the exact shape of the data it wants in a single operation rather than a chain of lookups.&lt;/p&gt;

&lt;p&gt;Suppose the user asks:&lt;/p&gt;

&lt;p&gt;"For the three most recently onboarded customers, give me their name, segment, their last two orders with total, and any open support tickets."&lt;/p&gt;

&lt;p&gt;With ordinary tools, the model might need one call to list recent customers, then, for each customer, a call to list orders, a call to fetch order totals, and a call to fetch open tickets. That is roughly ten round-trips for three customers, each feeding back into the model context.&lt;/p&gt;

&lt;p&gt;With GraphQL code mode, that collapses into one operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RecentCustomersSnapshot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DESC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;placedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DESC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;placedAt&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;supportTickets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OPEN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP server validates this operation against the GraphQL Code Mode policy: the principal is &lt;code&gt;CodeMode::Operation&lt;/code&gt; with attributes like &lt;code&gt;operationName&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;, and &lt;code&gt;accessedFields&lt;/code&gt;; the resource is the server configuration; the action is &lt;code&gt;Read&lt;/code&gt;. Blocked fields never leave the server. One authenticated request, one validated operation, one shaped response.&lt;/p&gt;
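&lt;p&gt;To make that concrete, here is a minimal sketch of what such a rule could look like in Cedar, shown as a Rust string constant the server might load. The entity type and attribute names (&lt;code&gt;CodeMode::Operation&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;, &lt;code&gt;accessedFields&lt;/code&gt;) follow the description above; the depth limit and the blocked field path are illustrative assumptions, not values taken from the PMCP SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative only: a read-only GraphQL code-mode rule of the shape
// described above. The field path and depth limit are made-up examples.
const READ_ONLY_GRAPHQL_POLICY: &amp;str = r#"
permit (
    principal is CodeMode::Operation,
    action == Action::"Read",
    resource
)
when {
    principal.depth &lt;= 4 &amp;&amp;
    !principal.accessedFields.contains("Customer.creditCardNumber")
};
"#;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because Cedar is deny-by-default, any operation that does not match a &lt;code&gt;permit&lt;/code&gt; rule like this one is rejected before execution.&lt;/p&gt;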

&lt;p&gt;That single-request, single-validation pattern is why code mode improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency, because one server-side operation replaces a chain of client round-trips.&lt;/li&gt;
&lt;li&gt;token efficiency, because the LLM generates one request and processes a single response.&lt;/li&gt;
&lt;li&gt;task coverage, because most LLMs are better at writing working code snippets than at orchestrating complex multi-step workflows.&lt;/li&gt;
&lt;li&gt;reliability on complex long-tail requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Across all data-system shapes (SQL, OpenAPI, GraphQL, for example), the story is the same: move the computation into the execution engine built for it, and let the MCP server remain a thin, validated gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design Rule: Let The Best Engine Do The Work
&lt;/h2&gt;

&lt;p&gt;This is the same theme that ran through the previous articles:&lt;/p&gt;

&lt;p&gt;Let the server do deterministic work.&lt;br&gt;&lt;br&gt;
Let the database do database work.&lt;br&gt;&lt;br&gt;
Let the API gateway do API work.&lt;br&gt;&lt;br&gt;
Let the authorization engine do authorization work.&lt;br&gt;&lt;br&gt;
Let the LLM do language work.&lt;/p&gt;

&lt;p&gt;Code mode is good when it moves computation into the right execution engine without expanding the exposed capability surface too far. It is bad when it becomes a shortcut around careful interface design.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Rollout Playbook
&lt;/h2&gt;

&lt;p&gt;For most teams, the safest way to add code mode is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design the normal tools first.&lt;/li&gt;
&lt;li&gt;Add prompts and resources for the known workflows.&lt;/li&gt;
&lt;li&gt;Measure the requests that still fall through the cracks.&lt;/li&gt;
&lt;li&gt;Add code mode in read-only mode.&lt;/li&gt;
&lt;li&gt;Bind validation to real user, session, schema, and permission context.&lt;/li&gt;
&lt;li&gt;Use Cedar or AVP-backed policy evaluation from day one.&lt;/li&gt;
&lt;li&gt;Auto-approve only low-risk reads.&lt;/li&gt;
&lt;li&gt;Require human approval for writes, deletes, and admin-level actions (a minimal routing sketch follows this list).&lt;/li&gt;
&lt;li&gt;Expand the allowlists only when real usage data justifies it.&lt;/li&gt;
&lt;/ol&gt;
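
&lt;p&gt;Steps 7 and 8 are where most of the judgment lives. Here is a minimal sketch of that routing decision, assuming the unified business actions described later in this article; the names are illustrative, not the PMCP SDK API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;#[derive(Debug, Clone, Copy)]
enum BusinessAction { Read, Write, Delete, Admin }

#[derive(Debug, Clone, Copy)]
enum Routing { AutoApprove, RequireHumanApproval, Deny }

fn route(action: BusinessAction, low_risk: bool) -&gt; Routing {
    match action {
        // Step 7: only low-risk reads skip the human.
        BusinessAction::Read if low_risk =&gt; Routing::AutoApprove,
        BusinessAction::Read =&gt; Routing::RequireHumanApproval,
        // Step 8: anything that mutates state needs a human in the loop.
        BusinessAction::Write | BusinessAction::Delete =&gt; Routing::RequireHumanApproval,
        // Admin stays denied until a policy explicitly allows it.
        BusinessAction::Admin =&gt; Routing::Deny,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
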

&lt;p&gt;This is the operating model that keeps code mode from becoming an uncontrolled second API. It also keeps the responsibilities of the Pentagon clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;Business Analyst&lt;/strong&gt; decides what should stay as tools versus move to code mode, and which operations are worth naming.&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;Business User&lt;/strong&gt; brings the long-tail requests that guide where the surface should expand next.&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;LLM / MCP Client&lt;/strong&gt; generates the SQL, JavaScript, or GraphQL code against the bounded schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering and the MCP Server&lt;/strong&gt; implement the validation, signing, and execution boundary.&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;IT Administrator&lt;/strong&gt; governs the Cedar / AVP policies, signing secrets, role routing, approval thresholds, and audit trails — continuously, not just at launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When that five-way handoff is working, code mode becomes a governed part of the platform instead of an ad hoc backdoor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code mode is a data-system problem, not a generic "let the LLM code" problem.&lt;/strong&gt; The point is to speak the native query language of the database, OpenAPI-backed service, or GraphQL API behind the MCP server. That is where the LLM and the backend both get real leverage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code mode extends the Capability Square into a Capability Pentagon.&lt;/strong&gt; The four original corners (Business Analyst, Business User, LLM/MCP Client, and MCP Server) are joined by the &lt;strong&gt;IT Administrator&lt;/strong&gt;, who owns the continuously administered security policies that keep code mode safe over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design for the three-way balance: security, administration, and LLM power.&lt;/strong&gt; Lock down too hard and nobody uses code mode; open up too wide, and it becomes an ungoverned remote shell; skimp on administration, and you cannot tell a standard user from an admin. PMCP's validate/execute split, role-aware Cedar policies, and bounded schemas are there to keep all three dials workable at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code mode is additive, not foundational.&lt;/strong&gt; Use curated tools, prompts, and resources for the common requests. Use code mode for the remaining long tail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not treat code mode as arbitrary backend access.&lt;/strong&gt; Assume the client can be hostile or compromised. The server must enforce policy at the business-system boundary, not just inside a sandboxed interpreter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The PMCP SDK secures code mode through layers.&lt;/strong&gt; &lt;code&gt;validate_code&lt;/code&gt;, &lt;code&gt;execute_code&lt;/code&gt;, policy evaluation, approval tokens, and optional human approval each address a different failure mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The approval token is a core security primitive.&lt;/strong&gt; It binds the validated code to the user, session, server, context, risk level, and expiry, preventing post-validation code substitution (a minimal sketch follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permission design should be based on unified business actions.&lt;/strong&gt; Read, Write, Delete, and Admin are the right categories for governing code mode across SQL, OpenAPI, and GraphQL servers — and the right vocabulary for the IT Administrator's policies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cedar is a strong fit because this is an authorization problem.&lt;/strong&gt; Local Cedar evaluation and AVP-backed evaluation both give you a policy system that administrators can reason about and audit over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start read-only.&lt;/strong&gt; Deny writes, deletes, and admin actions first, then open narrowly scoped allowlists only after observing real usage and evaluating the risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code mode also improves performance.&lt;/strong&gt; For long-tail analytical requests, a server-side query or execution plan often beats multi-step tool chaining on latency, tokens, and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
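
&lt;p&gt;Takeaway 7 deserves one concrete illustration. A minimal sketch of such a binding, assuming an HMAC over length-prefixed fields; the actual PMCP token format may bind more context and encode it differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac&lt;Sha256&gt;;

// Hypothetical sketch: the tag covers the hash of the *validated* code,
// so substituting different code after validation invalidates the token.
fn approval_tag(
    key: &amp;[u8],
    code_sha256: &amp;[u8; 32],
    user: &amp;str,
    session: &amp;str,
    risk_level: &amp;str,
    expires_unix: u64,
) -&gt; Vec&lt;u8&gt; {
    let mut mac = HmacSha256::new_from_slice(key)
        .expect("HMAC accepts keys of any length");
    mac.update(code_sha256);
    for field in [user.as_bytes(), session.as_bytes(), risk_level.as_bytes()] {
        // Length-prefix each field so concatenation cannot be ambiguous.
        mac.update(&amp;(field.len() as u64).to_be_bytes());
        mac.update(field);
    }
    mac.update(&amp;expires_unix.to_be_bytes());
    mac.finalize().into_bytes().to_vec()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
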

&lt;h2&gt;
  
  
  Continue the Series
&lt;/h2&gt;

&lt;p&gt;This article covered code mode as the controlled long-tail mechanism in a well-designed MCP server, and the IT Administrator's corner of the Pentagon that keeps it governed over time. The rest of the series goes deeper.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need the foundation first?&lt;/strong&gt; Read &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;Tool Design&lt;/a&gt; for the Capability Square, outcome-oriented tools, and why fewer tools perform better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need workflow support?&lt;/strong&gt; Read &lt;a href="https://dev.to/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1"&gt;Prompts and Resources&lt;/a&gt; for the control-plane model and hybrid execution patterns that code mode builds on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need production validation?&lt;/strong&gt; Read &lt;a href="https://dev.to/aws-heroes/testing-mcp-servers-the-five-gates-between-demo-and-production-2inf"&gt;Testing MCP Servers&lt;/a&gt; for the five-gate testing lifecycle, including explicit tests for &lt;code&gt;validate_code&lt;/code&gt; and &lt;code&gt;execute_code&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concerned about security architecture?&lt;/strong&gt; &lt;strong&gt;MCP Security&lt;/strong&gt; covers authn, authz, secret handling, and the threat model behind these controls in more depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need long-running execution?&lt;/strong&gt; &lt;strong&gt;Tasks for MCP&lt;/strong&gt; covers the explicit task lifecycle for work that should not happen in a single request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For hands-on practice with these patterns, the &lt;a href="https://advanced-mcp-course.us-east.true-mcp.com/landing" rel="noopener noreferrer"&gt;Advanced MCP course&lt;/a&gt; provides guided exercises building production MCP servers in Rust with the PMCP SDK.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llm</category>
      <category>rust</category>
    </item>
    <item>
      <title>Testing MCP Servers: The Five Gates Between Demo and Production</title>
      <dc:creator>Guy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 00:40:31 +0000</pubDate>
      <link>https://forem.com/aws-heroes/testing-mcp-servers-the-five-gates-between-demo-and-production-2inf</link>
      <guid>https://forem.com/aws-heroes/testing-mcp-servers-the-five-gates-between-demo-and-production-2inf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"MCP servers should be tested similarly to web and mobile applications."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By the end of this article, you will know the five tests that turn an MCP demo into a production gate, and you will know exactly where each one belongs in the life of a real MCP server: before release, after deployment, and across the schema changes that inevitably come later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2gpjaveo6s5ks2hi2w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2gpjaveo6s5ks2hi2w4.png" alt="Testing Gates" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your MCP server works in the demo. The tools show up in the client. A few manual calls succeed. Then you deploy it behind a real auth layer, invite real users onto it, and the failures start. The client auto-detects the wrong transport. Tool schema drift breaks a scenario you thought was stable. Latency spikes under concurrent load. A prompt that looked harmless turns out to be vulnerable to injection. None of these failures is unusual. They are what happens when a production interface is tested like a toy.&lt;/p&gt;

&lt;p&gt;This is why MCP server testing has to be treated as a first-class engineering discipline. If MCP for AI is analogous to HTTP for humans, then MCP servers are the AI-facing web servers and mobile applications of your organization's backends. They are remote services. They are security-sensitive. They are mostly stateless interface layers over internal systems. And like any other production interface layer, they need a full testing lifecycle, not a single “does this tool work?” check.&lt;/p&gt;

&lt;p&gt;If you read the previous articles in this series, you already know the design principles: outcome-oriented tools, prompts for repeatable workflows, resources for governed context, and code mode as a controlled long-tail escape hatch. This article covers the next question: how do you prove that a server designed that way actually works in production?&lt;/p&gt;

&lt;p&gt;Here is the central claim: &lt;strong&gt;most MCP server failures are boundary failures, not model failures&lt;/strong&gt;. The model gets blamed because that is what the user sees. But in production, the break usually happens at the boundary: handshake, schema, workflow, scale, or security. That is the concept worth carrying through the rest of the article.&lt;/p&gt;

&lt;p&gt;To make this concrete, I will use a recurring example from an MCP App I published: &lt;strong&gt;Chess Coach&lt;/strong&gt;. It is an MCP server with UI widgets that let a player paste a game as PGN, FEN, or a move list and request position analysis, move suggestions, opening principles, or endgame guidance. It is exactly the kind of server many teams underestimate. Because it feels narrow, people are tempted to think of it as a personal assistant living near one developer's desktop. It is not. Once published, it becomes a durable interface that multiple hosts, users, and versions of different clients will call over time. The tools may evolve. Prompts may be added. Resources may be revised. Widget metadata may need to satisfy multiple host runtimes. Users might want to bypass the freemium subscription and get premium features for free. That is what the tests need to protect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MCP? (The 30-Second Version)
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) defines the interface between AI clients and external systems through tools, prompts, and resources. In enterprise deployments, you should think of the MCP server as a thin, remote, AI-facing interface layer over internal systems. That has three immediate testing consequences.&lt;/p&gt;

&lt;p&gt;First, you are not only testing business logic. You are also testing handshake behavior, transport behavior, capability discovery, auth boundaries, and response contracts. Second, because the server should be mostly stateless, many of the highest-value tests happen at the protocol boundary: can any compliant client connect, discover, invoke, and recover correctly? Third, because this interface is security-sensitive, testing must include not only correctness and performance, but also active security probing.&lt;/p&gt;

&lt;p&gt;There is also an ecosystem reason to take this seriously. There will be many more MCP servers than MCP clients, just as there are many more websites than browsers. That means server quality is where ecosystem reliability is won or lost.&lt;/p&gt;

&lt;p&gt;And this is the mindset shift many teams still need to make: a production MCP server is not your personal local assistant running next to your IDE. It is an interface for AI agents that act on behalf of thousands of users over time. It will outlive the first developer who wrote it. Its tool schemas will change. New prompts will be added. Resources will be revised. Secrets will rotate. Clients will interpret the surface differently. If it is an MCP App, the widget contract will evolve too. Once you see it that way, testing stops looking like developer hygiene and starts looking like interface governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Gates
&lt;/h2&gt;

&lt;p&gt;When teams say they have “tested” an MCP server, they often mean one of two weak things: they clicked around in a visual tool, or they called a tool once and got the expected output. That is useful, but it is not sufficient.&lt;/p&gt;

&lt;p&gt;A practical testing strategy for MCP servers has five production gates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smoke&lt;/strong&gt;: Can the server be reached, initialized, and discovered?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformance&lt;/strong&gt;: Does it actually behave like a compliant MCP server?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scenarios&lt;/strong&gt;: Do real workflows keep working release after release?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: What happens when concurrency, latency, and throughput become real?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pentest&lt;/strong&gt;: What happens when the client is adversarial rather than friendly?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That five-gate frame is the organizing idea of this article. If your server has not passed all five gates, it is still a demo. The stack mirrors the actual risk profile of a production MCP server: a remote MCP server can fail at the protocol, workflow, scale, or security layer. Good testing covers all four, and good release discipline turns that coverage into explicit gates.&lt;/p&gt;

&lt;p&gt;You can also think of the stack in terms of tooling roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Inspector&lt;/strong&gt; for interactive exploration and debugging during development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mcp-tester&lt;/code&gt; / &lt;code&gt;cargo pmcp test&lt;/code&gt;&lt;/strong&gt; for automated protocol, capability, and scenario testing that makes those tests easier to integrate into the development lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cargo pmcp preview&lt;/code&gt;&lt;/strong&gt; for interactive exploration and debugging during development of UI based MCP Apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cargo pmcp loadtest&lt;/code&gt;&lt;/strong&gt; for performance and capacity validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cargo pmcp pentest&lt;/code&gt;&lt;/strong&gt; for security validation and release gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That progression matters. The Inspector helps you understand what the server is doing. The CLI tools help you prove that it continues to do it correctly over time. That is the proof of work: not that you ran one command, but that you built a repeatable validation path from development to production. The point is not to admire the surface once. The point is to keep trusting it after the tenth schema revision and the thousandth user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspector For Humans, CLI For Pipelines
&lt;/h2&gt;

&lt;p&gt;The official MCP Inspector is still the right place to start when you are designing or debugging a server by hand. It gives you the visual feedback loop: discover tools, inspect schemas, call operations manually, read resources, test prompts, and watch the protocol messages move.&lt;/p&gt;

&lt;p&gt;That makes it excellent for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interactive exploration during development&lt;/li&gt;
&lt;li&gt;debugging a broken schema or response&lt;/li&gt;
&lt;li&gt;understanding how a new capability behaves&lt;/li&gt;
&lt;li&gt;quick manual smoke checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical variant many teams find useful for ad-hoc smoke testing is to use an MCP-capable client such as Claude Code (or other LLM-driven developer clients). After registering your server with the client (for example: &lt;code&gt;claude mcp add &amp;lt;server-name&amp;gt; -t &amp;lt;http server-url&amp;gt;&lt;/code&gt;), you can ask the client in natural language to exercise the server: initialize it, list capabilities, call a couple of tools, and report back the results. These clients can run quick checks efficiently and even produce a short test summary or report. That workflow is extremely convenient for exploratory development or a quick sanity check after deployment.&lt;/p&gt;

&lt;p&gt;Two important caveats apply. First, running ad hoc checks via an LLM-driven client is not as repeatable or auditable as running versioned scenario tests in your repository. Second, subtle differences in how different clients interpret prompts or present results mean that an LLM-driven smoke check is a complement, not a replacement, for formal scenario testing. In practice, we recommend using client-driven checks to quickly discover issues and generate or refine scenarios — and then committing those scenarios to your test suite so they can be executed deterministically by &lt;code&gt;cargo pmcp test&lt;/code&gt; (or &lt;code&gt;mcp-tester&lt;/code&gt;) as part of CI.&lt;/p&gt;

&lt;p&gt;But interactive exploration is not the end of the testing story. It is the beginning.&lt;/p&gt;

&lt;p&gt;Once the server surface is understood, the work must move to automated tooling. That is where &lt;code&gt;mcp-tester&lt;/code&gt; comes in. It is the engine for repeatable protocol and scenario testing, and &lt;code&gt;cargo pmcp test&lt;/code&gt; builds on the same model, making testing part of the normal server lifecycle rather than a separate exercise. That transition is the difference between a personal development server and a production interface. Personal servers are explored. Production servers are exercised repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lifecycle Matters More Than The Tool
&lt;/h2&gt;

&lt;p&gt;One of the design goals behind the &lt;a href="https://github.com/paiml/rust-mcp-sdk/tree/main/cargo-pmcp" rel="noopener noreferrer"&gt;PMCP testing tooling&lt;/a&gt; is that the lifecycle should be easy to adopt, even if your server is not written in Rust.&lt;/p&gt;

&lt;p&gt;That point matters. Unit tests in your server code may be language-specific. But the server-level testing surface is at the protocol level. If your server speaks MCP over HTTP or stdio, the same testing workflow can validate whether the implementation is Rust, TypeScript, Python, or Go.&lt;/p&gt;

&lt;p&gt;This is the practical split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use your language-native test framework for unit tests and server-internal logic.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;mcp-tester&lt;/code&gt; directly if you want a standalone protocol-level tester.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;cargo pmcp test&lt;/code&gt; for smoke checks, capability discovery, conformance, scenario execution, and project-level test management.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;cargo pmcp loadtest&lt;/code&gt; for concurrency, latency, and capacity validation.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;cargo pmcp pentest&lt;/code&gt; for active security probing and CI security gates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important point is that &lt;code&gt;cargo pmcp test&lt;/code&gt; is not trying to replace language-native tests. It wraps the server-facing lifecycle around the underlying tester, so teams can check, generate, run, upload, download, and list scenarios within the same development flow they already use for building and deploying servers.&lt;/p&gt;

&lt;p&gt;That gives enterprise teams a single testing lifecycle even when they have multiple server implementations across different languages.&lt;/p&gt;

&lt;p&gt;For the Chess Coach MCP App, that means the same lifecycle can validate more than chess logic. It can validate that move-analysis tools are discoverable, that opening and endgame resources remain readable, that prompt workflows still guide the user correctly, and that the board widget still renders through host-specific app contracts. That is the right scope. A production MCP App consists of a server and an interface contract, not just a handler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate 1: Smoke
&lt;/h2&gt;

&lt;p&gt;The fastest way to test a server is not a browser demo. It is a protocol smoke test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;check http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This verifies the things that actually matter first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The server is reachable&lt;/li&gt;
&lt;li&gt;The initialization handshake works&lt;/li&gt;
&lt;li&gt;Capabilities are advertised&lt;/li&gt;
&lt;li&gt;Tools, resources, and prompts can be discovered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For remote servers, this is also the quickest way to catch annoying operational failures that only show up after deployment: wrong URL, wrong transport, wrong auth setup, proxy behavior, or non-compliant responses.&lt;/p&gt;

&lt;p&gt;For the Chess Coach, smoke is not merely “does &lt;code&gt;analyze_position&lt;/code&gt; return a score?” It is: can the host initialize the server, discover the chess tools, discover the related resources and prompts, and see the app surface without tripping over deployment-time mistakes? This is where missing secrets, broken environment configuration, or a bad reverse-proxy setup show up immediately. We have repeatedly seen servers fail here, not because the business logic was wrong, but because capability discovery depended on a configuration that was absent in one environment. If a server cannot reliably complete initialization, the rest of the product is unavailable to the client.&lt;/p&gt;

&lt;p&gt;In practice, one of the most common bottlenecks we found while load testing MCP servers was not in the tool handler at all. It was in &lt;code&gt;initialize&lt;/code&gt;. Cold starts made the first handshake far more expensive than teams expected, especially when capability discovery rebuilt the same metadata every time. In many cases, the fix was not a heroic optimization. It was simply caching what could safely be cached: server metadata, tool schemas, prompt definitions, resource descriptors, and app metadata. Smoke tests matter because they force you to look at the first contact, and that's often where production feels slow.&lt;/p&gt;
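
&lt;p&gt;A minimal sketch of that caching fix, assuming the discovery metadata is static for the lifetime of the process; the names here are illustrative, not the PMCP SDK API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::sync::OnceLock;

// Build the discovery metadata once and reuse it for every handshake,
// instead of reconstructing it inside each `initialize`.
static TOOL_CATALOG: OnceLock&lt;String&gt; = OnceLock::new();

fn tool_catalog() -&gt; &amp;'static str {
    TOOL_CATALOG.get_or_init(|| {
        // Stand-in for the expensive part: reflecting over handlers and
        // serializing tool schemas, prompt definitions, resource
        // descriptors, and app metadata to JSON.
        build_catalog_json()
    })
}

fn build_catalog_json() -&gt; String {
    r#"{"tools":[],"prompts":[],"resources":[]}"#.to_string()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
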

&lt;p&gt;If you are debugging a transport issue, &lt;code&gt;check&lt;/code&gt; is where you start. By using the &lt;code&gt;--verbose&lt;/code&gt; flag, you can see the raw JSON-RPC messages on the wire. This is a lifesaver when diagnosing boundary failures—such as the server returning valid HTML instead of JSON, or sending unexpected protocol frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;check https://api.example.com/mcp &lt;span class="nt"&gt;--verbose&lt;/span&gt;
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;check https://api.example.com/mcp &lt;span class="nt"&gt;--transport&lt;/span&gt; jsonrpc
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;check https://api.example.com/mcp &lt;span class="nt"&gt;--transport&lt;/span&gt; http
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the most important practical shifts for enterprise teams. Treat the MCP handshake like you would treat a health check or API contract check, not like an afterthought. A server that fails to initialize reliably is not “mostly working.” It is down. Smoke is the first gate because nothing above it matters until the boundary is alive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate 2: Conformance
&lt;/h2&gt;

&lt;p&gt;A reachable server is not automatically a correct server. It may initialize and still violate MCP expectations in ways that break real clients later.&lt;/p&gt;

&lt;p&gt;That is why the next layer is conformance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;conformance https://api.example.com/mcp &lt;span class="nt"&gt;--strict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conformance command validates the server against the 2025-11-25 MCP protocol spec across protocol domains, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;core handshake and error behavior&lt;/li&gt;
&lt;li&gt;tools&lt;/li&gt;
&lt;li&gt;resources&lt;/li&gt;
&lt;li&gt;prompts&lt;/li&gt;
&lt;li&gt;tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one matters especially if your server implements long-running operations. Tasks are the explicit exception to the otherwise stateless request model, so they need explicit lifecycle testing too.&lt;/p&gt;

&lt;p&gt;This is also where the “remote, secure, mostly stateless” architecture becomes testable as a concrete property instead of a slogan. A compliant server should expose clear capabilities, clear transitions, and predictable behavior at the protocol boundary. If it does not, clients compensate poorly, and your business users perceive the result as “the agent is unreliable.”&lt;/p&gt;

&lt;p&gt;This is also where multi-client reality shows up. A server can appear acceptable on one client and still be off-spec enough to behave differently on another. In practice, schema conformance is where differences between hosts, such as ChatGPT and Claude Desktop, become concrete. Tool descriptions, prompt arguments, resources, app metadata, and transport assumptions are all interpreted through slightly different host behaviors. The discipline here is simple: if a change modifies tools, adds prompts, updates resources, or adjusts metadata, treat it as an interface change and run conformance tests against the clients that matter to you. Conformance is the second gate because reachability without correctness is just a subtler failure.&lt;/p&gt;

&lt;p&gt;The Chess Coach example makes this visible quickly. A seemingly small schema change, such as adjusting the move-list input shape, renaming a field in a move-suggestion response, or revising widget metadata for the board view, may still appear to work correctly in a single manual test. But if ChatGPT expects one app descriptor shape and Claude Desktop is stricter about another edge of the contract, you do not have “one minor change.” You have a compatibility regression across clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate 3: Scenarios
&lt;/h2&gt;

&lt;p&gt;Conformance tells you the server is valid. It does not tell you it is useful.&lt;/p&gt;

&lt;p&gt;That is where scenario testing comes in. Scenario tests model actual workflows instead of isolated calls. They are the closest thing to a regression suite for business outcomes.&lt;/p&gt;

&lt;p&gt;The PMCP tooling supports two practical steps here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a starter scenario from the server’s declared capabilities.&lt;/li&gt;
&lt;li&gt;Edit it into a real regression test based on your users’ workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition, LLMs can often generate starter scenario files directly from testing requirements specified by a business analyst: describe the business-oriented acceptance criteria, and the LLM can produce a first-pass scenario YAML which engineers then refine and version as part of the test suite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;generate http://localhost:3000
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;run http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same generate-then-curate pattern we discussed for tool design. The generated scenario is not the final artifact. It is the starting point. You replace placeholder values, add assertions around real business behavior, and version the scenario in your repository.&lt;/p&gt;

&lt;p&gt;If you are using pmcp.run as the hosting and operations layer, the same lifecycle extends into remote test management. You can upload these scenarios to the platform so that the operations team can run the exact same tests the developers wrote as scheduled production health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;upload &lt;span class="nt"&gt;--server&lt;/span&gt; my-server scenarios/
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;list &lt;span class="nt"&gt;--server&lt;/span&gt; my-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That matters for enterprise teams because testing should not stop at the local repository boundary. The same scenarios that protect development should also be attachable to the deployed server lifecycle.&lt;/p&gt;

&lt;p&gt;That is how testing becomes domain-led, engineering-implemented, platform-governed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The domain lead decides which workflows matter&lt;/li&gt;
&lt;li&gt;Engineering turns them into executable scenario tests&lt;/li&gt;
&lt;li&gt;The platform team runs them in CI and release pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For MCP servers, this is vastly more valuable than a grab bag of manual checks because it tells you whether the real outcomes continue to work. For the Chess Coach, those outcomes are easy to picture and worth encoding. A scenario might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Analyze&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Sicilian&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Defense'&lt;/span&gt;
  &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_call&lt;/span&gt;
    &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;analyze_position&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;fen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;KQkq&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;c6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
  &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content[0].text"&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Master"&lt;/span&gt; &lt;span class="c1"&gt;# Expecting high-level opening guidance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you have encoded these workflows, you get repeatable verification that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a player pastes a PGN or move list and gets a coherent position analysis&lt;/li&gt;
&lt;li&gt;a follow-up request for move suggestions returns legal, ranked options&lt;/li&gt;
&lt;li&gt;an opening position triggers opening guidance rather than generic advice&lt;/li&gt;
&lt;li&gt;a simplified late-game position triggers endgame principles instead of opening theory&lt;/li&gt;
&lt;li&gt;the board widget stays aligned with the returned analysis, so the user can scroll the conversation and understand the game evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want one testing habit that pays back immediately, it is this one: every important prompt or outcome-oriented tool should have at least one scenario test that models the way a real business user invokes it.&lt;/p&gt;

&lt;p&gt;This is also the gate that catches the quiet regressions teams otherwise miss. A tool gets renamed. A prompt argument changes shape. A resource URI moves. A server starts requiring a secret that is present in one environment and missing in another. On paper, none of these changes looks dramatic. In production, they are interface breaks. Scenario tests make those breaks visible before users do. Scenarios are the third gate because this is where protocol validity becomes business confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test The Capability Surface, Not Just The Handler
&lt;/h2&gt;

&lt;p&gt;The testing surface should mirror the MCP surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: validate discovery, schema quality, error handling, and expected outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: validate workflow shape, step ordering, and message generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: validate discoverability, readability, and URI stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: validate creation, polling, transitions, and completion semantics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apps&lt;/strong&gt;: validate widget metadata and host compatibility if your tools return UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one reason a generic API tester is not enough. MCP adds capability discovery and agent-facing metadata on top of ordinary request/response behavior. Those surfaces need explicit tests.&lt;/p&gt;

&lt;p&gt;If you are building MCP Apps, the &lt;code&gt;apps&lt;/code&gt; subcommand is particularly useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;apps http://localhost:3000
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;apps http://localhost:3000 &lt;span class="nt"&gt;--mode&lt;/span&gt; chatgpt &lt;span class="nt"&gt;--strict&lt;/span&gt;
cargo pmcp &lt;span class="nb"&gt;test &lt;/span&gt;apps http://localhost:3000 &lt;span class="nt"&gt;--mode&lt;/span&gt; claude-desktop &lt;span class="nt"&gt;--strict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That catches the class of failures where the tool itself works, but the widget metadata or resource references are broken for the host runtime.&lt;/p&gt;

&lt;p&gt;For the Chess Coach, that matters as much as schema conformance for the tools themselves. A user may successfully invoke move analysis and still have a broken product experience if the annotated board widget points to the wrong &lt;code&gt;ui.resourceUri&lt;/code&gt;, advertises the wrong MIME type, or uses metadata that one host tolerates and another rejects. The interface is not only the JSON schema. The interface is the full contract between the server and the host runtime.&lt;/p&gt;

&lt;p&gt;This is where MCP Apps force a useful discipline on teams. If your product promise includes visual state, then host compatibility is part of server correctness. ChatGPT compatibility and Claude Desktop compatibility are not “nice to have UI checks.” They are conformance checks on the real user-facing interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate 4: Load
&lt;/h2&gt;

&lt;p&gt;A local tool call that succeeds once tells you almost nothing about production behavior.&lt;/p&gt;

&lt;p&gt;Remote MCP servers sit on network paths, auth layers, databases, downstream APIs, and often serverless or autoscaled infrastructure. The real question is not only “does it work?” but “what happens at 10 users, 100 users, or when the client starts parallelizing?”&lt;/p&gt;

&lt;p&gt;That is the role of &lt;code&gt;cargo pmcp loadtest&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp loadtest init https://api.example.com/mcp
cargo pmcp loadtest run https://api.example.com/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The load testing model is deliberately scenario-driven. The generated &lt;code&gt;loadtest.toml&lt;/code&gt; defines a weighted mix of MCP operations, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a high percentage of tool calls&lt;/li&gt;
&lt;li&gt;a smaller percentage of resource reads&lt;/li&gt;
&lt;li&gt;prompt fetches where relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a better match for real MCP traffic than a generic HTTP benchmark because it tests the actual capability mix your server exposes.&lt;/p&gt;

&lt;p&gt;For the Chess Coach, realistic traffic is not a single, repeated call forever. It is a mix: initialize, capability discovery, a sequence of move-analysis requests, occasional prompt or resource lookups, and widget-related retrieval. That is exactly the kind of blend that reveals whether the server was designed as a thin, reusable interface or as a local helper process that happened to be deployed.&lt;/p&gt;

&lt;p&gt;In practice, load testing answers questions your platform team will care about immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the p95 and p99 latencies for actual MCP operations?&lt;/li&gt;
&lt;li&gt;How does latency degrade as concurrency rises?&lt;/li&gt;
&lt;li&gt;Which tool or capability is the bottleneck?&lt;/li&gt;
&lt;li&gt;How much capacity do we need before rollout?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of just running a benchmark and giving you a final report, the &lt;code&gt;loadtest&lt;/code&gt; engine actively searches for capacity. It features a built-in &lt;code&gt;BreakingPointDetector&lt;/code&gt; that monitors a sliding window of metrics. It will automatically detect and warn you exactly when your server starts to "break" during a ramp-up—whether that means error rates suddenly spiking or p99 latency degrading significantly compared to the baseline.&lt;/p&gt;

&lt;p&gt;Furthermore, the engine implements coordinated omission correction. If your server stalls for five seconds under load, it doesn't just record one slow request. It accounts for all the requests that &lt;em&gt;should&lt;/em&gt; have been sent during that stall, giving you a brutally honest p99 for production planning.&lt;/p&gt;
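
&lt;p&gt;The idea behind that correction fits in a few lines. A simplified sketch, assuming a fixed target send rate (the real engine is more sophisticated): each response is measured against the time its request was scheduled to be sent, not the time it actually left.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::time::{Duration, Instant};

// Coordinated-omission correction, simplified: latency for request i is
// measured from its *intended* send time (start + i * interval), so a
// server stall penalizes every request queued behind it.
fn corrected_latencies(
    start: Instant,
    interval: Duration,
    completions: &amp;[Instant],
) -&gt; Vec&lt;Duration&gt; {
    completions
        .iter()
        .enumerate()
        .map(|(i, &amp;done)| {
            let intended = start + interval * i as u32;
            done.saturating_duration_since(intended)
        })
        .collect()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the server stalls, every completion after the stall is measured against its intended slot, so the tail percentiles include the queueing the stall caused.&lt;/p&gt;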

&lt;p&gt;This is especially important if your architecture claims to be stateless. Stateless services are supposed to scale horizontally. Load testing is how you verify that they actually do.&lt;/p&gt;

&lt;p&gt;In our experience, load testing has been especially good at exposing two kinds of self-inflicted bottlenecks. The first is cold-start work pushed into &lt;code&gt;initialize&lt;/code&gt; that should have been cached or precomputed. The second is hidden dependency work, especially secret resolution or configuration discovery that looked harmless in development and became expensive or brittle in production. If a server needs a secret to answer a tool call, that is one thing. If it breaks the handshake or capability discovery due to a missing secret, that is a design bug.&lt;/p&gt;

&lt;p&gt;That distinction matters for the Chess Coach, too. The server may legitimately need engine setup or downstream configuration to analyze positions. But it should not force every first-time client to rebuild capabilities or fail the handshake due to a missing unrelated secret. A remote MCP server has to be resilient at the boundary before it can be useful in the middle. Load is the fourth gate because scale failures are boundary failures too, just delayed ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate 5: Pentest
&lt;/h2&gt;

&lt;p&gt;Authentication is not security. A valid JWT does not prove the server is safe. A private deployment does not prove the interface is hardened. And a server that behaves correctly for friendly clients may still behave dangerously for adversarial ones.&lt;/p&gt;

&lt;p&gt;That is why the testing lifecycle needs an explicit security phase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo pmcp pentest https://api.example.com/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pentest engine discovers the attack surface and then runs MCP-specific and transport/security-focused categories. Because it is protocol-aware, it goes far beyond generic web scanning. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Poisoning (Rug Pull Detection):&lt;/strong&gt; It compares tool descriptions between the initial discovery and subsequent calls to detect if a server changes its "personality" or instructions after the handshake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Exfiltration:&lt;/strong&gt; It probes resource URIs for SSRF vulnerabilities, actively searching for cloud metadata endpoints (like &lt;code&gt;169.254.169.254&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth Flow:&lt;/strong&gt; It verifies your JWT implementations by testing for algorithm confusion (e.g., trying &lt;code&gt;alg:none&lt;/code&gt;, sketched after this list, or using weak keys).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Injection:&lt;/strong&gt; It tests for system prompt extraction and instruction overrides directly through tool arguments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Security, Transport Security, and Protocol Abuse&lt;/strong&gt; checks complete the suite.&lt;/li&gt;
&lt;/ul&gt;
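
&lt;p&gt;To make one of those probes concrete: an &lt;code&gt;alg:none&lt;/code&gt; check sends a structurally valid but unsigned JWT. A minimal sketch of what that token looks like, using the &lt;code&gt;base64&lt;/code&gt; crate (this illustrates the probe's payload, not the pentest engine's internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use base64::{engine::general_purpose::URL_SAFE_NO_PAD, Engine};

// An "alg:none" token: valid JWT structure, empty signature segment.
// A hardened server must reject it even though it parses cleanly.
fn alg_none_token(subject: &amp;str) -&gt; String {
    let header = URL_SAFE_NO_PAD.encode(br#"{"alg":"none","typ":"JWT"}"#);
    let payload = URL_SAFE_NO_PAD.encode(format!(r#"{{"sub":"{subject}"}}"#));
    format!("{header}.{payload}.")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
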

&lt;p&gt;The profiles are intentionally practical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# MCP-specific checks first&lt;/span&gt;
cargo pmcp pentest https://api.example.com/mcp &lt;span class="nt"&gt;--profile&lt;/span&gt; quick

&lt;span class="c"&gt;# Full scan&lt;/span&gt;
cargo pmcp pentest https://api.example.com/mcp &lt;span class="nt"&gt;--profile&lt;/span&gt; deep

&lt;span class="c"&gt;# CI gate&lt;/span&gt;
cargo pmcp pentest https://api.example.com/mcp &lt;span class="nt"&gt;--fail-on&lt;/span&gt; medium &lt;span class="nt"&gt;--format&lt;/span&gt; sarif &lt;span class="nt"&gt;--output&lt;/span&gt; findings.sarif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the right mental model for enterprise teams: pentesting is not a separate security ceremony that happens once a quarter. It is a repeatable part of the server lifecycle, with severity thresholds and machine-readable output. Pentest is the fifth gate because secure-by-design is not credible until it has survived hostile input.&lt;/p&gt;

&lt;p&gt;The PMCP pentest engine is opinionated in a way that aligns well with the MCP threat model. It is not just throwing random HTTP garbage at a server. It knows about MCP-specific attack classes such as tool poisoning and prompt injection, which is exactly what you want from a protocol-aware security tool.&lt;/p&gt;

&lt;p&gt;Even the Chess Coach example benefits from this discipline. A chess server sounds benign until you remember that the input surface may include arbitrary PGN text, comments, annotations, or long move histories supplied through an AI client. If the server, its prompts, or its widget metadata interpret that input too generously, the product can still become a security problem. Pentest is what keeps a “harmless domain app” from becoming an “unguarded remote interface.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Adopt the Pattern Across the Lifecycle
&lt;/h2&gt;

&lt;p&gt;A practical MCP testing pipeline is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;During development&lt;/strong&gt;: use quick smoke checks and interactive exploration to validate the server surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before merge&lt;/strong&gt;: run unit tests plus curated scenario tests that reflect real business workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before release&lt;/strong&gt;: run conformance checks and targeted load baselines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;As a release gate&lt;/strong&gt;: run a pentest with policy thresholds and block deployment on serious findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In production&lt;/strong&gt;: keep running smoke and scenario checks against the live endpoint.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important step is adoption. Use the same testing model across local development, CI, staging, and production so your team is validating the real interface, not just a local approximation of it.&lt;/p&gt;

&lt;p&gt;This workflow is also meant to grow with the community. If your team develops useful scenario suites, domain-specific load profiles, or new categories of protocol-aware security checks, contribute them back to the project. That is how MCP testing becomes stronger for everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Short Closing Thought
&lt;/h2&gt;

&lt;p&gt;Start small, but start now. Add smoke checks first. Then add one or two real scenario tests for the workflows your users care about most. From there, make conformance, load, and pentest part of the normal release lifecycle.&lt;/p&gt;

&lt;p&gt;If you are building MCP servers seriously, treat testing as part of the product surface. Adopt these patterns throughout your development lifecycle, use them consistently across teams, and contribute improvements back to the open-source project so the ecosystem keeps getting better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continue the Series
&lt;/h2&gt;

&lt;p&gt;This article covered how to validate an MCP server across the full lifecycle: protocol correctness, workflow regression, performance, and security. The rest of the series goes deeper.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concerned about security architecture?&lt;/strong&gt; &lt;strong&gt;MCP Security&lt;/strong&gt; covers secure-by-design server architecture, OAuth 2.1, input validation, and the threat model behind the pentest categories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building from an existing API spec?&lt;/strong&gt; &lt;strong&gt;Schema-Driven MCP Servers&lt;/strong&gt; shows how to generate and then prune a server surface before you lock in your tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need interactive UI?&lt;/strong&gt; &lt;strong&gt;MCP Apps&lt;/strong&gt; covers building MCP Apps with UI widgets, and why app metadata testing belongs in the same release pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interested in code mode?&lt;/strong&gt; &lt;strong&gt;Code Mode for MCP&lt;/strong&gt; explores the &lt;code&gt;validate_code&lt;/code&gt; then &lt;code&gt;execute_code&lt;/code&gt; pattern and the controls that should be tested around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need long-running execution?&lt;/strong&gt; &lt;strong&gt;Tasks for MCP&lt;/strong&gt; covers the task lifecycle and how to test the one major exception to the stateless request model.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llm</category>
      <category>rust</category>
    </item>
    <item>
      <title>MCP Prompts and Resources: The Primitives You're Not Using</title>
      <dc:creator>Guy</dc:creator>
      <pubDate>Thu, 09 Apr 2026 17:36:08 +0000</pubDate>
      <link>https://forem.com/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1</link>
      <guid>https://forem.com/aws-heroes/mcp-prompts-and-resources-the-primitives-youre-not-using-3oo1</guid>
      <description>&lt;p&gt;Your user asks for a weekly sales report. The LLM has four tools available: querying the database, aggregating data, calculating trends, and formatting the output. It chains them together. Steps 1 and 2 go fine. Step 3 goes wrong: the LLM tries to calculate week-over-week percentage changes itself, mixes up which week is the baseline, and produces a report showing 340% growth in a category that actually declined. The user gets a polished, confident, completely wrong report.&lt;/p&gt;

&lt;p&gt;This isn't a contrived scenario. It's the predictable outcome of asking an LLM to choreograph a multi-step workflow where some steps require symbolic computation. The LLM is good at language. It is bad at arithmetic. And when you give it tools for each individual step, you're asking it to be good at something else entirely: sequencing, data flow management, and knowing which steps it should delegate versus attempt itself.&lt;/p&gt;

&lt;p&gt;Now consider the alternative. The same user clicks a single prompt: "Weekly Sales Report." The server executes the deterministic steps: queries the database, aggregates by category, calculates trends server-side using exact arithmetic, and hands the LLM a precomputed dataset with one instruction: format this as an executive summary. The report is correct every time because the server handled the parts that require precision, and the LLM handled the parts that require language.&lt;/p&gt;

&lt;p&gt;If you read &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;our article on tool design&lt;/a&gt;, you know how to build tools that LLMs can use well. But tools solve individual tasks. What about multi-step workflows where the steps must happen in a specific order, with data flowing between them, and some steps requiring computation that LLMs shouldn't be doing? That's where MCP's second primitive comes in: prompts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;business analyst&lt;/strong&gt; — one of the two human corners of the Capability Square we introduced in that article — knows which workflows their business users run every week. The right operating model is domain-led, engineering-implemented, platform-governed: the analyst brings the workflow knowledge, engineers implement it, and the platform team governs how it runs in production. That is what lets the weekly sales report, the incident response runbook, or the customer onboarding checklist show up as a reliable one-click workflow for the person who actually runs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MCP? (The 30-Second Version)
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP, &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" rel="noopener noreferrer"&gt;spec 2025-11-25&lt;/a&gt;) defines three primitives for connecting AI models to external services: tools, prompts, and resources. The &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;previous article&lt;/a&gt; covered tools, which are model-controlled primitives that let LLMs invoke server-side operations. This article covers the other two: prompts, which are user-controlled workflow packages, and resources, which provide application-controlled context. Together, the three primitives form a complete system for AI-service integration.&lt;/p&gt;

&lt;p&gt;The enterprise mental model is the same one from the previous article: MCP for AI is what HTTP-based applications are for humans. MCP servers are the AI-facing web servers or mobile applications for your organization's data systems, which is why they are usually remote services rather than local helpers. They should also be thin and mostly stateless: an interface layer over internal systems, not a stateful application tier of their own. The main exception is explicit long-running task handling (MCP Tasks), where state is persisted deliberately because the work itself outlives a single request. We will describe Tasks in a future article in the series.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Control Planes
&lt;/h2&gt;

&lt;p&gt;MCP's three primitives aren't just three types of capability. They represent three distinct control planes, or three answers to the question "who decides when this gets used?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools are model-controlled.&lt;/strong&gt; The LLM (model) decides when to invoke them. When a user asks, "Where's my order?", the LLM selects &lt;code&gt;track_latest_order&lt;/code&gt; from the available tools. The user never explicitly chose that tool; the LLM's reasoning did. This is the right model for individual tasks where the LLM's judgment about which tool to call is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompts are user-controlled.&lt;/strong&gt; The human explicitly triggers them. In Claude Desktop, they appear as slash commands. In other clients, they show up as menu items or quick actions. The user sees "Weekly Sales Report" and clicks it, entering a week number. There's no ambiguity about what will happen, no LLM judgment about which workflow to run. The user chose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources are application-controlled.&lt;/strong&gt; The host application decides when to pull them into context. A resource might be a database schema, a configuration file, or a live dashboard. The application injects it into the conversation when relevant. For example, loading an API schema before a coding task. Neither the user nor the LLM explicitly requested it; the application determined it was needed.&lt;/p&gt;

&lt;p&gt;This taxonomy tells you which primitive to use. If the LLM should decide, use a tool. If the user decides, use a prompt. If the application decides, use a resource.&lt;/p&gt;

&lt;p&gt;In practice, many enterprise deployments add one more concept on top of these three primitives: &lt;strong&gt;Tasks&lt;/strong&gt;. Tasks are not part of the base three-way split. They are an extension pattern for long-running operations such as scans, report generation, provisioning, or approvals. They are also the main exception to the normal stateless model. The request/response interface remains stateless, but the server explicitly persists task state and exposes progress or completion, rather than relying on sticky in-memory sessions.&lt;/p&gt;

&lt;p&gt;This maps cleanly onto the Capability Square from the previous article — and prompts are where the split between the two human corners pays off the most:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control Plane&lt;/th&gt;
&lt;th&gt;Who Triggers at Runtime&lt;/th&gt;
&lt;th&gt;Square Corner(s)&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;The LLM (model)&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Intent interpretation, tool selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompts&lt;/td&gt;
&lt;td&gt;The business user&lt;/td&gt;
&lt;td&gt;Business Analyst (authors) + Business User (triggers)&lt;/td&gt;
&lt;td&gt;Workflow knowledge encoded once, invoked many times&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources&lt;/td&gt;
&lt;td&gt;The host application&lt;/td&gt;
&lt;td&gt;Host + Server&lt;/td&gt;
&lt;td&gt;Context management, data access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7qf7k2pfbysr8vbutzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7qf7k2pfbysr8vbutzw.png" alt="MCP Prompts on the Capability Square" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompts span both human corners of the square. The business analyst — the domain lead for workflow design — encodes an expert workflow into a prompt at design time. Engineers implement that workflow, and the platform team governs its deployment and control. The business user triggers it with one click at runtime. The prompt is literally the handoff artifact between the two humans: the analyst's workflow knowledge, packaged so a user doesn't need to recreate it every Monday morning.&lt;/p&gt;

&lt;p&gt;Tools, by contrast, sit under the LLM corner because the model's judgment determines when they are called. Resources sit at the boundary between the host application and the MCP server: the host decides &lt;em&gt;when&lt;/em&gt; to pull a resource into context, but the server &lt;em&gt;provides&lt;/em&gt; the content. This is the one primitive in which two actors collaborate without either human being directly in the loop, which partly explains why its ecosystem support lags behind that of tools and prompts.&lt;/p&gt;

&lt;p&gt;When all three control planes work together, the system covers every type of interaction: ad-hoc tasks (tools), structured workflows (prompts), and ambient context (resources). And because resource loading is application-dependent, the host may or may not inject the right resource at the right time — so an important role of prompt workflows is to explicitly load the relevant resources into context as part of the workflow definition. This ensures the LLM has the context it needs, regardless of what the host application decided to provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Primitive You're Not Using
&lt;/h2&gt;

&lt;p&gt;Most MCP servers expose tools. A growing number expose resources. Almost none expose prompts.&lt;/p&gt;

&lt;p&gt;Browse the MCP ecosystem, the tutorial repositories, the example servers, and the community showcases, and you'll find tool after tool after tool. Prompts are either absent entirely or limited to trivial "system message" wrappers that add no value beyond what the user could type themselves. The MCP official blog didn't publish its first &lt;a href="https://blog.modelcontextprotocol.io/posts/2025-07-29-prompts-for-automation/" rel="noopener noreferrer"&gt;prompts-for-automation post&lt;/a&gt; until mid-2025, months after the protocol launched. The ecosystem followed suit: tools are easy to demo, prompts require thinking about workflows, and most tutorials took the easy path.&lt;/p&gt;

&lt;p&gt;There's another reason prompts are underutilized: minimal SDK support. Most MCP SDKs treat prompts as simple message templates: you return a list of messages, and that's it. There's no built-in abstraction for multi-step workflows, data flow between steps, or hybrid execution where the server handles some steps and the LLM handles others. This is precisely why the PMCP (Pragmatic MCP) SDK added deep support for workflow prompts as an enterprise feature: the &lt;code&gt;SequentialWorkflow&lt;/code&gt; abstraction we'll demonstrate in this article. Without SDK support, building reliable workflow prompts requires significant boilerplate that most teams don't invest in.&lt;/p&gt;

&lt;p&gt;This is a missed opportunity. Prompts solve a reliability problem that tools cannot solve for known, repeatable workflows.&lt;/p&gt;

&lt;p&gt;Consider the gap. When you leave a multi-step workflow entirely to the LLM, using only tools, you're relying on instruction-only orchestration: the LLM reads the tool descriptions, figures out the right sequence, handles data flow between steps, and decides which computations to delegate versus attempt. In our experience building production MCP servers with the PMCP SDK, testing multi-step workflows like report generation, data pipelines, and incident response across multiple LLMs, instruction-only approaches typically achieve 60-70% compliance for complex workflows. That means 30-40% of the time, the LLM gets something wrong: a step out of order, a calculation it shouldn't have attempted, a variable lost between tool calls.&lt;/p&gt;

&lt;p&gt;Now compare hybrid execution, where the prompt defines the workflow, the server executes the deterministic steps, and the LLM fills in only where its language intelligence is needed. In the same test scenarios, hybrid execution typically achieves 85-95% compliance. These numbers come from internal benchmarks, not published studies, and will vary by model, workflow complexity, and domain, but the direction is consistent: reducing the LLM's decision space materially improves reliability.&lt;/p&gt;

&lt;p&gt;The reason is straightforward. Prompts reduce the LLM's decision space and move &lt;strong&gt;workflow state management&lt;/strong&gt; from the LLM's volatile context to explicit server-side execution state. In a multi-step tool chain, the LLM must track variables between calls, remember which step it's on, and pass results forward correctly, all in its context window, where information degrades with distance. In a workflow prompt, the server manages that state deterministically through request-scoped execution and, when necessary, explicitly persisted state. The LLM receives a pre-built plan with most steps already completed. It only needs to handle the parts that genuinely require language understanding: summarization, formatting, and inference.&lt;/p&gt;

&lt;p&gt;The most common failure mode has a name: &lt;strong&gt;calculation hallucination&lt;/strong&gt;. When an LLM sees a "calculate" tool and a "format" tool, it often skips the calculation tool to save a round trip and attempts the arithmetic itself. The result looks plausible and the format is right, but the numbers are wrong. Hybrid execution prevents this entirely: the server runs the calculation, the LLM never sees the raw numbers, and the result is correct by construction.&lt;/p&gt;

&lt;p&gt;If you're measuring task completion across diverse requests (and you should be, as we argued in the tool design article), prompts are how you push completion rates from "usually works" to "reliably works" for your most common workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Protocol to Workflow
&lt;/h2&gt;

&lt;p&gt;At the protocol level, a prompt is simple: the client calls &lt;code&gt;prompts/get&lt;/code&gt; with a name and arguments, and the server returns a &lt;code&gt;GetPromptResult&lt;/code&gt; containing a sequence of &lt;code&gt;PromptMessage&lt;/code&gt; values. Each message has a role (&lt;code&gt;user&lt;/code&gt; or &lt;code&gt;assistant&lt;/code&gt;) and content (text, images, or embedded resources). The client uses these messages to populate the conversation and guide the LLM's response. Clients discover available prompts via &lt;code&gt;prompts/list&lt;/code&gt; -- parallel to &lt;code&gt;tools/list&lt;/code&gt; -- and present them to users as slash commands, menu items, or quick actions. The key difference from tools: the user explicitly selects them. There's no LLM reasoning about which prompt to invoke.&lt;/p&gt;
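
&lt;p&gt;To make the shape concrete, here is a simplified sketch of that exchange. The payloads are illustrative, not captured from a real client, but the structure follows the spec's &lt;code&gt;prompts/get&lt;/code&gt; result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request:   prompts/get   {"name": "weekly_sales_report", "arguments": {"week": "2026-W12"}}

Response (GetPromptResult, simplified):
{
  "description": "Generate a formatted weekly sales report...",
  "messages": [
    {"role": "user",
     "content": {"type": "text", "text": "Generate weekly sales report for 2026-W12"}}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;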

&lt;p&gt;At this protocol level, prompts are message templates. Useful for setting up context, but not fundamentally different from what the user could type themselves. The real power emerges when you move from templates to workflows: multi-step processes in which data flows between steps, and the server executes what it can before handing off to the LLM. In a production deployment, that workflow engine should still fit the same remote, mostly stateless service model: deterministic steps execute within the request, and truly long-running work is broken out into explicit tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An important distinction:&lt;/strong&gt; base MCP prompts are message templates. The server-executed workflow behavior shown below is a PMCP SDK abstraction built on top of prompts, tools, and resources. It uses the prompt protocol as the entry point, but adds a workflow engine that executes deterministic steps server-side before returning the message sequence to the client. This is not part of the MCP spec -- it's what a well-designed SDK can layer on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Weekly Sales Report: One Click, Complex Result
&lt;/h2&gt;

&lt;p&gt;Here's the weekly sales report as a &lt;code&gt;SequentialWorkflow&lt;/code&gt; -- a PMCP abstraction where each step can feed data into the next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;
    &lt;span class="nn"&gt;dsl&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_arg&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;SequentialWorkflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// SequentialWorkflow: a multi-step prompt where data flows between steps.&lt;/span&gt;
&lt;span class="c1"&gt;// Unlike SyncPrompt (which builds static messages), SequentialWorkflow&lt;/span&gt;
&lt;span class="c1"&gt;// orchestrates tool calls with typed data bindings between steps.&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sales_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SequentialWorkflow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"weekly_sales_report"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Generate a formatted weekly sales report with trends and key metrics."&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Arguments: what the user provides when triggering this prompt&lt;/span&gt;
&lt;span class="nf"&gt;.argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"week"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Week identifier (e.g., '2026-W12')"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;.argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Output format: summary or detailed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Step 1: Query sales database (server executes -- deterministic)&lt;/span&gt;
&lt;span class="c1"&gt;// The server calls query_database with constant + user-provided args.&lt;/span&gt;
&lt;span class="c1"&gt;// No LLM needed: this is pure data retrieval.&lt;/span&gt;
&lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query_sales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query_database"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"weekly_sales"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"week"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;prompt_arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"week"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sales_data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// output available as "sales_data" for later steps&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Step 2: Aggregate by category (server executes -- deterministic)&lt;/span&gt;
&lt;span class="c1"&gt;// Uses the output from step 1. The server chains these automatically.&lt;/span&gt;
&lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregate_metrics"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sales_data"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;// entire output from step 1&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"group_by"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"product_category"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregated"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Step 3: Calculate week-over-week trends (server executes -- deterministic)&lt;/span&gt;
&lt;span class="c1"&gt;// This is the step that failed in our opening scenario when the LLM&lt;/span&gt;
&lt;span class="c1"&gt;// tried to do it. The server handles the arithmetic correctly every time.&lt;/span&gt;
&lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"calc_trends"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"calculate_trends"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"current_week"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregated"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"comparison"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"previous_week"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"trends"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Step 4: Format as executive summary (LLM needed -- natural language)&lt;/span&gt;
&lt;span class="c1"&gt;// This step requires intelligence: choosing which metrics to highlight,&lt;/span&gt;
&lt;span class="c1"&gt;// writing prose summaries, deciding what "noteworthy" means.&lt;/span&gt;
&lt;span class="c1"&gt;// The server provides the data and guidance; the LLM provides the writing.&lt;/span&gt;
&lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format_report"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format_output"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.with_guidance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"Format the aggregated data into an executive summary for week {week}.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             Highlight the top 3 performing categories and any &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             week-over-week trends that exceed 10% change.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             Use the report template for consistent formatting."&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.with_resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"template://reports/weekly-sales"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Report template resource"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregated"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"trends"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"trends"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;prompt_arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"report"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the data flow through the DSL helpers. &lt;code&gt;prompt_arg("week")&lt;/code&gt; pulls the user-provided week into step 1. &lt;code&gt;from_step("sales_data")&lt;/code&gt; feeds step 1's entire output into step 2. &lt;code&gt;from_step("aggregated")&lt;/code&gt; chains step 2's result into step 3. Each &lt;code&gt;bind("name")&lt;/code&gt; names a step's output, allowing subsequent steps to reference it. The data flows forward through the workflow without any LLM involvement in the plumbing.&lt;/p&gt;

&lt;p&gt;Steps 1-3 are deterministic. The server executes them because each parameter can be resolved from prompt arguments (&lt;code&gt;prompt_arg&lt;/code&gt;), constants (&lt;code&gt;constant&lt;/code&gt;), or prior-step bindings (&lt;code&gt;from_step&lt;/code&gt;). No judgment required. No natural language interpretation. Just data retrieval, aggregation, and arithmetic.&lt;/p&gt;

&lt;p&gt;Step 4 is where the server stops and hands off. The &lt;code&gt;format_output&lt;/code&gt; tool needs LLM intelligence for natural language summarization: choosing which metrics to highlight, writing prose, deciding what "noteworthy" means. The server provides everything the LLM needs -- the aggregated data (from steps 1-3), the guidance (what to highlight), and a report template resource. The LLM's job is reduced to writing.&lt;/p&gt;

&lt;p&gt;Remember the opening scenario? The LLM tried to calculate week-over-week trends and got the arithmetic wrong—mixing up baselines and producing a report showing 340% growth in a category that actually declined. With this workflow, the server handles the arithmetic in step 3. Deterministically. Correctly. Every time. The LLM only enters at step 4, where its strength—natural language—is needed.&lt;/p&gt;

&lt;p&gt;Registration ties the workflow to the server's existing tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nn"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query_database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_db_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aggregate_metrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"calculate_trends"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trends_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"format_output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_templates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.prompt_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;  &lt;span class="c1"&gt;// validates bindings and registers as prompt&lt;/span&gt;
    &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
    &lt;span class="nf"&gt;.run_streamable_http&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0:3000"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;.prompt_workflow()&lt;/code&gt; validates the workflow's bindings at registration time. If you reference a binding that doesn't exist -- say, &lt;code&gt;from_step("sales_data")&lt;/code&gt; with a typo -- you get an error at startup, not a runtime surprise when a user triggers the prompt. The tools you already built become the building blocks. The workflow just orchestrates them.&lt;/p&gt;
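
&lt;p&gt;A minimal sketch of what that registration-time check buys you. The misspelled binding is deliberate, and &lt;code&gt;aggregate_tool&lt;/code&gt; is the same handler from the registration example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// (imports as in the workflow example above)
// A deliberately broken workflow: no step ever binds "sales_dta".
let broken = SequentialWorkflow::new(
    "broken_report",
    "Demonstrates binding validation at registration time."
)
.argument("week", "Week identifier", true)
.step(
    WorkflowStep::new("aggregate", ToolHandle::new("aggregate_metrics"))
        .arg("data", from_step("sales_dta"))  // typo: should be "sales_data"
        .bind("aggregated"),
);

// .prompt_workflow() rejects the dangling reference at startup,
// instead of failing at runtime when a user first triggers the prompt.
let result = Server::builder()
    .tool("aggregate_metrics", aggregate_tool)
    .prompt_workflow(broken);
assert!(result.is_err());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;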

&lt;p&gt;The user clicks one prompt. A database query, an aggregation, and a trend calculation happen server-side in milliseconds. The LLM receives the complete data and writes the summary. One click, complex result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partial Execution Plans: The Server Does What It Can
&lt;/h2&gt;

&lt;p&gt;When a user invokes the weekly sales report prompt, the server doesn't just return instructions. It returns a &lt;em&gt;partial execution plan&lt;/em&gt;: a conversation trace showing what was already done and what remains.&lt;/p&gt;

&lt;p&gt;The server executed steps 1-3 and embedded the actual results. Here's a simplified version of what the client LLM receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message 1 (User): "Generate weekly sales report for 2026-W12"
Message 2 (Assistant): "Plan: 1. Query sales DB  2. Aggregate  3. Calculate trends  4. Format"
Message 3 (Assistant): "Calling query_database..."
Message 4 (Tool Result): {"total_revenue": 284500, "transactions": 1247, ...}  ← PRE-EXECUTED by server
Message 5 (Assistant): "Calling aggregate_metrics..."
Message 6 (Tool Result): {"categories": [{"name": "Enterprise", "revenue": 142000}, ...]}  ← PRE-EXECUTED by server
Message 7 (Assistant): "Calling calculate_trends..."
Message 8 (Tool Result): {"enterprise": "+12%", "smb": "-3%", "startup": "+28%", ...}  ← PRE-EXECUTED by server
Message 9 (Assistant): "Format the aggregated data into an executive summary for 2026-W12..."
Message 10 (Resource): [weekly-sales template content]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Messages 1-8 are done. The tool results (Messages 4, 6, 8) were pre-executed by the server; the LLM didn't call those tools. It receives actual data — real revenue numbers, real category breakdowns, real trend percentages — not instructions to fetch that data. The server already queried the database, already aggregated, already calculated. The results are embedded in the conversation trace as if the tools had been called, but no LLM decision-making was involved.&lt;/p&gt;

&lt;p&gt;Message 9 is the guidance for the remaining step. Message 10 is the resource template. The LLM's job is reduced to: take this data, follow this guidance, use this template, write prose. That's one decision (how to write the summary) instead of the dozens of decisions the instruction-only approach requires (which tools to call, in what order, how to handle errors, whether to do the arithmetic itself).&lt;/p&gt;

&lt;p&gt;This is not a template. It's an execution plan where the server has already completed the deterministic portion. The distinction matters: a template says "do these steps." A partial execution plan says, "these steps are done and here are the results, now do the remaining steps." The LLM starts from step 4, not step 1.&lt;/p&gt;

&lt;p&gt;This is the Capability Square operating at the workflow level. The &lt;strong&gt;server&lt;/strong&gt; handles deterministic computation — its strength. The &lt;strong&gt;LLM&lt;/strong&gt; handles natural language — its strength. The &lt;strong&gt;business analyst&lt;/strong&gt; designed the workflow at design time, identifying which steps are deterministic and which require intelligence — their strength. And the &lt;strong&gt;business user&lt;/strong&gt; invoked it at runtime with the specific parameters (the week, the service, the severity) that only they, living inside the working context, can provide — their strength. All four corners working together, not on a single tool call, but across an entire workflow.&lt;/p&gt;

&lt;p&gt;The compliance improvement is consistent across our internal benchmarks. Instruction-only approaches simply tell the LLM "follow these steps: 1. Query the sales DB, 2. Aggregate by category, 3. Calculate trends, 4. Format as a summary" and leave every decision to it: the model might skip steps, reorder them, call different tools, or do the arithmetic itself (badly). Hybrid execution, where steps 1-3 are already done and the LLM just needs to format, dramatically narrows the decision space. Far fewer decisions, far fewer failure points, far more reliable output.&lt;/p&gt;

&lt;p&gt;Test this with your own workflows. Take a 4-step process that your team runs weekly. Build it as an instruction-only prompt, then as a &lt;code&gt;SequentialWorkflow&lt;/code&gt; with hybrid execution. Run both 20 times. The difference in successful completions will make the case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident Response: When the Server Needs the LLM
&lt;/h2&gt;

&lt;p&gt;The sales report workflow was mostly deterministic: three server-executed steps, one LLM step. But not every workflow splits that cleanly. Consider incident response, where the server gathers data, but the LLM needs to do the hard part of synthesis and recommendation.&lt;/p&gt;

&lt;p&gt;A 5-step incident response workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check service status (server executes -- API call, deterministic)&lt;/li&gt;
&lt;li&gt;Pull recent error logs (server executes -- log query, deterministic)&lt;/li&gt;
&lt;li&gt;Correlate with recent deployments (server executes -- git/deploy history lookup, deterministic)&lt;/li&gt;
&lt;li&gt;Draft incident summary (LLM needed -- synthesis, pattern recognition, writing)&lt;/li&gt;
&lt;li&gt;Suggest mitigation steps (LLM needed -- reasoning about root cause, recommending actions)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the sketch -- not a full implementation, but enough to see the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nn"&gt;SequentialWorkflow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"incident_response"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Investigate and summarize a service incident"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Service name or ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Severity level: P1, P2, P3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Steps 1-3: Server handles (deterministic data gathering)&lt;/span&gt;
    &lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* check_service_status -- server executes */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* query_error_logs -- server executes */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* check_recent_deploys -- server executes */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Steps 4-5: LLM handles (intelligence required)&lt;/span&gt;
    &lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"draft_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"create_incident_report"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.with_guidance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"Analyze the service status, error logs, and deployment history.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 Draft an incident summary for {service} at severity {severity}.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 Include: timeline, affected systems, error patterns, and &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 correlation with recent deployments."&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"service_status"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"logs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"error_logs"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deploys"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deploy_history"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;WorkflowStep&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"suggest_mitigation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;ToolHandle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"recommend_actions"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.with_guidance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"Based on the incident summary, suggest 2-3 mitigation steps.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 If the incident correlates with a recent deployment, include &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 a rollback recommendation."&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"recommendations"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
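
&lt;p&gt;For illustration, here is what the first elided server-side step would look like, written out with the same DSL helpers. The tool name comes from the comment in the sketch, and the binding matches the &lt;code&gt;from_step("service_status")&lt;/code&gt; reference in step 4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Step 1, filled in: pure data retrieval, executed by the server.
.step(
    WorkflowStep::new("check_status", ToolHandle::new("check_service_status"))
        .arg("service", prompt_arg("service"))
        .bind("service_status")  // consumed later via from_step("service_status")
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;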



&lt;p&gt;The split is different from the sales report, which was mostly deterministic: three server steps and one LLM step. Incident response is three server steps and two LLM steps, because the analysis and recommendation require genuine intelligence. But the constant is the same: the server gathers all the data the LLM needs before handing off. The LLM doesn't have to figure out which APIs to call or which logs to check. It receives the service status, error logs, and deployment history, then applies its strengths: synthesis and reasoning.&lt;/p&gt;

&lt;p&gt;Notice that step 5 depends on step 4's output (&lt;code&gt;from_step("summary")&lt;/code&gt;). The LLM executes both steps, but the data dependency is explicit in the workflow. The business analyst who designed this workflow decided that the mitigation suggestions should be based on the incident summary rather than the raw data. That's domain knowledge encoded in the workflow structure.&lt;/p&gt;

&lt;p&gt;The partial execution plan for this workflow looks different, too. The server executes steps 1-3 and embeds the results. The LLM receives three steps' worth of data and two steps' worth of guidance. It drafts the summary, then uses that summary to suggest mitigations. The workflow is longer, the LLM does more, but the pattern is identical: the server handles the deterministic parts, the LLM handles the intelligence parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Analyst's Playbook: Learning What Business Users Need
&lt;/h2&gt;

&lt;p&gt;The weekly sales report and the incident response share something important: someone who understands the organization's workflows designed them. That someone is the &lt;strong&gt;business analyst&lt;/strong&gt; — one of the two human corners of the Capability Square. In a strong enterprise setup, workflow design is domain-led, engineering-implemented, and platform-governed. The analyst shares a domain with the business users they're designing for, and their role doesn't end at tool design. It extends to workflow design: identifying which processes their business users run repeatedly, which steps are deterministic, and where the LLM's intelligence adds value.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the benefits of adding workflow prompts to MCP servers: they dramatically reduce the effort required from busy business users, and they significantly increase both the completion rate and the consistency of requests:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hdnkp197kssqj3xz7or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hdnkp197kssqj3xz7or.png" alt="Why to design MCP Prompts" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how to approach workflow prompt design in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observe your users.&lt;/strong&gt; What tasks do they repeat weekly? Monthly? What multi-step processes do they describe as "the usual"? These are prompt candidates. Every Monday, the sales team generates a weekly report. Every time there's an outage, the ops team runs the same diagnostic sequence. Every quarter, the finance team reconciles accounts. These are not ad hoc tasks; they are workflows that run on a schedule, with the same steps, for the same reasons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identify the deterministic core.&lt;/strong&gt; For each repeating workflow, ask: which steps are always the same? Which steps require judgment? The always-the-same steps become server-executed workflow steps with &lt;code&gt;constant()&lt;/code&gt; and &lt;code&gt;from_step()&lt;/code&gt; bindings. The judgment steps become LLM-guided steps with &lt;code&gt;.with_guidance()&lt;/code&gt;. The sales report's trend calculation is always the same arithmetic. The incident response's mitigation recommendation always requires judgment. The split is usually obvious once you look for it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one prompt.&lt;/strong&gt; Don't build 20 prompts. Build the one prompt that saves the most time for the most users. Measure its completion rate. Iterate. This mirrors the tool design advice from the &lt;a href="https://dev.to/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc"&gt;previous article&lt;/a&gt;: start with the 20% that serves 80%. For prompts, start with the one workflow your team runs most often.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connect prompts to tools.&lt;/strong&gt; Prompts don't replace tools -- they orchestrate them. Your existing tools become the building blocks of workflow prompts. A SequentialWorkflow's steps call your tools via &lt;code&gt;ToolHandle&lt;/code&gt;. The &lt;code&gt;query_database&lt;/code&gt;, &lt;code&gt;aggregate_metrics&lt;/code&gt;, and &lt;code&gt;calculate_trends&lt;/code&gt; tools existed independently before the sales report workflow was built. The workflow just wired them together with data flow and execution order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate based on failure modes.&lt;/strong&gt; If the LLM consistently gets step N wrong, move step N to the server side. If the server can't handle step M because it requires judgment, move it to the LLM with clear guidance. The boundary between deterministic and intelligent steps is not fixed -- it's something you discover through observation and measurement.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The business analyst's role is to encode organizational knowledge into the MCP server — knowledge they are qualified to encode precisely because they share a domain with the business users who will invoke it. Tools encode individual capabilities. Prompts encode workflows — the sequences, the data flow, the decision about which steps need human-level intelligence and which don't. You know which workflows matter. You know which steps are deterministic. You know where the LLM's intelligence adds value. Encode that knowledge in prompts.&lt;/p&gt;

&lt;p&gt;Track prompt invocation frequency and completion rates. A prompt that's invoked 50 times a week with 90% completion is saving your team hours of manual orchestration. A prompt that's never invoked is telling you something about your understanding of user needs. Both signals are useful -- one tells you what to optimize, the other tells you what to rethink.&lt;/p&gt;
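
&lt;p&gt;If your server doesn't emit these metrics yet, even a crude in-process counter is a start. A minimal sketch, independent of any SDK support (the struct and method names here are ours, not PMCP's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashMap;

// Crude per-prompt usage tracking: invocations vs. successful completions.
// In production, feed these into your normal metrics pipeline instead.
#[derive(Default)]
struct PromptMetrics {
    counts: HashMap&lt;String, (u64, u64)&gt;, // (invocations, completions)
}

impl PromptMetrics {
    fn invoked(&amp;mut self, prompt: &amp;str) {
        self.counts.entry(prompt.to_string()).or_default().0 += 1;
    }
    fn completed(&amp;mut self, prompt: &amp;str) {
        self.counts.entry(prompt.to_string()).or_default().1 += 1;
    }
    fn completion_rate(&amp;self, prompt: &amp;str) -&gt; Option&lt;f64&gt; {
        self.counts
            .get(prompt)
            .map(|&amp;(inv, done)| done as f64 / inv.max(1) as f64)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;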

&lt;p&gt;None of this removes the need for security-by-design. Prompts are not "just UX." They package access to real systems and real workflows. The same controls apply here as in tools: per-request authn and authz, policy checks on downstream operations, audit logs, rate limits, secret isolation, and clear boundaries on which systems the workflow may touch. If a workflow includes code mode, the controls need to be tighter still: validate first, approve when the risk warrants it, and execute only within a constrained sandbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources: The Application-Controlled Plane
&lt;/h2&gt;

&lt;p&gt;We've covered tools (model-controlled) and prompts (user-controlled). The third primitive is resources: application-controlled context that the host application pulls into the conversation.&lt;/p&gt;

&lt;p&gt;Resources are read-only reference material -- documentation, schemas, configuration, templates. They provide context that helps agents make better decisions. Where tools perform actions and prompts orchestrate workflows, resources serve information on request. They are passive: the server publishes them, and the client or prompt reads them when needed.&lt;/p&gt;

&lt;p&gt;Here's a resource using the PMCP SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;StaticResource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ResourceCollection&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Resources provide context data that agents can read before acting.&lt;/span&gt;
&lt;span class="c1"&gt;// Unlike tools (which perform actions) or prompts (which orchestrate workflows),&lt;/span&gt;
&lt;span class="c1"&gt;// resources are passive: they serve information on request.&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;ResourceCollection&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.add_resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;StaticResource&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"docs://sales/schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"# Sales Database Schema&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             ## Tables&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             - `orders`: order_id, customer_id, total, created_at&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             - `products`: product_id, name, category, price&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             - `customers`: customer_id, name, email, segment&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             ## Common Queries&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             Weekly sales: GROUP BY date_trunc('week', created_at)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             By category: JOIN products ON orders.product_id = products.product_id"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sales Database Schema"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"Database schema and common query patterns for the sales system. &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
             Read this before constructing database queries."&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.with_mime_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"text/markdown"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;URI design matters. Use scheme prefixes to organize your resources: &lt;code&gt;docs://&lt;/code&gt; for documentation, &lt;code&gt;config://&lt;/code&gt; for configuration, &lt;code&gt;data://&lt;/code&gt; for structured data, &lt;code&gt;template://&lt;/code&gt; for report and output templates. The URI is a stable identifier that clients and prompts reference -- &lt;code&gt;docs://sales/schema&lt;/code&gt; tells both humans and agents what they'll find before reading it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.with_description()&lt;/code&gt; call serves the same purpose as tool descriptions: it helps agents decide whether a resource is relevant before reading its content. A well-described resource lets an agent skip resources it doesn't need, reducing unnecessary context in the conversation.&lt;/p&gt;

&lt;p&gt;Notice how this connects to the weekly sales report prompt. In that workflow, step 4 used &lt;code&gt;.with_resource("template://reports/weekly-sales")&lt;/code&gt; to fetch a report template and embed its content in the conversation trace. Resources provide the context that makes prompts more effective -- the LLM reads the schema to understand the data it's formatting, reads the template to follow the expected output structure. Resources and prompts are designed to work together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ecosystem Reality Check
&lt;/h2&gt;

&lt;p&gt;Resources are the least mature of the three MCP primitives in terms of client support. The spec defines them comprehensively -- annotations, subscriptions, URI templates, content types. The PMCP SDK supports them fully. But client implementations lag behind.&lt;/p&gt;

&lt;p&gt;Most MCP clients implement the &lt;code&gt;resources/list&lt;/code&gt; and &lt;code&gt;resources/read&lt;/code&gt; protocol operations, but the user experience varies significantly. Claude Desktop requires users to explicitly select resources from a list. There is no standardized resource picker UI across clients. And critically, resource access is a client-side operation -- the LLM has no built-in way to request a resource the way it can call a tool. Unless the client proactively injects resources into context, or the server wraps resource access as a tool, the LLM never sees them.&lt;/p&gt;

&lt;p&gt;The gap between spec and ecosystem is real. The MCP specification describes a rich resource system with subscriptions for change notifications, URI templates for parameterized access, and annotations for priority and freshness signals. In practice, most clients implement the basics (list and read) and skip the rest. If you build a resource-heavy server today, you're building ahead of client support.&lt;/p&gt;

&lt;p&gt;This doesn't mean you shouldn't build resources. It means you should build them with realistic expectations about how they'll be consumed today, while designing for where the ecosystem is headed. The patterns in the next section bridge the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pragmatic Bridge Patterns: Making Resources Work Today
&lt;/h2&gt;

&lt;p&gt;Four patterns let you get value from resources today, regardless of client support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Wrap resources as tools&lt;/strong&gt; (most reliable today). Instead of serving a resource at &lt;code&gt;docs://sales/schema&lt;/code&gt;, create a &lt;code&gt;get_sales_schema&lt;/code&gt; tool that returns the same content. The LLM discovers and calls tools reliably -- this is the pragmatic path when you need agents to access reference data without depending on client resource support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bridge pattern: expose resource content as a tool.&lt;/span&gt;
&lt;span class="c1"&gt;// Until clients reliably handle resources, tools are the safe path.&lt;/span&gt;
&lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"get_sales_schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* returns the same content as docs://sales/schema */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't elegant, but it works everywhere. You can maintain both the resource (for clients that support it) and the tool wrapper (for clients that don't), serving the same underlying content through both channels.&lt;/p&gt;
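
&lt;p&gt;Filled out, the wrapper might look like the sketch below. This is our illustration, not an SDK-prescribed shape: it reuses the &lt;code&gt;TypedToolWithOutput&lt;/code&gt; pattern from the previous article, and &lt;code&gt;SALES_SCHEMA_MD&lt;/code&gt; is a hypothetical constant holding the same markdown the resource serves -- one source of content, two channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Sketch: one content constant backs both the resource and the tool.
// SALES_SCHEMA_MD is hypothetical; embed or load your actual schema doc.
const SALES_SCHEMA_MD: &amp;amp;str = include_str!("../docs/sales_schema.md");

/// No parameters -- the tool simply returns the schema document.
#[derive(Debug, Deserialize, JsonSchema)]
#[schemars(deny_unknown_fields)]
struct GetSalesSchemaInput {}

#[derive(Debug, Serialize, JsonSchema)]
struct SalesSchemaOutput {
    /// Markdown content, identical to what docs://sales/schema serves
    content: String,
}

// Registered on the same ServerBuilder chain as any other typed tool:
.tool(
    "get_sales_schema",
    TypedToolWithOutput::new(
        "get_sales_schema",
        |_input: GetSalesSchemaInput, _extra: RequestHandlerExtra| {
            Box::pin(async move {
                Ok(SalesSchemaOutput { content: SALES_SCHEMA_MD.to_string() })
            })
        },
    )
    .with_description(
        "Get the sales database schema and common query patterns. \
         Read this before constructing database queries.",
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
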

&lt;p&gt;&lt;strong&gt;2. Resource templates as parameterized access.&lt;/strong&gt; URI templates like &lt;code&gt;docs://reports/{report_type}&lt;/code&gt; let the server generate URIs from parameters. When clients support resource templates, they can offer auto-complete for resource URIs -- the user types &lt;code&gt;docs://reports/&lt;/code&gt; and sees available report types. This pattern is worth implementing now because it costs nothing extra and will work well as clients catch up.&lt;/p&gt;
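
&lt;p&gt;The expansion itself is plain string substitution (RFC 6570, at its simplest level). A toy illustration -- not the SDK's implementation -- just to make the mechanics concrete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Toy single-variable template expansion, for illustration only.
fn expand(template: &amp;amp;str, report_type: &amp;amp;str) -&amp;gt; String {
    template.replace("{report_type}", report_type)
}

fn main() {
    let uri = expand("docs://reports/{report_type}", "weekly-sales");
    assert_eq!(uri, "docs://reports/weekly-sales");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
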

&lt;p&gt;&lt;strong&gt;3. Prompt-mediated resource loading.&lt;/strong&gt; This is the pattern we already saw: &lt;code&gt;.with_resource(uri)&lt;/code&gt; in SequentialWorkflow steps. The server fetches the resource during prompt execution and embeds it in the conversation. This works today because it doesn't depend on client resource support at all -- the server handles the resource loading internally, and the client just sees the content in the prompt messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Subscribe and automatic injection&lt;/strong&gt; (future pattern). Clients can subscribe to resource changes via &lt;code&gt;resources/subscribe&lt;/code&gt;. When the resource updates, the server sends a notification, and the client can refresh its context. This enables "always up-to-date context" without manual polling -- imagine an agent that automatically gets the latest API schema whenever it changes. This is where resources are headed. When client support catches up, automatic resource injection will make context management seamless.&lt;/p&gt;

&lt;p&gt;Build your resources now. Use bridge patterns for today's clients. As the ecosystem matures, your resources will work natively -- and you'll already have the content, the URIs, and the descriptions in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Three control planes, three primitives.&lt;/strong&gt; Tools are model-controlled (the LLM decides). Prompts are user-controlled (the human decides). Resources are application-controlled (the host decides). Knowing which to use is the first design decision for any MCP capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompts solve the workflow reliability problem.&lt;/strong&gt; For known, repeatable workflows, hybrid execution -- where the server handles deterministic steps and the LLM handles intelligence -- consistently outperforms instruction-only orchestration in our benchmarks. Each party does what it's built for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partial execution plans are the key differentiator.&lt;/strong&gt; A prompt doesn't just send instructions. It returns a conversation trace with completed tool results, guidance for remaining steps, and embedded resource content. The LLM receives data, not directions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The business analyst designs workflows, not just tools.&lt;/strong&gt; Observe which tasks your business users repeat. Identify the deterministic core. Package it as a SequentialWorkflow. Start with one prompt for your team's most common workflow and measure its completion rate. This is the handoff between the two human corners of the square: the analyst encodes once at design time, the business user triggers many times at runtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resources are underbuilt but worth building.&lt;/strong&gt; Client support is thin today. Use bridge patterns -- wrap as tools, prompt-mediated loading -- for immediate value. Design for where the ecosystem is going, and your resources will be ready when clients catch up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tasks are the explicit exception to the stateless rule.&lt;/strong&gt; Most MCP interactions should stay stateless. When work outlives a single request, model it as a task with persisted state, progress tracking, and clear completion semantics instead of smuggling session state into the server process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompts and tools are complementary.&lt;/strong&gt; Prompts orchestrate tools. Your existing tools become the building blocks of workflow prompts. Good tool design (from the previous article in this series) makes good prompt design possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure prompt completion rates.&lt;/strong&gt; Track invocation frequency and success across diverse users. If a prompt is never invoked, your understanding of user needs may be wrong. If it fails consistently at step N, move step N server-side. Both signals guide iteration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Continue the Series
&lt;/h2&gt;

&lt;p&gt;This article covered prompts and resources -- the primitives that turn individual tools into reliable workflows and ambient context. The rest of the series goes deeper.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Want to test your server?&lt;/strong&gt; See &lt;strong&gt;Testing MCP Servers&lt;/strong&gt; for unit testing, integration testing, and description quality validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concerned about security?&lt;/strong&gt; &lt;strong&gt;MCP Security&lt;/strong&gt; covers OAuth 2.1, input validation, and the common vulnerabilities that affect MCP servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building from an existing API spec?&lt;/strong&gt; &lt;strong&gt;Schema-Driven MCP Servers&lt;/strong&gt; shows the generate-then-prune workflow for going from OpenAPI spec to curated tool set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need interactive UI?&lt;/strong&gt; &lt;strong&gt;MCP Apps&lt;/strong&gt; covers building MCP Apps with UI widgets for rich agent experiences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interested in code mode?&lt;/strong&gt; &lt;strong&gt;Code Mode for MCP&lt;/strong&gt; explores the two-step &lt;code&gt;validate_code&lt;/code&gt; then &lt;code&gt;execute_code&lt;/code&gt; flow, including policy analysis, risk scoring, human approval, and sandboxed execution for the long tail of requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need long-running execution?&lt;/strong&gt; &lt;strong&gt;Tasks for MCP&lt;/strong&gt; covers the explicit task model for work that should not happen inside a single stateless request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For hands-on practice with these patterns, the &lt;a href="https://advanced-mcp-course.us-east.true-mcp.com/landing" rel="noopener noreferrer"&gt;Advanced MCP course&lt;/a&gt; provides guided exercises building production MCP servers in Rust with the PMCP SDK.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>rust</category>
      <category>llm</category>
    </item>
    <item>
      <title>MCP Tool Design: Why Your AI Agent Is Failing (And How to Fix It)</title>
      <dc:creator>Guy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 23:57:57 +0000</pubDate>
      <link>https://forem.com/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc</link>
      <guid>https://forem.com/aws-heroes/mcp-tool-design-why-your-ai-agent-is-failing-and-how-to-fix-it-40fc</guid>
      <description>&lt;h2&gt;
  
  
  The Reports of MCP's Death Have Been Greatly Exaggerated
&lt;/h2&gt;

&lt;p&gt;Scroll through developer forums in early 2026, and you'll find a recurring theme: MCP is dead. The takes range from dismissive ("just a fad") to resigned ("we tried it, our agents kept failing"). And the frustrations behind them are real. Teams are building MCP servers with 50+ tools, watching their agents stumble through tool selection, and concluding that the protocol itself is broken.&lt;/p&gt;

&lt;p&gt;It isn't. MCP isn't dead; it's being used poorly. And the evidence for how to use it well is now overwhelming.&lt;/p&gt;

&lt;p&gt;Over the past year, teams at GitHub, Block, and dozens of smaller shops have converged on the same set of principles. &lt;a href="https://github.blog/ai-and-ml/github-copilot/how-were-making-github-copilot-smarter-with-fewer-tools/" rel="noopener noreferrer"&gt;GitHub Copilot cut its tool count from 40 to 13&lt;/a&gt; and saw measurable benchmark improvements. &lt;a href="https://engineering.block.xyz/blog/blocks-playbook-for-designing-mcp-servers" rel="noopener noreferrer"&gt;Block rebuilt its Linear MCP server three times&lt;/a&gt;, going from 30+ tools to just 2. The pattern is consistent: fewer tools, better descriptions, outcome-oriented design. The problem isn't the protocol. It's tool design.&lt;/p&gt;

&lt;p&gt;This article lays out the framework. We'll start with the mental model that makes everything else click, the Capability Square, then walk through the anatomy of a well-designed tool. Subsequent articles in this series cover the quantitative evidence, description quality, and the anti-patterns that cause most failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MCP? (The 30-Second Version)
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is an open protocol that connects AI models to external tools and data sources. The simplest way to think about it: websites and mobile apps are the interface between humans and online services. MCP is the interface between AI and those same services. Over decades, we've invested heavily in improving human interfaces, including the iPhone's gesture language, years of UX research, accessibility standards, and usability testing. AI needs the same investment in its interface to online services. MCP is that interface, and tool design is its UX discipline.&lt;/p&gt;

&lt;p&gt;One of our clients came to us with exactly this gap. They wanted AI agents to operate their web forms: filling in fields, clicking buttons, navigating multi-step workflows through a browser. They asked us to run tests evaluating how well browser-based agents could complete their online forms, and to help "fix" the forms for agent compatibility. We explained that this was significant effort in the wrong direction. Their web forms were designed for humans, with visual layout, hover states, drag-and-drop interactions. Instead, we showed them that adding an MCP server to the same API sitting behind those forms gave AI agents a native interface purpose-built for how they work: structured inputs, clear descriptions, typed responses. The agents went from struggling with form fields to completing tasks reliably. The lesson: don't retrofit human interfaces for AI. Build AI-native interfaces alongside them, with MCP servers in front of your internal and external services.&lt;/p&gt;

&lt;p&gt;The parallels between UX design and MCP tool design run deep. Decades of UX research have produced principles that transfer directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Affordance&lt;/strong&gt;, the idea that a door handle should look pullable, maps to tool names and parameter descriptions: if a field is named &lt;code&gt;id&lt;/code&gt; but requires a UUID, the affordance is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recognition over recall&lt;/strong&gt;, the principle that it's easier to pick from a list than type from memory, maps to using enums and example values in schemas so the LLM recognizes valid inputs instead of guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility of system status&lt;/strong&gt;, the expectation that users get feedback when something goes wrong, maps to error messages that explain what happened and how to fix it, rather than a cryptic "invalid input." These aren't metaphors. They're the same design discipline applied to a different kind of user.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Capability Square: Four Parties, One Tool
&lt;/h2&gt;

&lt;p&gt;Even if you've been building MCP servers for months, don't skip this section. The Capability Square reframes tool design around two parties that most MCP discussions collapse into one or ignore entirely: the &lt;strong&gt;business analyst&lt;/strong&gt; who designs the server, and the &lt;strong&gt;business user&lt;/strong&gt; who actually invokes it. Both are domain experts — they know the business the server operates in — but they show up at different times. The analyst shows up at design time, encoding domain knowledge into tool names, descriptions, and schemas. The business user shows up at runtime, asking the questions the server was built to answer. Every MCP tool sits at the intersection of four parties, each with distinct strengths and weaknesses. Understanding this balance is the foundation of good tool design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aw3k68ah90pohjiq38c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aw3k68ah90pohjiq38c.png" alt="The Capability Square: Four Parties, One Tool" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The LLM (MCP Client)
&lt;/h3&gt;

&lt;p&gt;A large language model (LLM) is part of each MCP client, such as ChatGPT, Claude Desktop, or a custom agent. It brings language understanding, reasoning, and tool-calling intelligence. It's good at interpreting ambiguous user requests ("where's my package?"), choosing between available tools, composing multi-step plans, and recovering gracefully from errors.&lt;/p&gt;

&lt;p&gt;What it's bad at: domain knowledge and symbolic computation. An LLM doesn't know which API capabilities matter for your specific users, and it can't access your databases. It doesn't know that your customer support team needs order tracking, but never touches inventory management. It doesn't know your compliance requirements or your business rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  The MCP Server
&lt;/h3&gt;

&lt;p&gt;The server provides symbolic computation, data access, and validated operations. It's good at precise calculations, database queries, API calls with proper authentication, input validation, and returning structured results. It runs deterministically, and it is more predictable and easier to validate than LLM reasoning.&lt;/p&gt;

&lt;p&gt;What it's bad at: understanding user intent. A server can't interpret "check if we have enough widgets for the Johnson order" without a tool specifically designed for that workflow. It doesn't adapt to ambiguity. It does exactly what it's told, nothing more.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Business Analyst (Design-Time Domain Expert)
&lt;/h3&gt;

&lt;p&gt;This is the party that's most often overlooked, and it's the one that shapes everything the LLM ever sees. Critically, this is &lt;strong&gt;not&lt;/strong&gt; a technical role. The best server designer is not the software developer, not the platform engineer, not the IT team — it is a &lt;strong&gt;domain-expert business analyst&lt;/strong&gt;: the product manager, the operations lead, the subject-matter expert, or the analyst who sits closest to users and their workflows. They may pair with engineers who implement the server, but the design decisions — which tools to expose, what to name them, how to describe them, when they should be used — belong to someone fluent in the &lt;em&gt;business domain&lt;/em&gt;, not the underlying API. Technical fluency is not a substitute for domain fluency, and handing MCP design to the team that happens to own the codebase is one of the most common and expensive mistakes in this space.&lt;/p&gt;

&lt;p&gt;The business analyst brings knowledge that neither the LLM nor the server possesses. They know which 20% of an API serves 80% of actual requests. They understand the user personas and their existing processes. They know the vocabulary their users speak in, the business rules that constrain what "correct" means, and the compliance requirements that shape the edges. They know that the customer support team needs order tracking, but never touches inventory management. They know the business context.&lt;/p&gt;

&lt;p&gt;What they're bad at: &lt;strong&gt;being present at runtime&lt;/strong&gt;. The business analyst's knowledge has to be encoded into the tool's name, description, schema, and error messages before the first business user ever connects. Every design choice is a message to the LLM about how to use the tool.&lt;/p&gt;

&lt;p&gt;But "not present at runtime" doesn't mean "design it and walk away." Tool design is iterative. Your first design is a hypothesis about what your business users need, and like any hypothesis, it needs validation. Usage logs tell you which tools are called, which fail, which are never used, and which requests produce no tool match at all. The business analyst reviews these logs and refines: renaming tools that confuse the LLM, improving descriptions that lead to wrong selections, and adding tools for workflows that users need but the initial design missed.&lt;/p&gt;

&lt;p&gt;This iterative loop is where MCP shines compared to direct API integration. Changing a tool's name, description, or input schema is a server-side change, with no client updates, no SDK version bumps, no breaking changes propagated to consumers. The MCP protocol decouples tool discovery from tool invocation, so the LLM rediscovers the improved schema on the next connection. This makes the feedback cycle fast: observe failures, update the tool design, deploy, and measure again. Teams that treat tool design as a one-time exercise miss the biggest advantage of having MCP in the middle. You should put effort into designing the MCP server correctly, as "You never get a second chance to make a first impression." However, you should continue to monitor the MCP server usage logs to adjust to the usage patterns of real business users.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Business User (Runtime Domain Expert)
&lt;/h3&gt;

&lt;p&gt;The business user is the person who actually opens the MCP client and asks the question. They share the business analyst's domain — a customer support rep, a warehouse manager, a financial analyst, a clinician, an operations planner — but their expertise shows up at runtime, not design time. They bring the one thing no one else in the square possesses when a request is actually being made: &lt;strong&gt;the specific intent behind this specific request, in this specific business context, right now&lt;/strong&gt;. "The Johnson order." "The Q3 reconciliation." "The East Coast warehouse." These references mean nothing to the LLM or the server on their own; they only make sense because the business user lives inside a working context that the design-time parties can't fully predict.&lt;/p&gt;

&lt;p&gt;What they're good at: knowing what they actually want, recognizing a wrong answer when they see it, and framing requests in domain language. What they're bad at — and what a well-designed server should protect them from — are the things the analyst has already solved for them: they shouldn't have to know which tool to pick, which parameters to format, or which API endpoint underlies the answer. A well-designed MCP server makes the business user's domain fluency sufficient; they describe the outcome in their own vocabulary, and the system handles the rest.&lt;/p&gt;

&lt;p&gt;This is why the two human corners of the square must share a domain. If the business analyst designs tools for a persona they don't understand, no amount of schema polish will save the server: the vocabulary will be wrong, the outcomes won't match real requests, and the "obvious" tool for a given question won't exist. The tightest MCP servers are those where the analyst either &lt;em&gt;is&lt;/em&gt; a business user (dogfooding) or spends significant time watching them work. The feedback channel between the two human corners — usage logs, failed requests, "no tool matched" events — is the mechanism that keeps them aligned as users and workflows evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Square Matters
&lt;/h3&gt;

&lt;p&gt;Each corner of the square compensates for the others' weaknesses, and each lives at a different point in time. &lt;strong&gt;At design time,&lt;/strong&gt; the business analyst encodes domain context into the server's tools, descriptions, and schemas — knowledge that neither the LLM nor the server possesses on its own. &lt;strong&gt;At runtime,&lt;/strong&gt; the business user brings the specific intent behind a specific request — the thing the analyst could not predict in advance. The LLM translates that intent into a tool call, interpreting ambiguity that the server can't. The server executes with a precision the LLM can't match. No single corner can carry the system; remove any one, and task completion collapses.&lt;/p&gt;

&lt;p&gt;This has a practical consequence that trips up most teams: the same API should produce different MCP servers for different business users.&lt;/p&gt;

&lt;p&gt;Consider the London Transit API. A daily commuter wants trip planning: "fastest route from Paddington to Canary Wharf, avoiding the Jubilee line." An event organizer wants logistics: "How many bus routes serve Wembley Stadium, and what's the last departure after a 10 PM concert?" A municipal planner wants a construction impact analysis: "If we close three stations on the Northern line for six weeks, which bus routes need capacity increases?"&lt;/p&gt;

&lt;p&gt;Same API. Three completely different MCP servers. Three different sets of tools, with different names, different descriptions, and different response shapes, because the business analyst for each server shares a domain with their business users and knows how those users actually frame their requests.&lt;/p&gt;

&lt;p&gt;Here's the key insight: when you ask an LLM to auto-wrap an API, it lacks this domain context. It can't know which 20% matters because it doesn't know who the user is. Auto-generated MCP servers produce generic tool sets that serve no one well. The business analyst's judgment — encoded in tool selection, naming, and descriptions — is what makes an MCP server effective, and that judgment exists only because the analyst understands the &lt;em&gt;business&lt;/em&gt;, not just the API.&lt;/p&gt;

&lt;p&gt;How do you know your square is balanced? Measure task completion across the specific requests your business users actually make, not only the three test cases you tried during development. If completion is low, one corner is weak:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The LLM can't understand your tools&lt;/strong&gt; → fix names, descriptions, and schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The server can't handle the requests&lt;/strong&gt; → add or redesign tools, or move orchestration server-side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The business analyst chose the wrong tools to expose&lt;/strong&gt; → the design doesn't match what users actually ask for; re-observe the real workflows and re-prioritize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The business user's vocabulary doesn't match the server's&lt;/strong&gt; → the analyst built for a different persona, or the shared-domain assumption was wrong, and a technical team ended up making design calls they weren't qualified to make.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tool Anatomy: What Makes an MCP Tool
&lt;/h2&gt;

&lt;p&gt;An MCP tool has six components: a name, a description, an input schema, an output schema, a handler, and error handling. Each one is a communication channel to the LLM, and each one matters.&lt;/p&gt;

&lt;p&gt;Here's a complete tool written in Rust that uses the PMCP SDK. Don't worry if you're not fluent in Rust, as the comments walk through every important line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// -- Dependencies --&lt;/span&gt;
&lt;span class="c1"&gt;// pmcp: the PMCP SDK for building MCP servers&lt;/span&gt;
&lt;span class="c1"&gt;// serde: serialization/deserialization (parses JSON input, formats JSON output)&lt;/span&gt;
&lt;span class="c1"&gt;// schemars: generates JSON Schema from Rust types (so the LLM knows what to send)&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;typed_tool&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TypedToolWithOutput&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;RequestHandlerExtra&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;schemars&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;JsonSchema&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// -- Input Schema --&lt;/span&gt;
&lt;span class="c1"&gt;// This struct defines what the LLM must send. Each field becomes a property&lt;/span&gt;
&lt;span class="c1"&gt;// in the JSON Schema that the LLM sees when it discovers this tool.&lt;/span&gt;
&lt;span class="c1"&gt;// The doc comments (///) become the schema descriptions automatically.&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// Annotations on each field define constraints that flow into the&lt;/span&gt;
&lt;span class="c1"&gt;// JSON Schema. The LLM sees "maxLength": 16 on the SKU field and&lt;/span&gt;
&lt;span class="c1"&gt;// "minimum": 1 on quantity BEFORE it calls the tool. A well-behaved&lt;/span&gt;
&lt;span class="c1"&gt;// client respects these; the server enforces them at runtime too.&lt;/span&gt;
&lt;span class="c1"&gt;// deny_unknown_fields rejects any extra fields the LLM might add.&lt;/span&gt;
&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;JsonSchema)]&lt;/span&gt;
&lt;span class="nd"&gt;#[schemars(deny_unknown_fields)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;CheckInventoryInput&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7")&lt;/span&gt;
    &lt;span class="nd"&gt;#[schemars(length(max&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Number of items needed. Defaults to 1 if not specified.&lt;/span&gt;
    &lt;span class="cd"&gt;/// Use this to check whether a specific quantity is available&lt;/span&gt;
    &lt;span class="cd"&gt;/// before quoting delivery dates.&lt;/span&gt;
    &lt;span class="nd"&gt;#[serde(default&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"default_quantity"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
    &lt;span class="nd"&gt;#[schemars(range(min&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nd"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;max&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;quantity_needed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Default value: if the LLM doesn't specify a quantity, assume 1&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;default_quantity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// -- Output Schema --&lt;/span&gt;
&lt;span class="c1"&gt;// Defining the output shape serves two purposes:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. The LLM knows exactly what fields to expect in the response&lt;/span&gt;
&lt;span class="c1"&gt;// 2. Downstream tools or MCP Apps can rely on this structure&lt;/span&gt;
&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Serialize,&lt;/span&gt; &lt;span class="nd"&gt;JsonSchema)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InventoryResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// The product SKU that was checked&lt;/span&gt;
    &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Whether the requested quantity is currently in stock&lt;/span&gt;
    &lt;span class="n"&gt;in_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Total quantity available in warehouse&lt;/span&gt;
    &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Whether the requested quantity can be fulfilled&lt;/span&gt;
    &lt;span class="n"&gt;sufficient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// -- Register the tool with the server --&lt;/span&gt;
&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;anyhow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;ServerBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inventory-server"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// Register the tool: name, handler with typed input and output,&lt;/span&gt;
        &lt;span class="c1"&gt;// plus a description that tells the LLM WHAT it does, WHEN to&lt;/span&gt;
        &lt;span class="c1"&gt;// use it, and what it RETURNS.&lt;/span&gt;
        &lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"check_inventory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nn"&gt;TypedToolWithOutput&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"check_inventory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CheckInventoryInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestHandlerExtra&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;pin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="c1"&gt;// In production, this queries your inventory database.&lt;/span&gt;
                        &lt;span class="c1"&gt;// Here we return a mock response for clarity.&lt;/span&gt;
                        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;847_u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InventoryResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="py"&gt;.sku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;in_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;sufficient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="py"&gt;.quantity_needed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="p"&gt;})&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"Check inventory levels for a product by SKU. Returns stock &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 status, available quantity, and whether the requested amount &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 can be fulfilled. Use this before quoting delivery dates &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
                 to customers."&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Start the server over Streamable HTTP -- the production transport.&lt;/span&gt;
    &lt;span class="c1"&gt;// This makes your server accessible to any MCP client over the network:&lt;/span&gt;
    &lt;span class="c1"&gt;// Claude Desktop, ChatGPT, custom agents, or browser-based tools.&lt;/span&gt;
    &lt;span class="c1"&gt;// Unlike stdio (which requires local installation), HTTP lets&lt;/span&gt;
    &lt;span class="c1"&gt;// non-technical users connect without touching a terminal.&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="nf"&gt;.run_streamable_http&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0:3000"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While we're using Rust and the PMCP SDK throughout this series, the design principles -- typed schemas, descriptive names, and structured output -- apply to any MCP-compliant server, whether it's written in TypeScript, Python, or anything else that speaks the protocol. These are protocol-level concerns, not language-level ones.&lt;/p&gt;

&lt;p&gt;Let's walk through each component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt; (&lt;code&gt;"check_inventory"&lt;/code&gt;): The name follows a &lt;code&gt;verb_noun&lt;/code&gt; pattern. It's unambiguous, and the LLM won't mistake it for a tool for updating inventory or listing products. Avoid generic names like &lt;code&gt;get_data&lt;/code&gt; or &lt;code&gt;process_request&lt;/code&gt;. The name is the LLM's first signal about what a tool does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;: This is the LLM's primary decision surface. Notice it does three things: it says what the tool does ("check inventory levels"), what it returns ("stock status, available quantity, and whether the requested amount can be fulfilled"), and when to use it ("before quoting delivery dates to customers"). The middle part helps the LLM judge whether the tool can answer the user's request. The last part is critical: it tells the LLM about the workflow context, which the description's author — the business analyst — knows but the LLM doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input schema&lt;/strong&gt;: The &lt;code&gt;CheckInventoryInput&lt;/code&gt; struct defines what the LLM must send. Each field has a type (the LLM can't accidentally pass a string where a number is expected), a doc comment that becomes the JSON Schema description (the LLM sees "Product SKU to look up" when it discovers the tool), and optional defaults (&lt;code&gt;quantity_needed&lt;/code&gt; defaults to 1 if omitted). The &lt;code&gt;#[schemars(...)]&lt;/code&gt; annotations are the single source of truth for constraints: &lt;code&gt;length(max = 16)&lt;/code&gt; on the SKU field generates &lt;code&gt;"maxLength": 16&lt;/code&gt; in the JSON Schema, and &lt;code&gt;range(min = 1, max = 10000)&lt;/code&gt; on quantity generates &lt;code&gt;"minimum": 1, "maximum": 10000&lt;/code&gt;. The LLM sees these rules when it discovers the tool, before it ever makes a call. And &lt;code&gt;#[schemars(deny_unknown_fields)]&lt;/code&gt; on the struct means the LLM can't sneak in extra fields, as anything outside &lt;code&gt;sku&lt;/code&gt; and &lt;code&gt;quantity_needed&lt;/code&gt; is rejected.&lt;/p&gt;
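
&lt;p&gt;You don't have to take the generated schema on faith. A quick check -- assuming &lt;code&gt;serde_json&lt;/code&gt; is in your dependencies -- prints exactly what the LLM receives at discovery time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Materialize the JSON Schema that schemars derives from the struct.
// The #[schemars(...)] constraints show up as standard schema keywords.
let schema = schemars::schema_for!(CheckInventoryInput);
println!("{}", serde_json::to_string_pretty(&amp;amp;schema).expect("schema serializes"));
// Expect, among other properties:
//   "sku":             { "type": "string", "maxLength": 16, ... }
//   "quantity_needed": { "type": "integer", "minimum": 1, "maximum": 10000, ... }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
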

&lt;p&gt;&lt;strong&gt;Output schema&lt;/strong&gt;: The &lt;code&gt;InventoryResult&lt;/code&gt; struct defines what the tool returns. This is optional in the MCP spec, but we strongly recommend it. A defined output schema serves two purposes: the LLM knows exactly what fields to expect (it won't hallucinate response fields that don't exist), and downstream consumers, whether another tool in a chain or an MCP App rendering a UI widget, can rely on the structure. The &lt;code&gt;sufficient&lt;/code&gt; field is a good example: it performs the comparison server-side rather than asking the LLM to compare &lt;code&gt;available&lt;/code&gt; against &lt;code&gt;quantity_needed&lt;/code&gt;, which risks getting it wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handler&lt;/strong&gt;: The async closure that does the actual work. In this example, it returns a mock response for clarity. In production, this would query your inventory database, call a warehouse API, or perform whatever computation the tool promises. Notice that the handler receives a typed &lt;code&gt;CheckInventoryInput&lt;/code&gt; and not raw JSON. The parsing already happened. Your handler code focuses on business logic, not input validation. This is the server's contribution to the Capability Square: reliable, deterministic execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;: Notice that constraints are declared once, on the struct fields, using &lt;code&gt;#[schemars(...)]&lt;/code&gt; annotations. The same annotation serves two purposes: it generates the JSON Schema that the LLM reads at discovery time, and it defines the contract the server enforces at runtime. There's no duplication between schema and validation logic: the struct is the single source of truth.&lt;/p&gt;

&lt;p&gt;Security in MCP servers operates in layers, and schema constraints are among the easiest to add. First, serde enforces type safety: &lt;code&gt;sku&lt;/code&gt; must be a string, &lt;code&gt;quantity_needed&lt;/code&gt; must be an unsigned integer, and type-level attacks are blocked at deserialization before your code runs. Second, &lt;code&gt;#[schemars(length(max = 16))]&lt;/code&gt; constrains input shape: it won't prevent SQL injection on its own (that's the job of parameterized queries and safe query construction in your database layer), but it does reject obviously malformed or abusive input early, before it reaches any downstream system. Real SKUs are short; a 200-character string is either a mistake or a probe, and there's no reason to let it through. Third, &lt;code&gt;deny_unknown_fields&lt;/code&gt; prevents unexpected fields from slipping past the schema entirely. Each layer is simple, but together they significantly reduce the attack surface. The deeper security story, such as parameterized queries, OAuth 2.1, Rust's memory safety guarantees, and the OWASP MCP threat model, gets its own article later in this series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error handling&lt;/strong&gt;: If the LLM sends input that doesn't match &lt;code&gt;CheckInventoryInput&lt;/code&gt;, such as passing &lt;code&gt;"sku": 42&lt;/code&gt; instead of &lt;code&gt;"sku": "WIDGET-42"&lt;/code&gt;, serde produces an error message explaining the type mismatch. If the SKU exceeds 16 characters, the schema constraint rejects it before the handler runs. For business logic errors inside the handler, use &lt;code&gt;pmcp::Error::validation()&lt;/code&gt; with actionable messages following a three-part template: what went wrong, what was expected, and an example of correct input. Good error messages suggest one or two specific fixes, since multiple options force the LLM to guess, and guessing wastes tokens and user patience.&lt;/p&gt;
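
&lt;p&gt;Here's the three-part template in practice. This is a sketch: &lt;code&gt;known_sku&lt;/code&gt; is a hypothetical lookup helper, and we assume &lt;code&gt;pmcp::Error::validation()&lt;/code&gt; accepts a message string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Inside the handler, after deserialization succeeds but a business
// rule fails. Three parts: what went wrong, what was expected, and an
// example of correct input -- one clear fix, not a menu of guesses.
if !known_sku(&amp;amp;input.sku) {
    return Err(pmcp::Error::validation(format!(
        "Unknown SKU '{}'. Expected an active product SKU in \
         UPPERCASE-NUMBER form. Example: \"WIDGET-42\".",
        input.sku
    )));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
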

&lt;p&gt;Notice this isn't a local development tool. This is a server designed for a specific business user who needs to quote delivery dates. The business analyst decided that inventory checks matter for their users, and encoded that context into the description and the output shape. The &lt;code&gt;sufficient&lt;/code&gt; field exists because the business analyst knows that customers ask "do you have enough?" not "how many do you have?" A different business analyst building for a warehouse manager might expose entirely different tools from the same inventory system.&lt;/p&gt;

&lt;p&gt;Can an LLM discover this tool and call it correctly on the first try? If not, your name or description needs work. That's the simplest measurement of tool design quality, and it's one you can test in five minutes with any MCP client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcomes, Not Operations
&lt;/h2&gt;

&lt;p&gt;The business analyst in the Capability Square knows something the LLM never will: what outcome the business user actually wants. When a customer asks "where's my order?", they don't want a customer ID, then a list of order IDs, then a status lookup. They want a tracking link and an ETA. The difference between those two experiences is the difference between operation-oriented and outcome-oriented tool design.&lt;/p&gt;

&lt;p&gt;Here's the anti-pattern. A team with a REST background wraps their existing endpoints as MCP tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get_customer_by_email(email)&lt;/code&gt; returns a &lt;code&gt;customer_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;list_customer_orders(customer_id)&lt;/code&gt; returns an array of &lt;code&gt;order_id&lt;/code&gt; values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_order_status(order_id)&lt;/code&gt; returns a status string&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To answer "where's my order?", the LLM must chain all three calls in the correct sequence. The costs compound at every step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More tokens.&lt;/strong&gt; The LLM processes the full response from each tool call and generates the next call. Three round-trips mean three times as many input and output tokens, which is a cost the user incurs without getting any additional value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More latency.&lt;/strong&gt; Each step requires a network round-trip to the MCP server plus LLM processing time to interpret the result and formulate the next call. What could be a sub-second single call becomes a multi-second chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growing risk of misstep.&lt;/strong&gt; The probability of a correct sequence is the product of each step's success rate. If each tool call has a 95% chance of correct execution, three chained calls drop to 85.7%. At five steps, you're at 77.4%. The LLM must remember variable names and values from earlier calls, handle edge cases at each step, and maintain coherence across the full chain. Each step is another opportunity for the model to hallucinate a parameter, misinterpret a response, or lose track of its plan. The short sketch after this list verifies the compounding arithmetic.&lt;/li&gt;
&lt;/ul&gt;
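
&lt;p&gt;The compounding is worth checking once for yourself; a few lines suffice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Chain reliability compounds multiplicatively: p_chain = p_step ^ n.
fn main() {
    let p_step: f64 = 0.95;
    for n in [1, 3, 5] {
        println!("{n} step(s): {:.1}%", p_step.powi(n) * 100.0);
    }
    // Prints: 1 step(s): 95.0%, 3 step(s): 85.7%, 5 step(s): 77.4%
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
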

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Operation-Oriented (REST style)&lt;/th&gt;
&lt;th&gt;Outcome-Oriented (MCP style)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (1 per endpoint)&lt;/td&gt;
&lt;td&gt;Low (1 per user goal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Effort&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (choreographing multi-step chains)&lt;/td&gt;
&lt;td&gt;Low (single-shot invocation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (processing every intermediate result)&lt;/td&gt;
&lt;td&gt;Low (one request, one response)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (N round trips + N LLM inferences)&lt;/td&gt;
&lt;td&gt;Low (single round trip)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (3+ compounding points of failure)&lt;/td&gt;
&lt;td&gt;High (deterministic server logic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now consider the outcome-oriented alternative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// -- Input: just the customer's email --&lt;/span&gt;
&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;JsonSchema)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;TrackOrderInput&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Customer email address (e.g., "alice@company.com")&lt;/span&gt;
    &lt;span class="nd"&gt;#[schemars(length(max&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// -- Status enum: the LLM sees valid values in the schema --&lt;/span&gt;
&lt;span class="c1"&gt;// Instead of a free-form string, an enum lets the LLM &lt;/span&gt;
&lt;span class="c1"&gt;// "recognize" valid statuses rather than "recall" them from &lt;/span&gt;
&lt;span class="c1"&gt;// memory.&lt;/span&gt;
&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Serialize,&lt;/span&gt; &lt;span class="nd"&gt;JsonSchema)]&lt;/span&gt;
&lt;span class="nd"&gt;#[serde(rename_all&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"snake_case"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;OrderStatus&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Processing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Shipped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;InTransit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Delivered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// -- Output: everything the LLM needs to answer the question --&lt;/span&gt;
&lt;span class="nd"&gt;#[derive(Debug,&lt;/span&gt; &lt;span class="nd"&gt;Serialize,&lt;/span&gt; &lt;span class="nd"&gt;JsonSchema)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;OrderTrackingResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Customer name for the greeting&lt;/span&gt;
    &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Order identifier&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Current order status&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OrderStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Shipping carrier name&lt;/span&gt;
    &lt;span class="n"&gt;carrier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Estimated delivery date (ISO 8601)&lt;/span&gt;
    &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="cd"&gt;/// Direct tracking URL the customer can click&lt;/span&gt;
    &lt;span class="n"&gt;tracking_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool registration follows the same pattern as &lt;code&gt;check_inventory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nf"&gt;.tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"track_latest_order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nn"&gt;TypedToolWithOutput&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"track_latest_order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TrackOrderInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestHandlerExtra&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;pin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// Internally: resolve customer, find latest order, get status.&lt;/span&gt;
                &lt;span class="c1"&gt;// The server handles the entire chain -- three API calls&lt;/span&gt;
                &lt;span class="c1"&gt;// collapsed into one deterministic operation.&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderTrackingResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Alice Chen"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ORD-8834"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;OrderStatus&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;InTransit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;carrier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"FedEx"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;eta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2026-03-20"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;tracking_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://fedex.com/track/ABC123"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"Track the most recent order for a customer using their email. &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
         Returns order status, carrier info, and tracking link. Use this &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
         when a customer asks 'where is my order?' or 'when will it arrive?'"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One tool. One user outcome. The output struct gives the LLM a rich, typed response, with customer name, status, carrier, ETA, and a clickable tracking URL, which is everything it needs to answer the question in a single turn. The server handles the chaining internally (resolve customer, find latest order, fetch status) because that's what servers are good at: deterministic, multi-step computation. In a production environment, your server handles requests from business users who don't know MCP exists and don't care about your API structure. They just want answers. In the Capability Square, symbolic computation and data access are the server's strengths. Let the server do the work it's built for, and let the LLM do what it's built for: understanding the user's intent and presenting a clear answer.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical pattern. &lt;a href="https://engineering.block.xyz/blog/blocks-playbook-for-designing-mcp-servers" rel="noopener noreferrer"&gt;Block built 60+ production MCP servers&lt;/a&gt;. Their Linear integration started with 30+ tools that mirrored GraphQL endpoints, with one tool per query and one per mutation. After three iterations, they were down to 2 tools. The tool count dropped because the team learned to design for outcomes. Each iteration moved complexity from the LLM (which had to choreograph multi-tool sequences) into the server (which could handle the orchestration deterministically).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement point:&lt;/strong&gt; Test this yourself. Give 10 users the same task ("find my latest order status"). With the 3-tool REST mapping, measure how many succeed on the first try. Now try the single outcome-oriented tool. The difference in task completion rates is your design-quality signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Less Is More: The Evidence for Tool Reduction
&lt;/h2&gt;

&lt;p&gt;Outcome-oriented design naturally reduces tool count. But how much does reduction actually matter? The research is unambiguous.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.blog/ai-and-ml/github-copilot/how-were-making-github-copilot-smarter-with-fewer-tools/" rel="noopener noreferrer"&gt;GitHub reduced their Copilot MCP integration&lt;/a&gt; from 40 built-in tools to 13 core tools. The result: 2 to 5 percentage point improvement across SWE-Lancer and SWEbench-Verified benchmarks, plus a 400ms latency reduction. Fewer tools meant the model spent less time on tool selection and more time on the actual task. The gains came not from adding capability, but from removing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.speakeasy.com/mcp/tool-design/less-is-more" rel="noopener noreferrer"&gt;The Speakeasy team ran a controlled experiment&lt;/a&gt; using a Pet Store API. At 107 tools, both large and small models failed completely, and task success collapsed. At 20 tools, large models scored 19 out of 20 correct. At 10 tools, performance was perfect. The failure wasn't gradual. It was a cliff: past a threshold, models don't degrade gracefully. They fall off.&lt;/p&gt;

&lt;p&gt;Why does success collapse rather than degrade? Two mechanisms compound. First, &lt;strong&gt;context window bloat&lt;/strong&gt;: every tool name, description, and parameter schema consumes tokens on every request. At 50+ tools, this can eat 5 to 7 percent of the model's context before a single user message arrives, thus crowding out conversation history, document content, and reasoning space. Second, and more insidious, is &lt;strong&gt;tool hallucination&lt;/strong&gt;: when the LLM's attention is spread across too many similar-sounding tools, it starts inventing nonexistent tool names, conflating parameters between tools, or calling the right tool while using arguments from a different tool's schema. This is the same "instruction following degradation" that causes LLMs to drift off-task in long prompts, except here, each hallucinated tool call is a hard failure, not a soft one. The model doesn't produce a slightly wrong answer. It produces no answer at all.&lt;/p&gt;

&lt;p&gt;In UX terms, this is information overload. Just as a human can't choose from a menu of 100 items without decision fatigue, an LLM's attention fragments across too many similar-sounding options. The threshold varies by model size. Small models (8B parameters) hit their sweet spot around 19 tools and fail at 46. Even the largest models struggle past 100 tools.&lt;/p&gt;
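
&lt;p&gt;The arithmetic behind the context-bloat mechanism is easy to check for your own server. A back-of-the-envelope sketch (the 130-token-per-tool average and the 128k window are illustrative assumptions, not measurements):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Rough share of the context window consumed by a tool catalog.
/// Real schemas vary widely in size; measure yours before trusting this.
fn catalog_overhead_pct(num_tools: usize, avg_tokens_per_tool: usize, context_window: usize) -&amp;gt; f64 {
    (num_tools * avg_tokens_per_tool) as f64 / context_window as f64 * 100.0
}

fn main() {
    // 50 tools x ~130 tokens each on a 128k window: ~5.1% of the
    // context is gone before the first user message arrives.
    println!("{:.1}%", catalog_overhead_pct(50, 130, 128_000));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
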

&lt;p&gt;As Hugging Face's Phil Schmid &lt;a href="https://www.philschmid.de/mcp-best-practices" rel="noopener noreferrer"&gt;puts it&lt;/a&gt;: "Curate ruthlessly. 5 to 15 tools per server. One server, one job."&lt;/p&gt;

&lt;p&gt;This raises an obvious question: if you expose only 10 to 15 tools, aren't you leaving functionality on the table? Yes, deliberately, and it's the right choice. We'll see why shortly, when we look at how much of an API your users actually need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement point:&lt;/strong&gt; Count your tools. If you have more than 15 per server, you're likely past the diminishing returns threshold. Benchmark your task completion rate before and after pruning, and the numbers will make the case for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 97% Problem: Tool Description Quality
&lt;/h2&gt;

&lt;p&gt;You can have the right number of tools, designed for the right outcomes, and still fail. &lt;a href="https://arxiv.org/html/2602.14878v1" rel="noopener noreferrer"&gt;A 2025 study analyzing MCP tool descriptions&lt;/a&gt; across the ecosystem found that 97.1% contain at least one quality issue. More than half (56%) have unclear purpose statements. Your tools might be well-designed, but if the LLM can't understand when to use them, that design is invisible.&lt;/p&gt;

&lt;p&gt;Tool descriptions are not documentation. They are the LLM's primary decision surface. When the LLM sees 15 tools and must choose one, the description is the only signal it has. A vague description is like a restaurant menu that says "food" for every dish, which is technically accurate, but practically useless.&lt;/p&gt;

&lt;p&gt;The research identified six components of a quality tool description: &lt;strong&gt;Purpose&lt;/strong&gt; (what the tool does), &lt;strong&gt;Guidelines&lt;/strong&gt; (when and how to use it), &lt;strong&gt;Limitations&lt;/strong&gt; (what it cannot do or when to use something else), &lt;strong&gt;Parameter Explanation&lt;/strong&gt; (input format and constraints), &lt;strong&gt;Length&lt;/strong&gt; (enough detail without overwhelming), and &lt;strong&gt;Examples&lt;/strong&gt; (concrete usage scenarios). Most descriptions fail on multiple components simultaneously.&lt;/p&gt;

&lt;p&gt;Here's what the improvement looks like in practice. Consider a flight search tool across three levels of description quality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// LEVEL 1 -- Vague (56% of MCP tools have this problem)&lt;/span&gt;
&lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Search for flights"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// LEVEL 2 -- Better purpose, but missing guidelines and limitations&lt;/span&gt;
&lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Search for available flights between two airports on a given date"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// LEVEL 3 -- Full rubric: purpose + guidelines + limitations&lt;/span&gt;
&lt;span class="nf"&gt;.with_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"Search for available flights between two airports on a specific date. &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     Returns up to 20 results sorted by price. Use 3-letter IATA airport &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     codes (e.g., 'LAX', 'JFK'). Only searches economy class. For business &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     or first class, use the premium_flight_search tool. Dates must be &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     within the next 330 days."&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Level 1 tells the LLM nothing about parameters, constraints, or when to use an alternative tool. The LLM has to guess at everything: the input format, the result shape, and the scope. Level 2 adds purpose: the LLM knows it needs two airports and a date, but it doesn't know about airport code formats, result limits, or class restrictions. It might pass "Los Angeles" instead of "LAX", or ask for business-class flights and get the wrong results. Level 3 gives the LLM everything it needs to (a) decide to use this tool, (b) provide correct inputs, and (c) know when NOT to use it, that last point being critical for multi-tool servers where the LLM must choose between similar options.&lt;/p&gt;

&lt;p&gt;In the same study, augmented descriptions improved task success by 5.85 percentage points in controlled testing. That may sound modest, but at scale it's the difference between a tool that works most of the time and one that works almost all of the time. For a customer-facing agent handling thousands of requests per day, those percentage points represent real users getting real answers.&lt;/p&gt;

&lt;p&gt;Description quality extends to error messages. When a tool receives invalid input, the error message is the LLM's only guide for recovery. Compare these two approaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BAD: LLM tries random fixes&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Invalid input"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// GOOD: Problem + expectation + example&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;pmcp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"Invalid date format for 'departure': '15/04/2026'. &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     Use ISO 8601 format (YYYY-MM-DD). &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     Example: '2026-04-15'"&lt;/span&gt;
&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first error forces the LLM to guess. It might try a different date format, or remove the date entirely, or change a different parameter. Each wrong guess wastes a round trip and user patience. The second error follows a three-part template: what went wrong ("invalid date format for 'departure'"), what was expected ("ISO 8601 format"), and an example of correct input ("2026-04-15"). Suggest one or two fixes maximum. Multiple options force the LLM to guess, and guessing is what we're trying to eliminate.&lt;/p&gt;
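
&lt;p&gt;If several tools share this template, it can be worth centralizing it. A minimal sketch (a hypothetical helper, not a PMCP SDK API; it assumes, as the snippet above suggests, that &lt;code&gt;validation&lt;/code&gt; accepts any string-like argument):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Hypothetical helper enforcing the three-part template:
/// what went wrong, what was expected, one example of correct input.
fn three_part_error(problem: String, expected: String, example: String) -&amp;gt; pmcp::Error {
    pmcp::Error::validation(format!(
        "{problem}. Expected: {expected}. Example: '{example}'"
    ))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
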

&lt;p&gt;This is where the typed struct pattern we saw in the Tool Anatomy section pays off. Remember how &lt;code&gt;CheckInventoryInput&lt;/code&gt; used doc comments on each field to generate JSON Schema descriptions? The same pattern applies to every tool. When the business analyst writes &lt;code&gt;/// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7")&lt;/code&gt; on a struct field, that text becomes the LLM's guide for formatting its input. The type system enforces correctness at parse time, before the handler code ever runs. And the output schema tells the LLM exactly what fields to expect, so it won't hallucinate response fields that don't exist.&lt;/p&gt;
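
&lt;p&gt;As a refresher on the shape of that pattern, a minimal sketch (assuming the &lt;code&gt;schemars&lt;/code&gt;-style derive commonly used for typed tool inputs in Rust; the &lt;code&gt;warehouse&lt;/code&gt; field is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use schemars::JsonSchema;
use serde::Deserialize;

/// Input for the check_inventory tool. The derive turns each doc
/// comment into that field's JSON Schema "description", so the text
/// the analyst writes here is exactly what the LLM reads at runtime.
#[derive(Deserialize, JsonSchema)]
struct CheckInventoryInput {
    /// Product SKU to look up (e.g., "WIDGET-42", "BOLT-7")
    sku: String,
    /// Warehouse code to search first (e.g., "US-EAST-1")
    warehouse: String,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
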

&lt;p&gt;This connects back to the Capability Square. The &lt;strong&gt;business analyst&lt;/strong&gt; writes these descriptions at design time, capturing what the &lt;strong&gt;business user&lt;/strong&gt; will eventually ask for—in their own words. The &lt;strong&gt;LLM&lt;/strong&gt; reads the descriptions at runtime and translates the user's phrasing into a tool call. The &lt;strong&gt;server&lt;/strong&gt; validates the call against the same schema the analyst authored. All four corners aligned, with the analyst's design-time knowledge guiding the LLM's runtime decisions and the server's runtime enforcement, in service of a business user who never has to see any of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement point:&lt;/strong&gt; Test your description quality by presenting your tool list to an LLM and asking it to select the right tool for 10 different user requests. If tool selection accuracy is below 90%, your descriptions need work. This test takes five minutes and tells you more about your MCP server's real-world effectiveness than any benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full-API Trap (And the Pareto Escape)
&lt;/h2&gt;

&lt;p&gt;We said that exposing only 10 to 15 tools means leaving functionality on the table. Now let's talk about why that's the right call.&lt;/p&gt;

&lt;p&gt;The most tempting mistake in MCP server design is wrapping your entire API. You have 200 endpoints, so you generate 200 tools. The OpenAPI-to-MCP converter makes it easy. The result is a server that does everything and succeeds at nothing. The LLM sees 200 tool descriptions, burns through context window space parsing them, and still picks the wrong one, because 200 options is not a menu, it's a phone book.&lt;/p&gt;

&lt;p&gt;The deeper problem is &lt;strong&gt;semantic noise&lt;/strong&gt;. When you auto-wrap an API, you inject your backend's implementation details into the LLM's reasoning space. The LLM shouldn't have to understand your database normalization, your internal microservice boundaries, or your pagination cursor format. It should see tools that map cleanly to user intent. Auto-wrapping exposes tools like &lt;code&gt;get_customer_by_internal_id&lt;/code&gt; and &lt;code&gt;list_orders_with_cursor_pagination&lt;/code&gt;, which exist because of how your backend is built, not because of what your users need. Every implementation-detail tool is noise that the LLM must parse, evaluate, and reject before it can find the tool that actually answers the user's question.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pareto Escape
&lt;/h3&gt;

&lt;p&gt;The way out is the 80/20 rule. In practice, roughly 20% of an API's capabilities serve 80% of user requests. The business analyst — one of the two human corners of the Capability Square — is the person who knows which 20%, because they share a domain with the business users they're designing for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0sdk3d1s00al7pikz8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0sdk3d1s00al7pikz8t.png" alt="Request Distribution: The 80/20 Rule" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left side of the curve is where your MCP tools live: the high-frequency request types that your users ask for every day. These are the 10 to 15 outcome-oriented tools you design carefully, with typed schemas, quality descriptions, and validation constraints. They handle the bulk of traffic reliably and fast.&lt;/p&gt;

&lt;p&gt;The right side is the long tail: rare, unpredictable requests that don't justify a dedicated tool. Creating tools for every edge case pushes you back into the 50+ tool zone where LLM performance collapses. The Pareto line is where you stop adding tools and start thinking about a different mechanism for everything to its right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://engineering.block.xyz/blog/blocks-playbook-for-designing-mcp-servers" rel="noopener noreferrer"&gt;Block has built over 60 production MCP servers&lt;/a&gt;. Their consistent finding: the generate-then-prune workflow is standard practice. Generate from your API spec, then ruthlessly cut. Most teams end up keeping 10 to 15 percent of what they started with. The tools that survive are those that map to actual user outcomes, and identifying those outcomes requires a business analyst who shares a domain with business users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement point:&lt;/strong&gt; Provide your MCP server to 20 business users who represent your target persona. Track task completion rates across their actual requests, not your limited test cases. If completion is below 80%, you're either exposing too many tools (confusion) or the wrong tools (coverage gap). The Capability Square tells you which: if the LLM selects wrong tools, fix descriptions. If the right tool doesn't exist, the business analyst chose the wrong 20%, so re-observe the users. If the tool exists but returns unhelpful results, fix the server's implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  For the Other 80%: A Preview of Code Mode
&lt;/h2&gt;

&lt;p&gt;So you've curated your tools to the critical 20%. But users will inevitably ask for something outside that set. What then?&lt;/p&gt;

&lt;p&gt;This is where code mode enters. Instead of creating a tool for every possible request, you let the LLM write code that calls your API directly. &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team found&lt;/a&gt; that code execution reduced token usage from 150,000 to 2,000 tokens (98.7% reduction) in a Google Drive to Salesforce workflow. The LLM writes a targeted script, executes it, and returns just the result. No 200-tool context window bloat. No multi-step tool chaining. Just precise, one-shot computation.&lt;/p&gt;

&lt;p&gt;The playbook: design 10 to 15 outcome-oriented tools for the common 80% of requests. For the long tail, provide code mode access with appropriate guardrails. This gives you broad coverage without the tool count explosion that kills LLM performance. Your curated tools handle the predictable workflows fast. Code mode handles the unpredictable ones flexibly.&lt;/p&gt;

&lt;p&gt;We'll cover code mode in depth in a later article in this series: how to set it up, how to secure it, and when it's the right (and wrong) choice. For now, the key insight is that tool reduction isn't about limiting your users. It's about choosing the right mechanism for each type of request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Capability Square drives everything.&lt;/strong&gt; Good tool design requires balancing four parties: what the &lt;strong&gt;LLM&lt;/strong&gt; can do (interpret intent, select tools), what the &lt;strong&gt;server&lt;/strong&gt; should do (precise computation, data access), what the &lt;strong&gt;business analyst&lt;/strong&gt; knows at design time (which capabilities matter, how to describe them, which user persona they serve), and what the &lt;strong&gt;business user&lt;/strong&gt; brings at runtime (the actual request and its business context). When any corner is weak, task completion suffers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design for outcomes, not operations.&lt;/strong&gt; One tool per user goal, not one tool per API endpoint. The customer asking "where's my order?" wants a tracking link, not three chained API calls. Move orchestration complexity into the server, where it runs deterministically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Less is more.&lt;/strong&gt; Keep servers to 5 to 15 tools. Evidence from GitHub Copilot, Speakeasy, and Block consistently shows that performance degrades sharply past 20 tools. The failure is not gradual; it's a cliff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptions are the user interface.&lt;/strong&gt; Use the six-component rubric: Purpose, Guidelines, Limitations, Parameters, Length, and Examples. With 97% of tool descriptions containing quality issues, this is a big opportunity for immediate improvement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error messages are recovery instructions.&lt;/strong&gt; Use the three-part template: what went wrong, what was expected, and an example of correct input. Suggest one or two fixes, not five. Ambiguity in error messages wastes round trips and user patience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know your users — and design from inside their domain.&lt;/strong&gt; The business analyst corner of the Capability Square determines which 20% of the API capabilities to expose. The same API should produce different MCP servers for different business user personas. Auto-wrapping skips this judgment and produces servers that serve no one well. Handing the design to a purely technical team — engineers who don't live in the business domain — produces the same failure mode for the same reason: design calls made without domain fluency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure task completion across diverse business user requests.&lt;/strong&gt; Not your three test cases during development, but real requests from real business users representing your target persona. If completion is low, the Capability Square tells you which corner to fix.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Continue the Series
&lt;/h2&gt;

&lt;p&gt;This article covered the foundation: how to design MCP tools that LLMs can actually use. The rest of the series goes deeper.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Want to add user-controlled workflows?&lt;/strong&gt; Read our article on &lt;strong&gt;Prompts and Resources&lt;/strong&gt;, where we cover MCP's underutilized primitives for guided interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready to test your server?&lt;/strong&gt; See &lt;strong&gt;Testing MCP Servers&lt;/strong&gt; for unit testing, integration testing, and description quality validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concerned about security?&lt;/strong&gt; &lt;strong&gt;MCP Security&lt;/strong&gt; covers OAuth 2.1, input validation, and the common vulnerabilities that &lt;a href="https://astrix.security/research/the-state-of-mcp-security/" rel="noopener noreferrer"&gt;affect 43% of MCP servers&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building from an existing API spec?&lt;/strong&gt; &lt;strong&gt;Schema-Driven MCP Servers&lt;/strong&gt; shows the generate-then-prune workflow in detail, from OpenAPI spec to curated tool set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interested in code mode?&lt;/strong&gt; &lt;strong&gt;Code Mode for MCP&lt;/strong&gt; explores the long-tail strategy we previewed above: how to let the LLM write code safely against your API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For hands-on practice with these patterns, the &lt;a href="https://advanced-mcp-course.us-east.true-mcp.com/landing" rel="noopener noreferrer"&gt;Advanced MCP course&lt;/a&gt; provides guided exercises building production MCP servers in Rust with the PMCP SDK.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>rust</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building a Successful Modern Data Analytics Platform in the Cloud</title>
      <dc:creator>Guy</dc:creator>
      <pubDate>Sun, 20 Oct 2019 16:27:36 +0000</pubDate>
      <link>https://forem.com/guyernest/building-a-successful-modern-data-analytics-platform-in-the-cloud-5c06</link>
      <guid>https://forem.com/guyernest/building-a-successful-modern-data-analytics-platform-in-the-cloud-5c06</guid>
      <description>&lt;p&gt;I worked with dozens of companies migrating their legacy data warehouses or analytical databases to the cloud. I saw the difficulty to let go of the monolithic thinking and design and to benefit from the modern cloud architecture fully. In this article, I’ll share my pattern for a scalable, flexible, and cost-effective data analytics platform in the AWS cloud, which was successfully implemented in these companies.&lt;br&gt;
&lt;strong&gt;TL;DR, design the data platform with three layers, L1 with raw files data, L2 with optimized files data, and L3 with cache in mind. Ingest the data as it comes into L1, and transform each use-case independently into L2, and when a specific access pattern demands it, cache some of the data into a dedicated data store.&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13w2yvau3a9uhv7mk8lk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13w2yvau3a9uhv7mk8lk.png" alt="Data Size to Access Scale Balance" width="521" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 1: “One Data Store To Rule Them All”
&lt;/h2&gt;

&lt;p&gt;The main difficulty companies face when modernizing their existing data analytics platform is giving up the single database that anchored their legacy system. It is hard to let go of it after the massive investment in building and operating it. I have met companies that spent millions of dollars and hundreds of person-years of development to build their data warehouse and the many ETL processes, stored procedures, and reporting tools that are part of it. It is also hard to give up the benefits a single tool provides in terms of “a single neck to choke,” or answers to “where is the (analytical) data that I need?”.&lt;br&gt;
A few days ago, Amazon.com announced that they finally shut down the last Oracle database in their retail business. It was a long process that ran for more than four years. My first role as a solutions architect for Amazon.com was to help design the migration away from relational databases in general and Oracle specifically. I worked with dozens of teams across the business to re-design their systems from classical relational databases to newer, more scalable and flexible data stores. The goal was to shift to NoSQL (mainly DynamoDB) or analytical (Amazon Redshift was the main target then) databases. It was hard for the teams to give up the easy life of being able to query (or search, as they called it) on every column, to use a standard query language such as SQL for all their data needs, and, mainly, to keep using the tools they were familiar with. However, Amazon.com took the long-term perspective and decided to invest in building infrastructure that is (almost infinitely) scalable. They wanted to be able to grow their business without technical limitations.&lt;br&gt;
During these years, Amazon.com, famous for its “invent and simplify” principle, built many tools to make this migration easier. They also built, using AWS, a set of new databases that can serve as migration targets, mainly Amazon Aurora (an almost drop-in replacement for Oracle with its PostgreSQL flavor) and Amazon Athena, which we will discuss shortly.&lt;br&gt;
The limitations of a single tool are also apparent to many companies, in terms of flexibility, scale, cost, agility, and the other qualities of modern cloud architecture. However, breaking down or carving out a monolithic system is painful for most companies. Many therefore want to replace a non-scalable, expensive on-premises database, such as Oracle or MS-SQL, with a cloud service, such as Amazon Redshift (data warehouse service), Amazon Athena (managed Presto service), Azure Databricks (managed Spark service), or Google BigQuery, hoping that a single cloud service will replace the single monolithic on-prem database. Sadly, this often ends in disappointment, as the limitation lies in using a single tool, not only in where it runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 2: “Hadoop is dead, long live the new king — Spark.”
&lt;/h2&gt;

&lt;p&gt;Roughly every five years, a new technology comes along and changes the way we build a modern architecture. Ten years ago, it was Hadoop, which opened up scalable ways to handle large amounts of data with tools such as Hive, Pig, HBase, and others. Five years ago, it was Spark that changed the game, with much faster big data processing, better SQL than Hive, more modern languages (Scala and Python) than Hadoop’s Java, new streaming capabilities, and much more.&lt;br&gt;
Spark also enjoys mature tooling and popularity among big data developers. The combination of Spark SQL, Spark Streaming, and even machine learning with Spark MLlib is very appealing, and many companies have standardized their big data stack on Spark. However, the growing popularity of, and need for, data analytics and machine learning has exposed Spark’s limitations. As a Spark expert, I’m often asked to review and fix Spark code that has become too complex or too slow as it has grown. I also see many companies trying to build their machine learning on the Spark library, which Databricks develops and heavily promotes.&lt;br&gt;
My recommendation today is to write data transformation logic in SQL, based on PrestoDB. SQL has many advantages over Spark’s Scala or Python: it is more concise, fewer bugs can sneak into the code, and many more people can write the logic with it. The main objection I hear is the resistance of current developers, who are less comfortable with SQL than with the Scala or Python they use today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 3: “So, Presto is the new king.”
&lt;/h2&gt;

&lt;p&gt;The term modern cloud architecture refers to an architecture based on microservices, serverless, and pay-for-what-you-use (rather than pay-for-what-you-provision). The poster child of this modern architecture is AWS Lambda (or Azure Functions/Google Cloud Functions). You write the business logic, and the cloud manages the rest for you. No more application servers, no more starting and terminating servers or virtual machines, no more waiting for the yearly product release, and no more “only Java in production.” The future is thousands of functions, developed as needed and executed when needed, calling one another in a perfect mesh of business logic, scaled up and down just in time.&lt;br&gt;
Amazon Athena is the serverless option when it comes to data. The service currently runs a managed PrestoDB engine. The “currently” modifier is there to let Amazon upgrade to whatever is the best SQL engine over files in S3 at any given time. The evolution from Hive, Impala, and SparkSQL to Presto suggests that we will see an even better engine in the future. Amazon wants to avoid repeating the mistake they made in naming the EMR service (it used to stand for Elastic-Map-Reduce), which today runs distributed computing far more complex than Map-Reduce.&lt;br&gt;
In Amazon Athena, you write your business logic using SQL, and the query is sent to a fleet of workers sized to match the complexity of the data and the query. In most cases, within a few seconds, you have the query result in a CSV file in S3. No servers to manage, no waiting for servers to spin up, and no payment for idle machines. Real serverless.&lt;br&gt;
However, I often hear that Amazon Athena is too expensive, especially when running a lot of heavy analytical queries. I heard the same comments about AWS Lambda. The move from paying once for the resource (a Presto cluster, or an application server for business logic) to pay-for-what-you-use can feel scary and risky. The secret is knowing how to optimize your usage, which is usually much harder and less appealing when you manage your own resources.&lt;br&gt;
As the cost of Amazon Athena is based on the amount of data scanned by each query, every reduction in the size of the data reduces the cost of the query. The main mistake in using Athena is leaning on the fact that it can query huge volumes of raw data in formats like JSON or CSV, and relying on that ability for too many queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then, what do you recommend?
&lt;/h2&gt;

&lt;p&gt;Let’s summarize what we have learned so far. We shouldn’t rely on a single data store, as in time it will limit our ability to grow our data usage. We should be curious and learn, test, and adopt new tools and technologies once they mature past the “nice idea” stage. And we should take the long-term perspective when designing our technical systems, to allow unlimited business growth for our company.&lt;br&gt;
With this background and vision, we can better explain why we spend so much effort on the following data tiers, instead of merely dropping everything into a super database _____ (fill in your current favorite database).&lt;br&gt;
Now you are ready for my recommended recipe.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier I (L1) — Raw Data in low-cost storage (such as S3)
&lt;/h3&gt;

&lt;p&gt;All data should land in its raw form from every source, with little modification or filtering. The data can come from IoT devices, streaming sources such as Kafka or Kinesis, textual log files, JSON payloads from NoSQL or web service interactions, images, videos, textual comments, Excel or CSV files from partners, or anything else you might one day want to analyze and learn from.&lt;br&gt;
The data should NOT be organized nicely with foreign keys between tables, or harmonized to share a common format for addresses or product IDs. Harmony is not part of Tier I, and this is critical to keeping the system flexible enough to grow. Too many data projects fail because they take too long organizing all the data, without knowing which analyses of that data can deliver significant business value, and thus fail to earn further investment in the data analytics platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier II (L2) — Multiple optimized data derivatives still in low-cost storage
&lt;/h3&gt;

&lt;p&gt;The second tier is built gradually from the data coming into the first tier above. It starts as soon as the first file lands in the first tier, and it evolves as more and more data comes in. The evolution is directed by data availability and, mainly, by the business usage of the analysis output, such as machine learning predictions or analytical reports. Let’s briefly discuss each part of this tier’s description:&lt;br&gt;
&lt;strong&gt;Multiple&lt;/strong&gt; — every analytical use case should have its own dedicated and independent flow of data, even if it means that data is replicated dozens of times and calculated differently. Remember that every analysis looks at the data from a different angle and for a different business need, and is eventually developed by a different team. For example, sales analytics is different from marketing analytics or logistics analytics.&lt;br&gt;
&lt;strong&gt;Optimized&lt;/strong&gt; — transforming the data from its raw form toward an analytical insight opens tremendous optimization opportunities. An obvious one is taking JSON data and storing it in Parquet format, which is both columnar (a query on a single column only scans that column’s data) and compressed. In most such transformations, using Create Table As Select (CTAS) in Athena, you can get 1,000–10,000-fold cost improvements (see the sketch just after this list). The same goes for transcribing audio and video into textual captions, or analyzing images into classes, face sentiment, or face recognition. Running analytical queries on the face sentiment of your customers across different days or stores should be simple and cheap enough for business people to use.&lt;br&gt;
&lt;strong&gt;Data Derivatives&lt;/strong&gt; — the data in the second tier is mostly aggregated, filtered, or transformed from its original raw form to fit a specific business question. If I need to predict the daily sales of a brand, I don’t need to analyze every individual purchase of every unique product; I can look at daily, per-brand aggregates. We should not be afraid to make a derivative “too specific,” as we still have the raw data in Tier I, and we will have many other derivatives specific to other business use cases. Having the “same” data in different forms is not a problem, as it is not the same data, but a derivative of it.&lt;br&gt;
&lt;strong&gt;Still in low-cost storage&lt;/strong&gt; — if you want to keep dozens of “copies” of your company’s already-big data, each “copy” must be very low cost. I saw too many companies try to work only with raw data (“because we can”), or write too quickly into a database with expensive compute and memory capabilities, and miss this critical Tier II.&lt;/p&gt;
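
&lt;p&gt;To make the CTAS idea concrete, here is a minimal sketch that submits such a query through the aws-sdk-athena Rust crate. The builder calls, and all table, bucket, and column names, are illustrative assumptions rather than part of any specific project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use aws_sdk_athena::types::ResultConfiguration;

#[tokio::main]
async fn main() {
    // Build an Athena client from the ambient AWS credentials.
    let config = aws_config::load_from_env().await;
    let client = aws_sdk_athena::Client::new(&amp;amp;config);

    // CTAS: rewrite raw JSON into a compressed, columnar Parquet
    // derivative, pre-aggregated for one business question
    // (daily sales per brand).
    let ctas = "CREATE TABLE sales_daily_parquet \
                WITH (format = 'PARQUET', \
                      external_location = 's3://my-bucket/l2/sales_daily/') AS \
                SELECT brand, sale_date, SUM(amount) AS daily_total \
                FROM raw_sales_json \
                GROUP BY brand, sale_date";

    client
        .start_query_execution()
        .query_string(ctas)
        .result_configuration(
            ResultConfiguration::builder()
                .output_location("s3://my-bucket/athena-results/")
                .build(),
        )
        .send()
        .await
        .expect("failed to start the CTAS query");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
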

&lt;h3&gt;
  
  
  Tier III (L3) — Optional Cache Datastores
&lt;/h3&gt;

&lt;p&gt;To allow users to interact with the results of the data analytics, we often need to cache those results to make them usable for humans, in terms of speed and query capabilities.&lt;br&gt;
The most recommended cache options (and obviously there is more than one, as each fits different use cases) are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt; for GraphQL access from a client application,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElasticSearch&lt;/strong&gt; for textual queries,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; for fast operations on in-memory data sets, such as Sorted-Sets (see the sketch after this list), or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neptune&lt;/strong&gt; for graph queries (not to be confused with GraphQL).&lt;/li&gt;
&lt;/ul&gt;
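
&lt;p&gt;As one concrete example of the cache tier, here is a minimal sketch that loads a pre-computed Tier II derivative into Redis as a Sorted Set, using the redis Rust crate. The key and member names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use redis::Commands;

fn main() -&amp;gt; redis::RedisResult&amp;lt;()&amp;gt; {
    // Connect to a local Redis instance (address is illustrative).
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Cache a pre-computed L2 derivative (daily sales per brand) as a
    // Sorted Set, so a UI can ask for "top N brands" in microseconds.
    let _: i64 = con.zadd("sales:2019-10-20", "acme", 9250)?;
    let _: i64 = con.zadd("sales:2019-10-20", "globex", 7410)?;

    // Top brands, highest first. Because this is only a cache, it can
    // be deleted and recreated from Tier II at any time.
    let top: Vec&amp;lt;String&amp;gt; = con.zrevrange("sales:2019-10-20", 0, 9)?;
    println!("{:?}", top);
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
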

&lt;p&gt;It is also common to cache into a relational database (such as MySQL or Aurora PostgreSQL), which can be fine for relatively small data sets and for visualization or BI tools that only know how to work with such databases.&lt;br&gt;
As long as you treat it as a cache, understanding that it is much more expensive, that it should hold only the data your users’ actual use cases need, and that you can always recreate or delete it as needed, you will have the flexibility and cost-effectiveness required to build your analytical use cases within your organization. It takes time to transform a company to be “smarter” and use data more efficiently, and that time must be planned with agility, cost, scale, and simplicity in mind, all of which the above architecture provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operation: Orchestration and Monitoring
&lt;/h2&gt;

&lt;p&gt;Running such a multi-tier, multi-use-case, multi-line-of-business, multi-data-store (and other multipliers) platform is not something you can do manually with a single DBA or even a team of DBAs. You have to have automation in mind from the very beginning of the project.&lt;br&gt;
In many companies, DevOps practices have already started to evolve, and capabilities around microservices, containers, and continuous integration and deployment (CI/CD) are emerging. Migration to the cloud is also a growing interest, often a plan, and sometimes already an execution, and it adds to IT’s ability to support this modern architecture. Nevertheless, doing DataOps efficiently is hard and new to most organizations. The agile, evolving build-out of the new architecture must include the essential aspects of people skills and of choosing the right tools to automate the process.&lt;br&gt;
The orchestration options I see used most often today are AWS-native (Step Functions), open-source tools (mainly Apache Airflow), and managed services (such as Upsolver). I have excellent experience with all of these options, and the decision on which way to go depends on your specific use case, data sources, budget, technical capabilities, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s put it all together
&lt;/h2&gt;

&lt;p&gt;The diagram below is often overwhelming when I show it for the first time, which is why I kept it for the end. I hope that after reading the explanations and the reasoning behind it, you will find it more useful and straightforward to understand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kwfybv2uu2ay6a2i6st.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kwfybv2uu2ay6a2i6st.png" alt="Multi Data Tier Architecture" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram shows a specific project I implemented for one of my customers; the use of Step Functions for orchestration, Datadog for monitoring, or Terraform for deployment can be replaced with any of your favorite tools (Airflow, Grafana, and Jenkins, for example). The central concept of the cloud is the modularity of the architecture and the ability to add, replace, scale, or remove any part of it when the business needs it. Given the rapid pace of technological advancement we live in, as long as you stay curious and keep learning new and better technologies, you can build and operate a powerful and modern data platform. This data platform is an essential part of the &lt;strong&gt;digital&lt;/strong&gt; and &lt;strong&gt;AI transformation&lt;/strong&gt; of every company that wants to stay relevant and competitive today.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>aws</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
