Forem: Anjaiah Methuku

Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks

Anjaiah Methuku — Sun, 24 May 2026 20:35:40 +0000

Let me be brutally honest with you.

I've seen teams demo AI agents that look incredible — smooth responses, beautiful UI, stakeholders impressed. Then that same team ships to production and spends the next three weeks firefighting hallucinations they could have caught in testing.

The problem isn't the AI. The problem is nobody evaluated it properly.

Not because they didn't want to. Because the existing tools made it painful.

You're building with LangGraph on Monday. LlamaIndex RAG pipeline on Wednesday. The product team wants CrewAI by Friday. Every framework has different output shapes. Every eval tool wants you to rebuild your stack around it.

So you ship anyway. With fingers crossed.

That's the exact problem I set out to solve with Custom Evals.

What Is Custom Evals?

Custom Evals is an open-source, lightweight evaluation framework for LLM outputs with support for 17+ agent frameworks and a multi-layer metric system — from fast deterministic checks to full LLM-as-judge scoring.

pip install -e ".[dev]"

That's it, once you've cloned the repository: the -e flag installs from the local checkout. No required backend. No dashboard to stand up. No mandatory test runner.

Here's your first evaluation in 10 lines:

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})

print(f"{score.label}: {score.explanation}")
# coherent: The response provides a clear, logical explanation...

A Score object. A label. An explanation. That's the entire interface.

Why Existing Tools Leave Gaps

I want to be fair here — the existing eval tools are genuinely good. But they each have a niche.

Phoenix Evals (Arize) is brilliant if you're deep in the Arize observability ecosystem. The Custom Evals architecture is openly inspired by it. But Phoenix is a full observability platform. If you just want to score outputs without standing up a tracing infrastructure, it's overkill.

DeepEval has 50+ metrics — impressive. But it requires a specific test runner, a specific file format, and an opinionated workflow. It's a comprehensive evaluation suite, not a lightweight library.

RAGAS is surgical and excellent at RAG evaluation specifically. Faithfulness, AnswerRelevancy, ContextPrecision — the research is solid. But it's RAG-first. It doesn't cover general LLM evaluation, agent tool-use quality, or document extraction accuracy.

The gap: none of them give you a single unified interface that works across 17 different frameworks without requiring a backend.

The Architecture: Four Evaluation Layers

The interesting design choice in Custom Evals is that there's no single "evaluator." There are four distinct layers. Use any or all of them.

Custom Evals
├── Layer 1: Code-Based Metrics       (deterministic, zero LLM cost)
├── Layer 2: LLM-Based Evaluators     (LLM-as-judge, semantic quality)
├── Layer 3: NLP Similarity Metrics   (BLEU, ROUGE, cosine, Jaro-Winkler...)
└── Layer 4: OCR / Document Metrics   (for non-LLM extraction pipelines)

Layer 1 — Deterministic Checks (Free & Fast)

No LLM call. No latency. No API cost. Just math.

from custom.evals.metrics import exact_match, sentiment_score

score = exact_match({"output": "Paris", "expected": "Paris"})
# Score(score=1.0, label="exact_match")

score = sentiment_score({"output": "The product is absolutely fantastic!"})
# Score(score=0.9, label="positive")

Want to register your own? Use the decorator:

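import json

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    # Score 1.0 when the output parses as JSON, 0.0 otherwise
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
    except (json.JSONDecodeError, TypeError):
        # Catch only parse failures; a bare except would also swallow KeyboardInterrupt
        return Score(score=0.0, label="invalid", name="json_validity")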

Layer 2 — LLM-as-Judge (The Semantic Layer)

Four production-ready evaluators ship out of the box:

Evaluator	What It Measures	Needs Ground Truth?
`HallucinationEvaluator`	Does output contradict its context?	No
`CorrectnessEvaluator`	Factually correct vs expected answer?	Yes
`RelevanceEvaluator`	Does it actually answer the question?	No
`CoherenceEvaluator`	Logical flow and internal consistency?	No

Plus two RAG-specific ones: FaithfulnessEvaluator and AnswerRelevancyEvaluator.

One subtle but important detail — every evaluator declares a DIRECTION:

class HallucinationEvaluator(LLMEvaluator):
    DIRECTION = "minimize"  # Lower score = less hallucination = better ✅

class CoherenceEvaluator(LLMEvaluator):
    DIRECTION = "maximize"  # Higher score = more coherent = better ✅

This means your test thresholds work correctly regardless of metric semantics. You don't need to remember "is higher hallucination score good or bad?" — the evaluator tells you.
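
To make that concrete, here is a minimal sketch of a direction-aware threshold check. The SimpleScore class and the passes_threshold helper are hypothetical stand-ins written for this example, not part of the Custom Evals API; only the "minimize" and "maximize" values come from the snippet above.

```python
from dataclasses import dataclass


@dataclass
class SimpleScore:
    """Hypothetical stand-in for the library's Score object."""
    score: float
    direction: str  # "maximize" or "minimize"


def passes_threshold(result: SimpleScore, threshold: float) -> bool:
    """Return True when the score is on the good side of the threshold."""
    if result.direction == "maximize":
        return result.score >= threshold
    if result.direction == "minimize":
        return result.score <= threshold
    raise ValueError(f"unknown direction: {result.direction!r}")


# Coherence: higher is better, so 0.85 clears a 0.7 bar
print(passes_threshold(SimpleScore(0.85, "maximize"), 0.7))
# Hallucination: lower is better, so 0.4 fails a 0.2 bar
print(passes_threshold(SimpleScore(0.4, "minimize"), 0.2))
```

The same comparison code then works for every metric, because the direction travels with the score instead of living in the test author's head.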

Layer 3 — NLP Similarity Metrics

Seven industry-standard metrics for reference-based comparison, no LLM required:

BLEU Score — N-gram precision, the machine translation standard
ROUGE-N / ROUGE-L — Recall-oriented overlap, the summarization gold standard
Jaro-Winkler — Edit distance with prefix weighting, great for entity matching
Dice Coefficient — Bigram overlap, fast and symmetric
Token F1 Score — Precision/recall at the token level
Cosine Similarity (TF-IDF) — Vector-space document comparison

from custom.evals.metrics import bleu_score, rouge_n, cosine_similarity_tfidf

result = bleu_score({
    "output": "The model predicts outcomes accurately",
    "expected": "The model accurately predicts outcomes"
})
print(result.score)  # 0.71
print(result.metadata)  # {"brevity_penalty": 1.0, "n_gram_precisions": [...]}

All seven return the same standardized Score object. Mix and match freely.
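
For intuition about what the BLEU number measures, here is a small from-scratch sketch. It is not the library's implementation: the source does not say which n-gram order or smoothing bleu_score uses, so max_n=2 below is an assumption. With unigrams and bigrams only, the example pair above works out to about 0.71; an unsmoothed 4-gram BLEU on the same pair would be 0, because no trigram matches.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def bleu(candidate_text, reference_text, max_n=2):
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    cand, ref = candidate_text.split(), reference_text.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed BLEU collapses to zero if any order has no match
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


score = bleu("The model predicts outcomes accurately",
             "The model accurately predicts outcomes")
print(round(score, 2))  # 0.71 (unigram precision 1.0, bigram precision 0.5)
```

The takeaway: BLEU rewards word order as well as word choice, so a reordered but otherwise identical sentence loses points at the bigram level.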

Layer 4 — Document Extraction & OCR Metrics

This is the most underrated part of the framework. Not everything you evaluate is an LLM.

If AWS Textract, Google Cloud Vision, or Azure Form Recognizer is in your pipeline, you need evaluation metrics for those outputs too:

text_extraction_accuracy — Fuzzy sequence similarity
character_error_rate (CER) — Standard OCR benchmarking metric
word_error_rate (WER) — Used in document parsing and speech-to-text
bounding_box_iou — Intersection over Union for spatial accuracy
field_extraction_f1 — Precision/recall for structured form fields

from custom.evals.metrics import text_extraction_accuracy, character_error_rate, bounding_box_iou

eval_input = {
    "output": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "expected": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "output_bbox": {"x": 10, "y": 20, "width": 100, "height": 30},
    "expected_bbox": {"x": 10, "y": 20, "width": 100, "height": 30}
}

print(f"Accuracy: {text_extraction_accuracy(eval_input).score:.2%}")  # 100.00%
print(f"CER: {character_error_rate(eval_input).metadata['raw_cer']:.2%}")  # 0.00%
print(f"IoU: {bounding_box_iou(eval_input).score:.2f}")                   # 1.00

None of the pure-LLM evaluation frameworks address this. Custom Evals does.
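
If you want to see what two of these metrics compute, here is a from-scratch sketch of character error rate and bounding-box IoU. The function names (levenshtein, cer, iou) are mine and the library's own implementation may differ; the box format mirrors the x/y/width/height dicts in the example above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(output: str, expected: str) -> float:
    """Character error rate: edits needed divided by reference length."""
    if not expected:
        return 0.0 if not output else 1.0
    return levenshtein(output, expected) / len(expected)


def iou(box_a: dict, box_b: dict) -> float:
    """Intersection over Union for boxes given as x, y, width, height."""
    left = max(box_a["x"], box_b["x"])
    top = max(box_a["y"], box_b["y"])
    right = min(box_a["x"] + box_a["width"], box_b["x"] + box_b["width"])
    bottom = min(box_a["y"] + box_a["height"], box_b["y"] + box_b["height"])
    intersection = max(0, right - left) * max(0, bottom - top)
    union = (box_a["width"] * box_a["height"]
             + box_b["width"] * box_b["height"] - intersection)
    return intersection / union if union else 0.0


# One substituted character out of seven (a lowercase l read instead of I)
print(f"CER: {cer('lnvoice', 'Invoice'):.2%}")
# A box shifted right by half its width overlaps one third of the union
expected_box = {"x": 10, "y": 20, "width": 100, "height": 30}
shifted_box = {"x": 60, "y": 20, "width": 100, "height": 30}
print(f"IoU: {iou(shifted_box, expected_box):.2f}")
```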

17+ Framework Integrations — The Full Picture

The pattern is the same across every framework:

# 1. Run your framework-specific agent (different per framework)
result = your_agent.run(query)
response = extract_response(result)  # framework-specific extraction

# 2. Evaluate with Custom Evals (always the same)
eval_input = {
    "input": query,
    "output": response,
    "context": relevant_context,  # optional
    "expected": ground_truth_answer  # optional
}

score = evaluator.evaluate(eval_input)

The eval_input dict is the universal adapter. Every integration reduces to filling this dict.
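
As a rough illustration of "filling this dict", the sketch below wraps the two steps in helpers. Both build_eval_input and extract_text are hypothetical names invented for this example, and the result shapes extract_text handles are guesses at common patterns, not a survey of what each framework actually returns.

```python
from typing import Any, Optional


def build_eval_input(query: str, response: str,
                     context: Optional[str] = None,
                     expected: Optional[str] = None) -> dict[str, Any]:
    """Assemble the eval_input dict, leaving out optional keys that are unset."""
    eval_input = {"input": query, "output": response}
    if context is not None:
        eval_input["context"] = context
    if expected is not None:
        eval_input["expected"] = expected
    return eval_input


def extract_text(result: Any) -> str:
    """Hypothetical extractor covering a few common result shapes."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        for key in ("output", "answer", "content"):
            if key in result:
                return str(result[key])
    for attr in ("content", "text", "output"):
        if hasattr(result, attr):
            return str(getattr(result, attr))
    raise TypeError(f"don't know how to extract text from {type(result).__name__}")


raw_result = {"answer": "Paris", "latency_ms": 412}
print(build_eval_input("Capital of France?", extract_text(raw_result), expected="Paris"))
```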

Here's a quick tour of what's covered:

☁️ Cloud Platforms

AWS Strands Agents (Bedrock + Claude)
Google ADK (Gemini 1.5 Flash/Pro)
Databricks Agent Bricks SDK (native MLflow experiment tracking included)

🏢 Microsoft Ecosystem

Microsoft Agent Framework
Semantic Kernel (plugin output boundaries)
Autogen (individual turns and full conversation outcomes)

🔗 LangChain & LlamaIndex

LangGraph (stateful graph evaluation)
LlamaIndex Workflows (event-driven hooks)
LangChain RAG + LlamaIndex RAG (full faithfulness/relevancy stack)

🤖 OpenAI

OpenAI Agents Framework (tool calls, handoffs)
OpenAI Agents SDK (function calling, structured outputs)
OpenAI Assistants (threads and run-based responses)
OpenAI Swarm (experimental multi-agent)

🌍 Community Frameworks

Agno (multi-agent at scale)
CrewAI (role-based agent outputs)
Pydantic AI (type-safe, structured outputs)

A Real Production Pipeline (Async + Concurrent)

Here's what running evaluation at scale actually looks like:

import asyncio
from custom.evals import (
    HallucinationEvaluator,
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    RelevanceEvaluator
)
from custom.evals.llm import LLM
from custom.evals.metrics import bleu_score, rouge_n

async def evaluate_rag_batch(queries, rag_pipeline):
    llm = LLM(provider="openai", model="gpt-4o-mini")
    evaluators = {
        "hallucination": HallucinationEvaluator(llm),
        "faithfulness": FaithfulnessEvaluator(llm),
        "answer_relevancy": AnswerRelevancyEvaluator(llm),
        "relevance": RelevanceEvaluator(llm),
    }
    results = []

    for query in queries:
        response = rag_pipeline.query(query.text)
        eval_input = {
            "input": query.text,
            "output": response.answer,
            "context": "\n".join(response.source_nodes),
            "expected": query.expected_answer
        }

        # Run all LLM evaluations concurrently — not one-by-one
        llm_scores = await asyncio.gather(*[
            evaluators[name].evaluate_async(eval_input)
            for name in evaluators
        ])

        row = {"query": query.text}
        for name, score in zip(evaluators.keys(), llm_scores):
            row[f"{name}_score"] = score.score
            row[f"{name}_label"] = score.label

        # Add deterministic metrics at zero cost
        if query.expected_answer:
            row["bleu"] = bleu_score(eval_input).score
            row["rouge_1"] = rouge_n(eval_input).score

        results.append(row)

    return results

results = asyncio.run(evaluate_rag_batch(test_queries, my_rag_pipeline))

# Aggregate and report
import statistics
faithfulness_scores = [r["faithfulness_score"] for r in results]
print(f"Mean Faithfulness: {statistics.mean(faithfulness_scores):.3f}")
print(f"Failing: {sum(1 for s in faithfulness_scores if s < 0.7)}/{len(results)}")

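Concurrent async LLM calls per query, plus fast deterministic checks. One caveat: this loop still handles queries one at a time and calls rag_pipeline.query synchronously, so only the four judge calls for each query overlap. To remove the remaining serial bottleneck, run the per-query work concurrently as well.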
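
One way to overlap work across queries too, while keeping a lid on how many requests hit the LLM provider at once, is a semaphore around each query's evaluation. This is a sketch under assumptions: fake_judge stands in for evaluator.evaluate_async so the example runs without an API key, and the helper names and the max_in_flight default are illustrative.

```python
import asyncio


async def fake_judge(name: str, eval_input: dict) -> dict:
    """Stand-in for evaluator.evaluate_async; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return {"metric": name, "score": 1.0}


async def evaluate_one(eval_input: dict, judges: list, limiter: asyncio.Semaphore) -> dict:
    """Score one query with every judge while holding a concurrency slot."""
    async with limiter:
        scores = await asyncio.gather(*(fake_judge(name, eval_input) for name in judges))
    row = {"query": eval_input["input"]}
    row.update({f"{s['metric']}_score": s["score"] for s in scores})
    return row


async def evaluate_all(eval_inputs: list, judges: list, max_in_flight: int = 5) -> list:
    """Run all queries concurrently, capped at max_in_flight at a time."""
    limiter = asyncio.Semaphore(max_in_flight)
    tasks = [evaluate_one(item, judges, limiter) for item in eval_inputs]
    return await asyncio.gather(*tasks)  # gather preserves input order


inputs = [{"input": f"question {i}", "output": "answer"} for i in range(20)]
rows = asyncio.run(evaluate_all(inputs, ["faithfulness", "relevance"]))
print(len(rows), rows[0])
```

If the RAG pipeline itself only offers a blocking call, asyncio.to_thread can move it off the event loop so it does not stall the other tasks.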

The Ground Truth Problem (And How It's Handled)

Here's a question most eval frameworks dodge: what if I don't have ground truth?

In production, users ask unpredictable questions. You can't pre-label every possible answer. Custom Evals explicitly handles three scenarios:

Reference-free (no ground truth needed):

# Hallucination only requires context, not an expected answer
score = hallucination_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600."
    # No "expected" key — still meaningful evaluation
})

Soft ground truth (intent-based):

# Answer relevancy checks if the answer addresses the question's intent
score = answer_relevancy_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600."
    # No expected answer — evaluates relevance to the question
})

Hard ground truth (known correct answers):

# Full correctness + BLEU/ROUGE when you have labeled data
score = correctness_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "expected": "William Shakespeare"
})

This matters. Evaluation infrastructure that only works with labeled datasets is evaluation infrastructure you won't actually use in production.
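
The three scenarios boil down to "run whatever the available keys allow". Here is a hedged sketch of that routing; select_evaluators is a hypothetical helper, and the returned strings are just labels for the evaluators named in this article, not identifiers the library defines.

```python
def select_evaluators(eval_input: dict) -> list[str]:
    """Pick evaluator labels based on which optional keys are present."""
    selected = ["relevance", "coherence"]              # need only input and output
    if eval_input.get("context"):
        selected += ["hallucination", "faithfulness"]  # reference-free, context-based
    if eval_input.get("expected"):
        selected += ["correctness", "bleu", "rouge"]   # require a labeled answer
    return selected


production_sample = {
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600.",
}
print(select_evaluators(production_sample))
```

Unlabeled production traffic still gets four checks this way, and labeled test sets pick up the reference-based metrics automatically.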

Observability: Beyond Individual Scores

Custom Evals integrates with Phoenix Tracing (Arize) for production monitoring. One initialization line instruments everything:

from custom.evals import initialize_tracing

initialize_tracing(
    phoenix_endpoint="http://localhost:6006/v1/traces",
    metrics_enabled=True,
    metrics_export_interval=30000  # export every 30 seconds
)

After this, every evaluator call automatically:

Creates an OpenTelemetry span with timing + attributes
Increments evaluation counters
Records score distributions
Computes P50/P95/P99 latency histograms

Real-time dashboards showing score distributions over time — proactive monitoring instead of reactive debugging.

Framework Comparison

Feature	Custom Evals	Phoenix Evals	DeepEval	RAGAS
Installation	Low friction	Medium	Medium	Low
Agent framework support	17+	Limited	Limited	Limited
LLM-as-judge metrics	6	Many	50+	4
Deterministic NLP metrics	10+	Few	Few	Few
OCR/Document evaluation	✅	❌	❌	❌
OpenTelemetry tracing	Optional	Core	❌	❌
No backend required	✅	❌	✅	✅

The honest take: Custom Evals isn't trying to replace DeepEval or RAGAS. It's designed to be the evaluation layer you can plug into any stack. Run it alongside DeepEval for deeper metric coverage. Run it alongside Phoenix for full observability. It's composable by design.

Get Started in 5 Minutes

# Step 1: Install (run from a clone of the repository)
pip install -e ".[dev]"

# Step 2: Set your API key
export OPENAI_API_KEY="sk-..."

# Step 3: Run your first real evaluation
from custom.evals import HallucinationEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = HallucinationEvaluator(llm)

score = evaluator.evaluate({
    "input": "What year was Python created?",
    "output": "Python was created in 1991 by Guido van Rossum.",
    "context": "Python was created in 1991 by Guido van Rossum."
})

print(f"Result: {score.label}")        # factual
print(f"Score: {score.score}")         # 0.0 (no hallucination = good)
print(f"Reason: {score.explanation}")

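# Step 4: Add free deterministic checks
from custom.evals.metrics import bleu_score, rouge_n

my_answer = "Python was created in 1991 by Guido van Rossum."
ground_truth = "Guido van Rossum created Python in 1991."

for metric in [bleu_score, rouge_n]:
    result = metric({"output": my_answer, "expected": ground_truth})
    print(f"{result.name}: {result.score:.3f}")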

What's Coming Next

The roadmap has some meaningful additions in the pipeline:

Context Precision & Recall — The two RAGAS metrics that complete the standard RAG evaluation quadrant
Safety Metrics — Bias and toxicity detection
Agentic Metrics — Tool call correctness, task completion rate, step efficiency for multi-agent systems
Extended Provider Support — Cohere, Mistral, Ollama (the strategy pattern makes this straightforward)

Wrapping Up

The LLM evaluation space is fragmented. Teams are building on different stacks. Frameworks produce different output shapes. Use cases demand different metrics.

Custom Evals is an honest acknowledgment of that fragmentation — and a practical response to it.

It won't be the only eval library you ever use. It will be the one you can actually drop into any stack without rebuilding your infrastructure around it.

Because in a world where you're choosing between 17 agent frameworks on any given sprint, having a single evaluation interface that works across all of them isn't a nice-to-have.

It's the difference between knowing your agent works and hoping it does.

Resources & Links

🔗 GitHub: anjijava16/cust-evals
📖 Framework Index: docs/FRAMEWORK_INDEX.md — all 17+ integrations
🚀 Quick Start: guides/QUICKSTART.md
📄 Non-LLM Evaluation: docs/NON_LLM_EVALUATION_GUIDE.md — OCR and Textract guide
📊 Advanced Metrics: docs/ADVANCED_METRICS_GUIDE.md — BLEU, ROUGE, and beyond

Building evaluation pipelines for AI systems? What metrics have you found most actionable in production? Drop your experience in the comments — genuinely curious what's working for others.

Deep Dive: Connecting AI to Snowflake with Model Context Protocol (MCP)

Anjaiah Methuku — Sun, 17 May 2026 01:31:24 +0000

The Model Context Protocol (MCP) lets AI assistants like Claude talk directly to Snowflake in real time — no custom API glue needed. This guide covers architecture patterns, RSA key-pair auth, Snowflake RBAC setup, production-tested SQL query patterns, and a full deployment checklist.

By anjijava16 · GitHub: mcp_servers

Data teams spend hours writing SQL queries, pivoting spreadsheets, and waiting for analysts to pull numbers. What if your AI assistant could talk directly to your Snowflake data warehouse — securely, in real time, with full natural language support?

That's exactly what the Model Context Protocol (MCP) enables. In this deep dive, we'll go far beyond a basic setup guide and explore architecture decisions, security hardening, real-world query patterns, performance tuning, and production deployment for an MCP-Snowflake integration.

Key Takeaways

MCP is a universal, open-standard adapter layer between any LLM and external data sources
Three deployment patterns exist: local stdio, SSE server, and cloud-hosted gateway
RSA key-pair auth is strongly preferred over passwords in production
A dedicated, minimal-permission Snowflake role limits blast radius if credentials are compromised
Tool filtering (--exclude_tools) prevents the AI from running writes or DDL

What is the Model Context Protocol (MCP)?

MCP is an open standard created by Anthropic that defines how AI systems communicate with external tools, data sources, and APIs. Think of it as a universal adapter layer — like USB-C for AI integrations.

┌──────────────────────────────────────────────────────────────┐
│                     AI Assistant (Claude)                    │
└──────────────────┬───────────────────────────────────────────┘
                   │  MCP Protocol (JSON-RPC over stdio/SSE)
┌──────────────────▼───────────────────────────────────────────┐
│                    MCP Server (Python)                       │
│   - Tool definitions (list_tables, run_query, etc.)         │
│   - Input validation & sanitization                         │
│   - Query execution & result formatting                     │
└──────────────────┬───────────────────────────────────────────┘
                   │  Snowflake Connector (Python SDK)
┌──────────────────▼───────────────────────────────────────────┐
│                    Snowflake Data Warehouse                  │
│   - Virtual Warehouses, Databases, Schemas, Tables          │
└──────────────────────────────────────────────────────────────┘

Why MCP Over Raw API Calls?

Capability	MCP	Direct API
Standardized protocol	✅	❌
Tool discovery at runtime	✅ Auto	❌ Manual
Streaming support	✅	Partial
Multi-LLM compatible	✅ Any MCP client	❌ Vendor-specific
Safety controls built-in	✅	❌
Swappable backends	✅	❌

MCP tools are discoverable at runtime — the AI asks "what can you do?" and the server responds with a structured list of capabilities. No custom prompt engineering needed.
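
Under the hood, that "what can you do?" exchange is a JSON-RPC 2.0 call to the tools/list method. The sketch below shows the message shapes; the two tool entries in the sample response are illustrative, written for this example rather than captured from a real server.

```python
import json

# JSON-RPC 2.0 request a client sends to discover tools
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Illustrative response; the tool entries are examples, not real server output
sample_response = json.loads("""
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "list_tables",
        "description": "List tables in a schema",
        "inputSchema": {
          "type": "object",
          "properties": {"schema": {"type": "string"}},
          "required": ["schema"]
        }
      },
      {
        "name": "run_query",
        "description": "Execute a SELECT query and return results",
        "inputSchema": {
          "type": "object",
          "properties": {"sql": {"type": "string"}},
          "required": ["sql"]
        }
      }
    ]
  }
}
""")


def tool_names(response: dict) -> list[str]:
    """Pull the advertised tool names out of a tools/list response."""
    return [tool["name"] for tool in response["result"]["tools"]]


print(json.dumps(list_tools_request))
print(tool_names(sample_response))
```

Each tool carries a JSON Schema for its inputs, which is what lets the model construct valid calls without hand-written prompt instructions.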

Architecture: Three Deployment Patterns

Pattern 1: Local stdio (Development)

Best for: Local development, Claude Desktop, personal use


Claude Desktop ──stdio──► Python MCP Server ──► Snowflake

The MCP server runs as a child process of Claude Desktop, communicating over stdin/stdout. Zero networking complexity.

Pattern 2: SSE Server (Team Use)

Best for: Shared team access, web UIs, multiple concurrent users


Web App / Multiple Clients ──HTTP SSE──► FastMCP Server ──► Snowflake

The server runs as a persistent HTTP service. Multiple users connect simultaneously.

Pattern 3: Cloud-Hosted (Production)

Best for: Enterprise, security compliance, high availability


Claude / Any LLM ──HTTPS──► MCP Gateway (Auth + Rate Limit) ──► MCP Server ──► Snowflake

Production deployments add authentication middleware, rate limiting, and audit logging between the LLM and the MCP server.
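
To give a feel for the rate-limiting piece of such a gateway, here is a minimal token-bucket sketch. It is one common technique, not something the article's gateway is confirmed to use; the TokenBucket class and its parameters are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so tests need not sleep
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means the caller is throttled."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Drive the bucket with a fake clock so the behaviour is deterministic
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
print(burst)           # [True, True, False]
fake_now[0] += 1.0     # one second later, one token has been refilled
after_wait = bucket.allow()
print(after_wait)      # True
```

In a real gateway you would likely keep one bucket per client or API key, so a single noisy user cannot exhaust the warehouse for everyone.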

Prerequisites

Before starting, ensure you have:

Python 3.10–3.12 (3.13+ not yet supported by all Snowflake connector versions)
A Snowflake account with credentials:
- Account identifier (e.g., xy12345.us-east-1)
- Warehouse name
- Username
- Password or private key (RSA key-pair auth recommended)
- Role with appropriate permissions
- Database and Schema
Claude Desktop or any MCP-compatible client

Step 1 — Isolated Python Environment

Always use a dedicated virtual environment. Mixing dependencies pollutes your system Python and causes hard-to-debug version conflicts.


# Create a clean environment
python -m venv snowflake_mcp_env

# Activate it
# macOS / Linux:
source snowflake_mcp_env/bin/activate

# Windows (PowerShell):
.\snowflake_mcp_env\Scripts\Activate.ps1

Pro tip: Name your environment descriptively (not just venv) — it matters when you have multiple projects open.

Step 2 — Install Dependencies


# Core MCP Snowflake server
pip install mcp-snowflake-server

# Recommended: lock versions for reproducibility
pip freeze > requirements.txt

What gets installed:

mcp-snowflake-server — the MCP server implementation
snowflake-connector-python — Snowflake's official Python driver
fastmcp — the MCP framework underlying the server

Step 3 — Configuration Deep Dive

The MCP server is configured via a JSON config file. Here's a minimal configuration and then a production-hardened version.

Minimal Configuration (Local Dev)


{
  "mcpServers": {
    "snowflake": {
      "command": "/path/to/snowflake_mcp_env/bin/python",
      "args": [
        "-m", "mcp_snowflake_server",
        "--account",   "xy12345.us-east-1",
        "--warehouse", "COMPUTE_WH",
        "--user",      "analyst_user",
        "--password",  "supersecret",
        "--role",      "ANALYST_ROLE",
        "--database",  "PROD_DB",
        "--schema",    "PUBLIC"
      ]
    }
  }
}

Problem: Hardcoded Credentials

Never commit credentials to git. Keep the real values in environment variables and leave only placeholders in the config file (each <SNOWFLAKE_...> entry below stands for the value of the matching variable):

{
  "mcpServers": {
    "snowflake": {
      "command": "/path/to/snowflake_mcp_env/bin/python",
      "args": [
        "-m", "mcp_snowflake_server",
        "--account",   "<SNOWFLAKE_ACCOUNT>",
        "--warehouse", "<SNOWFLAKE_WAREHOUSE>",
        "--user",      "<SNOWFLAKE_USER>",
        "--password",  "<SNOWFLAKE_PASSWORD>",
        "--role",      "<SNOWFLAKE_ROLE>",
        "--database",  "<SNOWFLAKE_DATABASE>",
        "--schema",    "<SNOWFLAKE_SCHEMA>"
      ]
    }
  }
}

Set environment variables:


# macOS / Linux: add to ~/.zshrc or ~/.bashrc
export SNOWFLAKE_ACCOUNT="xy12345.us-east-1"
export SNOWFLAKE_WAREHOUSE="COMPUTE_WH"
export SNOWFLAKE_USER="analyst_user"
export SNOWFLAKE_PASSWORD="supersecret"
export SNOWFLAKE_ROLE="ANALYST_ROLE"
export SNOWFLAKE_DATABASE="PROD_DB"
export SNOWFLAKE_SCHEMA="PUBLIC"

# Windows (PowerShell)
$env:SNOWFLAKE_ACCOUNT = "xy12345.us-east-1"
$env:SNOWFLAKE_PASSWORD = "supersecret"
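
One way to connect those variables to the server's command-line flags is a small launcher that reads the environment and builds the argument list. This is a sketch, not part of mcp-snowflake-server: build_server_args and FLAG_TO_ENV are hypothetical names, and the flags are simply the ones used in the config above.

```python
import os

# Maps each CLI flag from the config above to the variable holding its value
FLAG_TO_ENV = {
    "--account": "SNOWFLAKE_ACCOUNT",
    "--warehouse": "SNOWFLAKE_WAREHOUSE",
    "--user": "SNOWFLAKE_USER",
    "--password": "SNOWFLAKE_PASSWORD",
    "--role": "SNOWFLAKE_ROLE",
    "--database": "SNOWFLAKE_DATABASE",
    "--schema": "SNOWFLAKE_SCHEMA",
}


def build_server_args(env=None) -> list[str]:
    """Build the server argument list from environment variables,
    failing fast if any of them is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in FLAG_TO_ENV.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    args = ["-m", "mcp_snowflake_server"]
    for flag, name in FLAG_TO_ENV.items():
        args += [flag, env[name]]
    return args


# Demo with an explicit dict so the example runs without real credentials
demo_env = {name: f"demo-{name.lower()}" for name in FLAG_TO_ENV.values()}
print(build_server_args(demo_env)[:4])
```

A wrapper script could hand the result to subprocess.run together with sys.executable. Keep in mind that anything passed as a command-line argument is visible in process listings, which is one more reason to prefer the key-pair setup in the next step.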

Step 4 — Advanced Authentication: RSA Key-Pair (Recommended)

Password-based auth is convenient but less secure. For production, use RSA key-pair authentication — no password travels over the network.

Generate RSA Key Pair


# Generate private key (encrypted with passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8

# Extract public key
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Register Public Key with Snowflake


-- Run in Snowflake worksheet
ALTER USER analyst_user SET RSA_PUBLIC_KEY='MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA...';

Update MCP Configuration for Key Auth


{
  "mcpServers": {
    "snowflake": {
      "command": "/path/to/snowflake_mcp_env/bin/python",
      "args": [
        "-m", "mcp_snowflake_server",
        "--account",                "<SNOWFLAKE_ACCOUNT>",
        "--user",                   "<SNOWFLAKE_USER>",
        "--private_key_path",       "<SNOWFLAKE_PRIVATE_KEY_PATH>",
        "--private_key_passphrase", "<SNOWFLAKE_KEY_PASSPHRASE>",
        "--role",                   "<SNOWFLAKE_ROLE>",
        "--warehouse",              "<SNOWFLAKE_WAREHOUSE>",
        "--database",               "<SNOWFLAKE_DATABASE>",
        "--schema",                 "<SNOWFLAKE_SCHEMA>"
      ]
    }
  }
}

Step 5 — Snowflake Security Model for MCP

Create a Dedicated Read-Only Role

Never give the MCP server admin privileges. Create a minimal-permission role:


-- Create a dedicated MCP role
CREATE ROLE IF NOT EXISTS MCP_ANALYST_ROLE;

-- Grant read-only access to specific schemas
GRANT USAGE ON DATABASE PROD_DB TO ROLE MCP_ANALYST_ROLE;
GRANT USAGE ON SCHEMA PROD_DB.PUBLIC TO ROLE MCP_ANALYST_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA PROD_DB.PUBLIC TO ROLE MCP_ANALYST_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA PROD_DB.PUBLIC TO ROLE MCP_ANALYST_ROLE;

-- Grant warehouse usage (this allows running queries; it does not cap credits)
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE MCP_ANALYST_ROLE;

-- Assign role to user
GRANT ROLE MCP_ANALYST_ROLE TO USER analyst_user;

Limit Idle Cost and Query Runtime

Prevent idle warehouses and runaway queries from consuming excessive credits:

-- Suspend the warehouse after 60 seconds of inactivity to limit idle billing
ALTER WAREHOUSE COMPUTE_WH SET AUTO_SUSPEND = 60;

-- Optionally cap statement runtime for the MCP user
ALTER USER analyst_user SET STATEMENT_TIMEOUT_IN_SECONDS = 120;

Step 6 — Available MCP Tools and How to Use Them

Once connected, the MCP server exposes these tools to Claude:

Core Tools

Tool	Description	Read/Write
`list_databases`	List accessible databases	Read
`list_schemas`	List schemas in a database	Read
`list_tables`	List tables in a schema	Read
`describe_table`	Get column names, types, and sample values	Read
`run_query`	Execute a SELECT query and return results	Read
`get_sample_data`	Fetch a small sample from a table	Read

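Tool Filtering (Exclude Dangerous Tools)

"args": [
  "--exclude_tools", "run_query",
  "--exclude_tools", "execute_ddl"
]

Use --exclude_tools to restrict what the AI can do. For read-only deployments, exclude any write or DDL tools. Two caveats about the example above: excluding run_query removes ad hoc SELECT access entirely and leaves only the metadata tools, so drop that entry if the assistant should query data; and execute_ddl does not appear in the Core Tools table, so confirm the tool names your server version actually exposes before relying on the filter.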
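
Tool filtering decides which tools exist; a second, cheap layer is to check the SQL itself before it is sent. Below is a deliberately conservative sketch of such a guard. The is_read_only helper and its keyword list are assumptions for illustration, it is naive about semicolons inside string literals, and it complements the read-only role from Step 5 rather than replacing it.

```python
import re

READ_ONLY_KEYWORDS = {"SELECT", "WITH", "SHOW", "DESCRIBE", "DESC", "EXPLAIN"}


def strip_comments(sql: str) -> str:
    """Remove /* block */ and -- line comments (naive: ignores string literals)."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    return re.sub(r"--[^\n]*", " ", sql)


def is_read_only(sql: str) -> bool:
    """Conservatively accept a single statement that starts with a read-only keyword."""
    statement = strip_comments(sql).strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    first_word = re.match(r"[A-Za-z]+", statement)
    return bool(first_word) and first_word.group(0).upper() in READ_ONLY_KEYWORDS


for candidate in ("SELECT * FROM orders LIMIT 5;",
                  "DROP TABLE orders",
                  "SELECT 1; DELETE FROM orders"):
    print(is_read_only(candidate), "<-", candidate)
```

The guard errs on the side of rejecting: a legitimate query with a semicolon inside a string literal is refused, which is the safer failure mode for an AI-facing endpoint.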

Step 7 — Real-World Query Patterns

Here are production-tested prompts and the SQL they generate:

Business Intelligence Queries

Prompt:

"Show me the top 10 products by revenue for Q1 2025, broken down by region"

Generated SQL:


SELECT
    p.product_name,
    o.region,
    SUM(o.revenue) AS total_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE QUARTER(o.order_date) = 1 AND YEAR(o.order_date) = 2025
GROUP BY p.product_name, o.region
ORDER BY total_revenue DESC
LIMIT 10;

Trend Analysis

Prompt:

"What's the month-over-month customer churn trend for the past 6 months?"

Generated SQL:


SELECT
    DATE_TRUNC('month', cancelled_at) AS churn_month,
    COUNT(*) AS churned_customers,
    LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', cancelled_at)) AS prev_month,
    ROUND(
        (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', cancelled_at)))
        / NULLIF(LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', cancelled_at)), 0) * 100, 2
    ) AS mom_change_pct
FROM subscriptions
WHERE cancelled_at >= DATEADD('month', -6, CURRENT_DATE)
GROUP BY 1
ORDER BY 1;
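
If the window-function arithmetic is hard to read, the same month-over-month calculation looks like this in plain Python. It is only an illustration of the logic, with a hypothetical month_over_month helper and made-up counts; the real work stays in Snowflake.

```python
def month_over_month(counts: list[int]) -> list:
    """Percent change versus the previous month; None where no baseline exists."""
    if not counts:
        return []
    changes = [None]  # first month has no previous month, like LAG() returning NULL
    for previous, current in zip(counts, counts[1:]):
        if previous == 0:
            changes.append(None)  # mirrors NULLIF(..., 0) guarding the division
        else:
            changes.append(round((current - previous) / previous * 100, 2))
    return changes


monthly_churn = [100, 110, 99]
print(month_over_month(monthly_churn))  # [None, 10.0, -10.0]
```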

Anomaly Detection

Prompt:

"Flag any orders where the total amount is more than 3 standard deviations from the average"

Generated SQL:

WITH stats AS (
    SELECT
        AVG(total_amount) AS avg_amount,
        STDDEV(total_amount) AS std_amount
    FROM orders
    WHERE order_date >= DATEADD('month', -3, CURRENT_DATE)
)
SELECT o.*
FROM orders o, stats s
WHERE ABS(o.total_amount - s.avg_amount) > 3 * s.std_amount
ORDER BY o.total_amount DESC;
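
The same three-sigma rule is easy to sanity-check locally on a small export before trusting it on the full table. A minimal sketch, assuming a hypothetical flag_outliers helper and invented sample amounts; it uses the sample standard deviation, which is what Snowflake's STDDEV computes.

```python
import statistics


def flag_outliers(amounts: list[float], z_threshold: float = 3.0) -> list[float]:
    """Return amounts further than z_threshold sample standard deviations
    from the mean, largest first."""
    if len(amounts) < 2:
        return []
    mean = statistics.mean(amounts)
    std = statistics.stdev(amounts)  # sample standard deviation
    if std == 0:
        return []
    return sorted((a for a in amounts if abs(a - mean) > z_threshold * std),
                  reverse=True)


order_totals = [100.0] * 20 + [1000.0]
print(flag_outliers(order_totals))  # [1000.0]
```

One thing this makes visible: with very few rows, nothing can be flagged. A single extreme value among n rows reaches at most (n - 1) / sqrt(n) sample standard deviations, which only exceeds 3 once n is at least 11.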

Step 8 — Logging and Debugging

Enable logging for troubleshooting:

"args": [
  "--log_dir",   "/var/log/mcp_snowflake",
  "--log_level", "DEBUG"
]

Log files capture:

All tool calls from the AI
SQL queries executed
Snowflake connection events
Error tracebacks

Verify the Connection Manually


# Activate environment
source snowflake_mcp_env/bin/activate

# Test connection directly
python -c "
import snowflake.connector
conn = snowflake.connector.connect(
    account='xy12345.us-east-1',
    user='analyst_user',
    password='supersecret',
    warehouse='COMPUTE_WH',
    database='PROD_DB',
    schema='PUBLIC'
)
cursor = conn.cursor()
cursor.execute('SELECT CURRENT_VERSION()')
print(cursor.fetchone())
conn.close()
"

Step 9 — Common Pitfalls and Fixes

Python Version Incompatibility


ERROR: Could not install mcp-snowflake-server (requires Python ≤ 3.12)

Fix: Use Python 3.10, 3.11, or 3.12. Check with python --version.

Incorrect Python Path in Config


ERROR: spawn /path/to/python ENOENT

Fix: Always use the absolute path to the virtual environment's Python:


# Get the correct path
which python  # (after activating the venv)
# Output: /Users/yourname/snowflake_mcp_env/bin/python

Missing Snowflake Permissions


SQL compilation error: Object 'PROD_DB.PUBLIC.ORDERS' does not exist or not authorized

Fix: Verify role permissions:


SHOW GRANTS TO ROLE MCP_ANALYST_ROLE;

Network / Firewall Block


snowflake.connector.errors.DatabaseError: Failed to connect to DB

Fix: Ensure outbound HTTPS (port 443) is open to *.snowflakecomputing.com.

Warehouse Suspended / Auto-Start Delay

Fix: Enable auto-resume:


ALTER WAREHOUSE COMPUTE_WH SET AUTO_RESUME = TRUE;

Step 10 — Production Deployment Checklist

Before going to production, verify each item:


Security
  ☐ RSA key-pair authentication (no passwords)
  ☐ Credentials stored in secrets manager (AWS Secrets Manager / Azure Key Vault)
  ☐ Dedicated read-only role with minimal permissions
  ☐ No DDL/DML tool access
  ☐ Query timeout set on user/session

Reliability
  ☐ Warehouse auto-resume and auto-suspend configured
  ☐ Connection pool configured
  ☐ Error handling and retry logic in place
  ☐ Health check endpoint (if SSE server)

Observability
  ☐ Logging enabled with appropriate level
  ☐ Log rotation configured
  ☐ Snowflake query history monitored (SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY)

Compliance
  ☐ Audit trail for all AI-generated queries
  ☐ Data masking on sensitive columns (PII, PCI)
  ☐ Network policy limiting MCP server source IPs

Reference Implementations

All code discussed in this article is available on GitHub:

🔗 Multi-Database MCP Collection

A comprehensive collection of MCP server implementations including Snowflake, PostgreSQL, MySQL, Redis, MongoDB, and more:

👉 github.com/anjijava16/mcp_servers

🔗 Snowflake MCP Server (Development Branch)

The active Snowflake-specific implementation with the latest features:

👉 anjijava16/mcp_servers — mcp_snowflake_server

🔗 Reference: Standalone Snowflake MCP Server

A clean reference implementation following MCP standards closely:

👉 isaacwasserman/mcp-snowflake-server

What's Next?

Once the Snowflake MCP server is running in production, natural extensions include:

Extension	Description
Multi-database federation	Connect MCP to Snowflake + PostgreSQL + MongoDB simultaneously via one gateway
dbt semantic layer	Route AI queries through dbt metrics instead of raw tables — business-safe by construction
Google ADK integration	Expose MCP servers as tools inside Google Agent Development Kit agents
MLflow observability	Trace every MCP tool call through the OTel Collector into MLflow experiments
Streaming large results	Handle million-row result sets with SSE streaming instead of buffering
Prompt injection hardening	Add an LLM-based classifier to detect and reject adversarial tool inputs

Summary

What we covered	Key takeaway
MCP architecture	Universal adapter — one protocol for Snowflake, Postgres, Redis, and everything else
Three deployment patterns	stdio (dev) → SSE server (team) → Cloud gateway (enterprise)
RSA key-pair auth	No credential over the wire; rotate the key file, not a password
Snowflake RBAC	Dedicated minimal-permission role + network policy + warehouse caps
Real-world query patterns	BI aggregations, MoM trends, z-score anomaly detection
Logging and debugging	Structured logs + manual connection verification before deploy
Common pitfalls	Python version, absolute paths, role grants, firewall, warehouse auto-resume
MLflow observability (next step)	Close the loop: every MCP tool call becomes an observable trace
Production checklist	15-point checklist across security, reliability, observability, compliance

The combination becomes even more powerful when paired with an observability stack: every natural language question, every generated SQL query, and every Snowflake response becomes a traceable, auditable, improvable event in your system.

MCP + Snowflake is one of the most immediately practical AI integrations available today. Your entire data team can start exploring data warehouse contents conversationally — no SQL required — while you maintain full security control over what the AI can and cannot do.

About the Author

anjijava16 builds MCP integrations for databases, AI agents, and data infrastructure. Follow on HackerNoon for more deep dives in the MCP Deep Dives series.

Source code: github.com/anjijava16/mcp_servers — contributions and stars welcome.