Elevating LLM Evaluation with DeepEval: Now with Native Amazon Bedrock Support

Mohamad Albaker Kawtharani

As Large Language Models (LLMs) move from research labs to production environments, robust evaluation becomes critical. Whether you're building Retrieval-Augmented Generation (RAG) systems, deploying agentic workflows, or integrating LLMs into enterprise products, you need confidence in your model outputs.

What is DeepEval?

DeepEval is an open-source framework designed specifically for comprehensive LLM evaluation across diverse use cases. It provides:

  • Customizable metrics tailored to specific evaluation needs
  • Evaluation pipelines for systematic testing
  • Context-aware validation for retrieval-based, conversational, and workflow applications

This makes DeepEval ideal for developers and data scientists who need to validate model quality beyond basic accuracy metrics.
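
To make that concrete, here is a minimal sketch of a custom, criteria-based metric using DeepEval's GEval (an LLM-as-a-judge metric). The metric name, criteria, and test data below are illustrative; by default DeepEval evaluates with its built-in judge model unless you pass model=..., which is exactly where the Bedrock support described next comes in.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom, criteria-based metric: does the answer factually address the question?
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output factually answers the input question.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    # model=...  # optionally pass your own judge model, e.g. a Bedrock model (see below)
)

test_case = LLMTestCase(
    input="What is Amazon Bedrock?",
    actual_output="Amazon Bedrock is a managed AWS service that provides API access to foundation models.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)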

🔥 What's New: Amazon Bedrock Native Support

We're excited to announce that DeepEval now fully supports Amazon Bedrock models, including Claude, Titan, and the complete Bedrock model lineup. This integration enables enterprise teams to:

  • Maintain data sovereignty by keeping sensitive evaluation data within your AWS environment
  • Leverage existing AWS infrastructure with seamless workflow integration
  • Scale evaluations using Bedrock's managed infrastructure and compliance features

Check out Pull Request #1426 for implementation details on this important integration.

🌍 Open to the Community

LLM evaluation remains an evolving challenge that benefits from diverse perspectives. We're actively seeking contributions in:

  • Model support expansion beyond current providers and architectures
  • Enhanced evaluation metrics for factual consistency, coherence, and bias detection
  • New use case coverage including code generation, summarization, and multilingual tasks

Join data scientists, ML engineers, and researchers in making LLM evaluation more robust and accessible.

Get involved: https://github.com/zeroandoneme/deepeval/

🚀 Getting Started: Using DeepEval with Amazon Bedrock

Here's how to evaluate your Bedrock models using DeepEval:

  • Install DeepEval
pip install deepeval
  • Set Up AWS Credentials. Make sure your AWS credentials are configured (via environment variables, ~/.aws/credentials, or an IAM role) so that DeepEval can access Amazon Bedrock.
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
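
Before running an evaluation, you can optionally verify that these credentials and the chosen region can actually reach Bedrock. A quick sanity check with boto3 (this assumes boto3 is installed and your IAM identity is allowed to call bedrock:ListFoundationModels):

import boto3

# List the foundation models visible to this account/region as a connectivity check.
bedrock = boto3.client("bedrock", region_name="us-east-1")
summaries = bedrock.list_foundation_models()["modelSummaries"]
print(f"Bedrock reachable: {len(summaries)} foundation models available")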
  • Run an Evaluation: Get the Model Output
from deepeval.models.llms.bedrock_model import BedrockModel

# Initialize the Bedrock model (e.g., Claude)
model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    region="us-east-1"
)

# Define your input prompt
prompt = "Summarize the following text: Anthropic Claude 3.7 Sonnet is the first Claude model to offer step-by-step reasoning, which Anthropic has termed "extended thinking". With Claude 3.7 Sonnet, use of step-by-step reasoning is optional. You can choose between standard thinking and extended thinking for advanced reasoning. Along with extended thinking, Claude 3.7 Sonnet allows up to 128K output tokens per request (up to 64K output tokens is considered generally available, but outputs between 64K and 128K are in beta). Additionally, Anthropic has enhanced its computer use beta with support for new actions."

# Run the model
output = model.generate(prompt)

Evaluate

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

retrieval_context = [
    "Anthropic Claude 3.7 Sonnet is the first Claude model to introduce optional step-by-step reasoning, "
    "called 'extended thinking,' which users can toggle alongside standard thinking. "
    "It supports up to 128K output tokens per request (with 64K–128K currently in beta) and features an "
    "enhanced computer use beta with support for new automated actions."
]

test_case = LLMTestCase(
    input="What new reasoning feature does Claude 3.7 Sonnet introduce?",
    actual_output=output,
    context=retrieval_context
)
metric = HallucinationMetric(model=model)

# To run metric as a standalone
metric.measure(test_case)
print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
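
If you'd rather run this as part of a test suite, the same pieces plug into DeepEval's pytest integration via assert_test. Below is a minimal sketch that mirrors the generate/evaluate usage above; the file name and test function name are just examples, and you would run it with deepeval test run test_bedrock_eval.py:

# test_bedrock_eval.py (example file name)
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
from deepeval.models.llms.bedrock_model import BedrockModel

model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    region="us-east-1"
)

context = [
    "Anthropic Claude 3.7 Sonnet is the first Claude model to introduce optional "
    "step-by-step reasoning, called 'extended thinking'."
]

def test_claude_3_7_hallucination():
    test_case = LLMTestCase(
        input="What new reasoning feature does Claude 3.7 Sonnet introduce?",
        actual_output=model.generate(
            "What new reasoning feature does Claude 3.7 Sonnet introduce?"
        ),
        context=context,
    )
    # Fails the test if the hallucination score exceeds the metric's threshold.
    assert_test(test_case, [HallucinationMetric(model=model)])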

Why This Matters

As AI adoption accelerates, evaluation becomes the critical safety net ensuring reliable production systems. DeepEval bridges the gap between cutting-edge research and practical deployment, enabling organizations to ship AI solutions with confidence.

The addition of Amazon Bedrock support particularly benefits enterprise users who require secure, compliant LLM evaluation within their existing cloud infrastructure.

Join Us in Building Better AI

Whether you're evaluating fine-tuned models, creating RAG applications, or deploying conversational agents, DeepEval provides the framework to measure what matters.

We invite you to explore DeepEval, integrate it into your evaluation workflow, and contribute to its development.

Together, we can build more reliable, transparent AI.
