<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tidding Ramsey</title>
    <description>The latest articles on Forem by Tidding Ramsey (@tidding).</description>
    <link>https://forem.com/tidding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1451266%2F0e52ea0e-d12c-4765-9838-4ddcc1b65f4e.png</url>
      <title>Forem: Tidding Ramsey</title>
      <link>https://forem.com/tidding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tidding"/>
    <language>en</language>
    <item>
      <title>How I Evaluated an AI Model on AWS Without Writing a Single Line of Training Code</title>
      <dc:creator>Tidding Ramsey</dc:creator>
      <pubDate>Sat, 09 May 2026 13:26:57 +0000</pubDate>
      <link>https://forem.com/tidding/how-i-evaluated-an-ai-model-on-aws-without-writing-a-single-line-of-training-code-20o9</link>
      <guid>https://forem.com/tidding/how-i-evaluated-an-ai-model-on-aws-without-writing-a-single-line-of-training-code-20o9</guid>
      <description>&lt;p&gt;A step-by-step guide to Amazon Bedrock's model evaluation feature  from S3 setup to reading real results&lt;/p&gt;

&lt;p&gt;Ever wondered whether the AI model you're about to plug into your production system actually &lt;strong&gt;knows what it's doing&lt;/strong&gt;? Me too. That's exactly what Amazon Bedrock's &lt;strong&gt;model evaluation&lt;/strong&gt; feature is built for, and after running through it myself, I'm genuinely impressed by how accessible it is.&lt;/p&gt;

&lt;p&gt;No PhD. No GPU clusters. No tears. Just AWS, an S3 bucket, and a few JSON prompts.&lt;/p&gt;

&lt;p&gt;Let's walk through the whole thing, start to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Even Is Amazon Bedrock?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Bedrock is AWS's managed Generative AI service. Instead of spending months training, hosting, and scaling foundation models yourself, Bedrock lets you call them like an API. Think of it as the "serverless" moment for AI — the infrastructure complexity disappears and you focus on what actually matters: using the models.&lt;/p&gt;

&lt;p&gt;One of its most underrated features is &lt;strong&gt;model evaluation&lt;/strong&gt;: a way to run a model against a set of prompts, compare its responses to expected answers, and get a performance score back. It's perfect for building confidence before you commit a model to your workflow.&lt;/p&gt;

&lt;p&gt;Here's what we're going to build today:&lt;/p&gt;

&lt;p&gt;Prompt Dataset (S3) → Bedrock Evaluation Job → Results (S3) → Insights&lt;/p&gt;

&lt;p&gt;The before architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfueqy2mglh4zzbm8qbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfueqy2mglh4zzbm8qbf.png" alt=" " width="563" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The after architecture&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxkkaqjxaa2m99hg1tpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxkkaqjxaa2m99hg1tpi.png" alt=" " width="538" height="456"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Log In and Orient Yourself
&lt;/h2&gt;

&lt;p&gt;Head to the &lt;a href="https://aws.amazon.com/console/" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt; and sign in. Make sure you're in the &lt;strong&gt;US West (Oregon) / us-west-2&lt;/strong&gt; region; model evaluation support varies by region, and you want to be where the models live.&lt;/p&gt;

&lt;p&gt;Once you're in, you'll use two core services today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; — to store your prompt dataset and receive evaluation results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; — to run the actual evaluation job&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 2: Create Your S3 Buckets
&lt;/h2&gt;

&lt;p&gt;You need two S3 buckets — one to hold your prompt dataset, another to receive evaluation results. Let's create both from scratch.&lt;br&gt;
In the AWS search bar, type S3 and open the service. Click Create bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu3nni706yu9jxs4em73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu3nni706yu9jxs4em73.png" alt=" " width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bucket 1: The Prompt Dataset Bucket&lt;br&gt;
This bucket holds the questions you'll throw at the model.&lt;br&gt;
Click Create bucket.&lt;br&gt;
Give it a name, something like bedrock-prompt-dataset-yourname-2026. S3 bucket names must be globally unique across all of AWS, so add something personal or random to the end.&lt;br&gt;
Make sure the region is set to us-west-2 (Oregon).&lt;br&gt;
Leave everything else as default and keep Block all public access enabled.&lt;br&gt;
Click Create bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bucket 2: The Output Bucket&lt;br&gt;
This is where Bedrock will write the evaluation results.&lt;br&gt;
Click Create bucket again.&lt;br&gt;
Name it something like bedrock-eval-output-yourname-2026.&lt;br&gt;
Same region: us-west-2.&lt;br&gt;
Leave defaults, click Create bucket.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
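&lt;p&gt;If you prefer scripting this setup, the same two buckets can be created with boto3. Here's a minimal sketch: the bucket names are the placeholders used above, and &lt;code&gt;create_bucket_request&lt;/code&gt; is a hypothetical helper of mine (not part of boto3) that just builds the request shape &lt;code&gt;s3.create_bucket&lt;/code&gt; expects.&lt;/p&gt;

```python
# Sketch: the request shape for boto3's s3.create_bucket.
# create_bucket_request is a hypothetical helper, not a boto3 API.
def create_bucket_request(name, region="us-west-2"):
    return {
        "Bucket": name,
        # Outside us-east-1, S3 requires the region spelled out here.
        "CreateBucketConfiguration": {"LocationConstraint": region},
    }

# With boto3 you'd apply it to both buckets:
#   import boto3
#   s3 = boto3.client("s3", region_name="us-west-2")
#   for name in ("bedrock-prompt-dataset-yourname-2026",
#                "bedrock-eval-output-yourname-2026"):
#       s3.create_bucket(**create_bucket_request(name))
```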

&lt;p&gt;You should now see both buckets in your S3 console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Your Prompt Dataset&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What's in the Prompt Dataset?
&lt;/h3&gt;

&lt;p&gt;The prompt dataset is a &lt;code&gt;.jsonl&lt;/code&gt; file (one JSON object per line) where each object has three fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chemistry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The tallest mountain in the world is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Geography"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mount Everest"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The author of 'Great Expectations' is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Literature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charles Dickens"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjvjscotsg0zi4ncjqvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjvjscotsg0zi4ncjqvb.png" alt=" " width="800" height="61"&gt;&lt;/a&gt;&lt;br&gt;
Notice the structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;prompt&lt;/code&gt;&lt;/strong&gt; — what you'll send to the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;referenceResponse&lt;/code&gt;&lt;/strong&gt; — the ground truth you're checking against&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;category&lt;/code&gt;&lt;/strong&gt; — for grouping results later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, you'd replace these general-knowledge questions with prompts that mirror your real use case. Customer support queries. Code generation tasks. Medical summaries. Whatever you're building for.&lt;/p&gt;
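&lt;p&gt;A quick way to build and sanity-check that file before uploading it is a short script. This is a sketch using only the standard library; the filename &lt;code&gt;prompt_dataset.jsonl&lt;/code&gt; and the helper names are my own choices, not anything Bedrock mandates.&lt;/p&gt;

```python
import json

# Sample prompts mirroring the dataset above; swap in your own use case.
records = [
    {"prompt": "The chemical symbol for gold is",
     "category": "Chemistry", "referenceResponse": "Au"},
    {"prompt": "The tallest mountain in the world is",
     "category": "Geography", "referenceResponse": "Mount Everest"},
    {"prompt": "The author of 'Great Expectations' is",
     "category": "Literature", "referenceResponse": "Charles Dickens"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line -- the .jsonl shape Bedrock reads."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def validate_jsonl(path):
    """Check every line parses and carries the three expected fields."""
    required = {"prompt", "category", "referenceResponse"}
    with open(path) as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            missing = required - row.keys()
            if missing:
                raise ValueError(f"line {i} missing fields: {missing}")
    return True

write_jsonl("prompt_dataset.jsonl", records)
print(validate_jsonl("prompt_dataset.jsonl"))  # True if the file is well-formed
```

&lt;p&gt;Once it validates, upload the file to your prompt dataset bucket through the console (or &lt;code&gt;aws s3 cp&lt;/code&gt;).&lt;/p&gt;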

&lt;p&gt;&lt;strong&gt;Add a CORS Configuration to the Dataset Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bedrock needs cross-origin access to read from your S3 bucket. Here's how to enable it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click into your &lt;strong&gt;prompt dataset bucket&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Permissions&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Scroll to &lt;strong&gt;Cross-origin resource sharing (CORS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Edit&lt;/strong&gt; and paste this config:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedMethods"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PUT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DELETE"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedOrigins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ExposeHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Access-Control-Allow-Origin"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcc7amybi2n5jdeudjf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcc7amybi2n5jdeudjf7.png" alt=" " width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Save changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. This tells S3: "Yes, Amazon Bedrock is allowed to read from me." Without this, the evaluation job will fail silently — so don't skip it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; In a real production setup, you'd tighten the &lt;code&gt;AllowedOrigins&lt;/code&gt; to specific Bedrock endpoints rather than using &lt;code&gt;"*"&lt;/code&gt;. For now, this gets us moving.&lt;/p&gt;
&lt;/blockquote&gt;
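&lt;p&gt;If you're scripting the setup, the same CORS rules can be applied with boto3's real &lt;code&gt;put_bucket_cors&lt;/code&gt; call. The helper below is a hypothetical convenience of mine that builds the configuration dict in the shape that call expects.&lt;/p&gt;

```python
# Sketch: build the CORS configuration above as a dict, in the shape
# boto3's s3.put_bucket_cors expects. build_cors_config is my helper.
def build_cors_config(allowed_origins=("*",)):
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": list(allowed_origins),
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    }

# With boto3 you'd then apply it:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_cors(Bucket="bedrock-prompt-dataset-yourname-2026",
#                      CORSConfiguration=build_cors_config())
```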
&lt;h2&gt;
  
  
  Step 3: Create the Model Evaluation Job
&lt;/h2&gt;

&lt;p&gt;Back to the AWS search bar — type &lt;strong&gt;Bedrock&lt;/strong&gt; and open Amazon Bedrock.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufadbwfyr11nmsivy1p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufadbwfyr11nmsivy1p1.png" alt=" " width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Expand the left-hand menu (hamburger icon, top-left)&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Assess&lt;/strong&gt;, click &lt;strong&gt;Evaluations&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiz0s0dsnqkf0oa2wopj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiz0s0dsnqkf0oa2wopj.png" alt=" " width="800" height="791"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Create Automatic: Programmatic&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rbmortepmmw84wqmc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rbmortepmmw84wqmc4.png" alt=" " width="793" height="468"&gt;&lt;/a&gt;&lt;br&gt;
Now fill in the job configuration:&lt;/p&gt;
&lt;h2&gt;
  
  
  Model Evaluation Details
&lt;/h2&gt;

&lt;p&gt;Evaluation name: Something unique like &lt;code&gt;my-eval-job-abc123&lt;/code&gt; &lt;br&gt;
Model provider: Amazon&lt;br&gt;
Model: &lt;strong&gt;Nova Micro&lt;/strong&gt; &lt;br&gt;
Task type: Question and answer &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7nzpoh2l4aikgx240ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7nzpoh2l4aikgx240ok.png" alt=" " width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;Click &lt;strong&gt;Remove&lt;/strong&gt; on any extra metrics until only one remains. Set it to &lt;strong&gt;Accuracy&lt;/strong&gt;. This metric compares the model's response against your &lt;code&gt;referenceResponse&lt;/code&gt; and returns a score.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prompt Dataset
&lt;/h2&gt;

&lt;p&gt;Select &lt;strong&gt;Use your own prompt dataset&lt;/strong&gt; and enter your S3 path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://your-prompt-dataset-bucket-name/prompt_dataset.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation Results
&lt;/h2&gt;

&lt;p&gt;Point this to your output bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://your-output-bucket-name/evaluation-results/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  IAM Role
&lt;/h2&gt;

&lt;p&gt;Bedrock needs a role to access your S3 buckets on your behalf. Let's create one real quick.&lt;br&gt;
In a new browser tab, go to IAM → Roles → Create role and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Trusted entity&lt;br&gt;
Select AWS service, then under Use case search for and select Bedrock. Click Next.&lt;/li&gt;
&lt;li&gt;Permissions&lt;br&gt;
Attach these two policies:&lt;br&gt;
&lt;em&gt;AmazonBedrockFullAccess&lt;/em&gt;&lt;br&gt;
&lt;em&gt;AmazonS3FullAccess&lt;/em&gt;&lt;br&gt;
Click Next.&lt;/li&gt;
&lt;li&gt;Name and create&lt;br&gt;
Name the role something like bedrock-eval-role, then click Create role.&lt;br&gt;
Back on the Bedrock evaluation page, under Amazon Bedrock IAM role, select Use an existing role, click the dropdown, and pick bedrock-eval-role.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In production you'd scope the S3 policy down to only your two specific buckets — but for getting started, AmazonS3FullAccess does the job.&lt;/p&gt;
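&lt;p&gt;For reference, a scoped-down inline policy might look something like this. It's a sketch: the bucket names are the placeholders used above, and you may need additional actions depending on your setup.&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::bedrock-prompt-dataset-yourname-2026",
        "arn:aws:s3:::bedrock-prompt-dataset-yourname-2026/*",
        "arn:aws:s3:::bedrock-eval-output-yourname-2026",
        "arn:aws:s3:::bedrock-eval-output-yourname-2026/*"
      ]
    }
  ]
}
```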

&lt;p&gt;Click &lt;strong&gt;Create&lt;/strong&gt;, and watch the job appear with status &lt;strong&gt;In progress&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc0b8qg0euhhxadrcbt3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc0b8qg0euhhxadrcbt3.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Read the Results (This Is the Fun Part)
&lt;/h2&gt;

&lt;p&gt;Once the job completes, head back to S3 and look inside your output bucket under &lt;code&gt;evaluation-results/&lt;/code&gt;. You'll find a &lt;code&gt;.jsonl&lt;/code&gt; file with one result per prompt. Here's what the raw output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"automatedEvaluationResult"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"metricName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Builtin.Accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0625&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inputRecord"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chemistry"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modelResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is Au."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"modelIdentifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us.amazon.nova-micro-v1:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"end_turn"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy87zfii6ntxmtyxritx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy87zfii6ntxmtyxritx2.png" alt=" " width="800" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking Down the Accuracy Scores
&lt;/h3&gt;

&lt;p&gt;Looking at the three prompts from our run, the Chemistry question ("The chemical symbol for gold is") scored 0.0625, the Geography question ("The tallest mountain in the world is") came in slightly higher at 0.0870, and the Literature question ("The author of 'Great Expectations' is") landed at 0.0727. All three were answered correctly (Au, Mount Everest, and Charles Dickens respectively), yet the scores are nowhere near 1.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, These Scores Look Low. Is That Bad?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. The accuracy scores seem low because the scoring algorithm is doing &lt;strong&gt;token-level matching&lt;/strong&gt; between the model's verbose answer and the short reference response. &lt;/p&gt;

&lt;p&gt;The model answered correctly in all three cases: it said "Au", "Mount Everest", and "Charles Dickens". But it also said a &lt;em&gt;lot of other things&lt;/em&gt; (it explained its reasoning step-by-step). Those extra tokens pulled the accuracy score down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a critical lesson:&lt;/strong&gt; how you write your prompts and reference responses &lt;em&gt;dramatically&lt;/em&gt; affects your scores. If you want higher accuracy scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Instead&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;reference&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Try&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;instructing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;answer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;concisely&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;your&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prompt:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Answer in one word only. The chemical symbol for gold is:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the value of model evaluation — it surfaces these kinds of nuances before you go to production.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that you understand the pipeline, here's how to level it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Swap models&lt;/strong&gt; — Run the same dataset against &lt;code&gt;Nova Lite&lt;/code&gt;, &lt;code&gt;Nova Pro&lt;/code&gt;, or even Claude models to compare them head-to-head&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use real prompts&lt;/strong&gt; — Replace the sample dataset with 50-100 prompts from your actual use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate&lt;/strong&gt; — Trigger evaluation jobs via the AWS CLI or SDK as part of your CI/CD pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track over time&lt;/strong&gt; — Save scores to a database and chart model performance as you update prompts or switch models&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;Here's the full flow in one breath:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Upload a &lt;code&gt;.jsonl&lt;/code&gt; prompt dataset to S3&lt;/li&gt;
&lt;li&gt; Add a CORS config to the S3 bucket so Bedrock can read it&lt;/li&gt;
&lt;li&gt; Create a Bedrock model evaluation job pointing at Nova Micro&lt;/li&gt;
&lt;li&gt; Wait for it to run, then read the &lt;code&gt;.jsonl&lt;/code&gt; results in your output bucket&lt;/li&gt;
&lt;li&gt; Interpret the accuracy scores in context — verbose model answers score lower even when correct&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Amazon Bedrock's model evaluation feature removes one of the biggest unknowns in AI integration: &lt;em&gt;"Can this model actually answer my questions reliably?"&lt;/em&gt; Now you have a repeatable, automated answer.&lt;/p&gt;

&lt;p&gt;Go build something with confidence. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions or want to share your evaluation results? Drop them in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
