<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tidding Ramsey</title>
    <description>The latest articles on Forem by Tidding Ramsey (@tidding).</description>
    <link>https://forem.com/tidding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1451266%2F0e52ea0e-d12c-4765-9838-4ddcc1b65f4e.png</url>
      <title>Forem: Tidding Ramsey</title>
      <link>https://forem.com/tidding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tidding"/>
    <language>en</language>
    <item>
      <title>How I Evaluated an AI Model on AWS Without Writing a Single Line of Training Code</title>
      <dc:creator>Tidding Ramsey</dc:creator>
      <pubDate>Sat, 09 May 2026 13:26:57 +0000</pubDate>
      <link>https://forem.com/tidding/how-i-evaluated-an-ai-model-on-aws-without-writing-a-single-line-of-training-code-20o9</link>
      <guid>https://forem.com/tidding/how-i-evaluated-an-ai-model-on-aws-without-writing-a-single-line-of-training-code-20o9</guid>
      <description>&lt;p&gt;A step-by-step guide to Amazon Bedrock's model evaluation feature  from S3 setup to reading real results&lt;/p&gt;

&lt;p&gt;Ever wondered whether the AI model you're about to plug into your production system actually &lt;strong&gt;knows what it's doing&lt;/strong&gt;? Me too. That's exactly what Amazon Bedrock's &lt;strong&gt;model evaluation&lt;/strong&gt; feature is built for, and after running through it myself, I'm genuinely impressed by how accessible it is.&lt;/p&gt;

&lt;p&gt;No PhD. No GPU clusters. No tears. Just AWS, an S3 bucket, and a few JSON prompts.&lt;/p&gt;

&lt;p&gt;Let's walk through the whole thing, start to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Even Is Amazon Bedrock?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Bedrock is AWS's managed Generative AI service. Instead of spending months training, hosting, and scaling foundation models yourself, Bedrock lets you call them like an API. Think of it as the "serverless" moment for AI — the infrastructure complexity disappears and you focus on what actually matters: using the models.&lt;/p&gt;

&lt;p&gt;One of its most underrated features is &lt;strong&gt;model evaluation&lt;/strong&gt;: a way to run a model against a set of prompts, compare its responses to expected answers, and get a performance score back. It's perfect for building confidence before you commit a model to your workflow.&lt;/p&gt;

&lt;p&gt;Here's what we're going to build today:&lt;/p&gt;

&lt;p&gt;Prompt Dataset (S3) → Bedrock Evaluation Job → Results (S3) → Insights&lt;/p&gt;

&lt;p&gt;The before architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfueqy2mglh4zzbm8qbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfueqy2mglh4zzbm8qbf.png" alt=" " width="563" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The after architecture&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxkkaqjxaa2m99hg1tpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxkkaqjxaa2m99hg1tpi.png" alt=" " width="538" height="456"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Log In and Orient Yourself
&lt;/h2&gt;

&lt;p&gt;Head to the &lt;a href="https://aws.amazon.com/console/" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt; and sign in. Make sure you're in the &lt;strong&gt;US West (Oregon) / us-west-2&lt;/strong&gt; region; model evaluation support varies by region, and you want to be where the models live.&lt;/p&gt;

&lt;p&gt;Once you're in, you'll use two core services today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; — to store your prompt dataset and receive evaluation results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; — to run the actual evaluation job&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 2: Create Your S3 Buckets
&lt;/h2&gt;

&lt;p&gt;You need two S3 buckets — one to hold your prompt dataset, another to receive evaluation results. Let's create both from scratch.&lt;br&gt;
In the AWS search bar, type S3 and open the service. Click Create bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu3nni706yu9jxs4em73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu3nni706yu9jxs4em73.png" alt=" " width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bucket 1: The Prompt Dataset Bucket&lt;br&gt;
This bucket holds the questions you'll throw at the model.&lt;br&gt;
Click Create bucket.&lt;br&gt;
Give it a name, something like bedrock-prompt-dataset-yourname-2026. S3 bucket names must be globally unique across all of AWS, so add something personal or random to the end.&lt;br&gt;
Make sure the region is set to us-west-2 (Oregon).&lt;br&gt;
Leave everything else as default and keep Block all public access enabled.&lt;br&gt;
Click Create bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bucket 2: The Output Bucket&lt;br&gt;
This is where Bedrock will write the evaluation results.&lt;br&gt;
Click Create bucket again.&lt;br&gt;
Name it something like bedrock-eval-output-yourname-2026.&lt;br&gt;
Same region: us-west-2.&lt;br&gt;
Leave defaults, click Create bucket.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
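&lt;p&gt;If you prefer scripting this setup, the same two buckets can be created with boto3. Here's a minimal sketch: the bucket names are the placeholders used above, and &lt;code&gt;create_bucket_request&lt;/code&gt; is a hypothetical helper of mine (not part of boto3) that just builds the request shape &lt;code&gt;s3.create_bucket&lt;/code&gt; expects.&lt;/p&gt;

```python
# Sketch: the request shape for boto3's s3.create_bucket.
# create_bucket_request is a hypothetical helper, not a boto3 API.
def create_bucket_request(name, region="us-west-2"):
    return {
        "Bucket": name,
        # Outside us-east-1, S3 requires the region spelled out here.
        "CreateBucketConfiguration": {"LocationConstraint": region},
    }

# With boto3 you'd apply it to both buckets:
#   import boto3
#   s3 = boto3.client("s3", region_name="us-west-2")
#   for name in ("bedrock-prompt-dataset-yourname-2026",
#                "bedrock-eval-output-yourname-2026"):
#       s3.create_bucket(**create_bucket_request(name))
```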

&lt;p&gt;You should now see both buckets in your S3 console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Your Prompt Dataset&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What's in the Prompt Dataset?
&lt;/h3&gt;

&lt;p&gt;The prompt dataset is a &lt;code&gt;.jsonl&lt;/code&gt; file (one JSON object per line) where each object has three fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chemistry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The tallest mountain in the world is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Geography"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mount Everest"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The author of 'Great Expectations' is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Literature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charles Dickens"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjvjscotsg0zi4ncjqvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjvjscotsg0zi4ncjqvb.png" alt=" " width="800" height="61"&gt;&lt;/a&gt;&lt;br&gt;
Notice the structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;prompt&lt;/code&gt;&lt;/strong&gt; — what you'll send to the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;referenceResponse&lt;/code&gt;&lt;/strong&gt; — the ground truth you're checking against&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;category&lt;/code&gt;&lt;/strong&gt; — for grouping results later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, you'd replace these general-knowledge questions with prompts that mirror your real use case. Customer support queries. Code generation tasks. Medical summaries. Whatever you're building for.&lt;/p&gt;
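&lt;p&gt;A quick way to build and sanity-check that file before uploading it is a short script. This is a sketch using only the standard library; the filename &lt;code&gt;prompt_dataset.jsonl&lt;/code&gt; and the helper names are my own choices, not anything Bedrock mandates.&lt;/p&gt;

```python
import json

# Sample prompts mirroring the dataset above; swap in your own use case.
records = [
    {"prompt": "The chemical symbol for gold is",
     "category": "Chemistry", "referenceResponse": "Au"},
    {"prompt": "The tallest mountain in the world is",
     "category": "Geography", "referenceResponse": "Mount Everest"},
    {"prompt": "The author of 'Great Expectations' is",
     "category": "Literature", "referenceResponse": "Charles Dickens"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line -- the .jsonl shape Bedrock reads."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def validate_jsonl(path):
    """Check every line parses and carries the three expected fields."""
    required = {"prompt", "category", "referenceResponse"}
    with open(path) as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            missing = required - row.keys()
            if missing:
                raise ValueError(f"line {i} missing fields: {missing}")
    return True

write_jsonl("prompt_dataset.jsonl", records)
print(validate_jsonl("prompt_dataset.jsonl"))  # True if the file is well-formed
```

&lt;p&gt;Once it validates, upload the file to your prompt dataset bucket through the console (or &lt;code&gt;aws s3 cp&lt;/code&gt;).&lt;/p&gt;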

&lt;p&gt;&lt;strong&gt;Add a CORS Configuration to the Dataset Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bedrock needs cross-origin access to read from your S3 bucket. Here's how to enable it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click into your &lt;strong&gt;prompt dataset bucket&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Permissions&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Scroll to &lt;strong&gt;Cross-origin resource sharing (CORS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Edit&lt;/strong&gt; and paste this config:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedMethods"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PUT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DELETE"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AllowedOrigins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ExposeHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Access-Control-Allow-Origin"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcc7amybi2n5jdeudjf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcc7amybi2n5jdeudjf7.png" alt=" " width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Save changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. This tells S3: "Yes, Amazon Bedrock is allowed to read from me." Without this, the evaluation job will fail silently — so don't skip it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; In a real production setup, you'd tighten the &lt;code&gt;AllowedOrigins&lt;/code&gt; to specific Bedrock endpoints rather than using &lt;code&gt;"*"&lt;/code&gt;. For now, this gets us moving.&lt;/p&gt;
&lt;/blockquote&gt;
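&lt;p&gt;If you're scripting the setup, the same CORS rules can be applied with boto3's real &lt;code&gt;put_bucket_cors&lt;/code&gt; call. The helper below is a hypothetical convenience of mine that builds the configuration dict in the shape that call expects.&lt;/p&gt;

```python
# Sketch: build the CORS configuration above as a dict, in the shape
# boto3's s3.put_bucket_cors expects. build_cors_config is my helper.
def build_cors_config(allowed_origins=("*",)):
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": list(allowed_origins),
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    }

# With boto3 you'd then apply it:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_cors(Bucket="bedrock-prompt-dataset-yourname-2026",
#                      CORSConfiguration=build_cors_config())
```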
&lt;h2&gt;
  
  
  Step 3: Create the Model Evaluation Job
&lt;/h2&gt;

&lt;p&gt;Back to the AWS search bar — type &lt;strong&gt;Bedrock&lt;/strong&gt; and open Amazon Bedrock.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufadbwfyr11nmsivy1p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufadbwfyr11nmsivy1p1.png" alt=" " width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Expand the left-hand menu (hamburger icon, top-left)&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Assess&lt;/strong&gt;, click &lt;strong&gt;Evaluations&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiz0s0dsnqkf0oa2wopj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiz0s0dsnqkf0oa2wopj.png" alt=" " width="800" height="791"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Create Automatic: Programmatic&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rbmortepmmw84wqmc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rbmortepmmw84wqmc4.png" alt=" " width="793" height="468"&gt;&lt;/a&gt;&lt;br&gt;
Now fill in the job configuration:&lt;/p&gt;
&lt;h2&gt;
  
  
  Model Evaluation Details
&lt;/h2&gt;

&lt;p&gt;Evaluation name: Something unique like &lt;code&gt;my-eval-job-abc123&lt;/code&gt; &lt;br&gt;
Model provider: Amazon&lt;br&gt;
Model: &lt;strong&gt;Nova Micro&lt;/strong&gt; &lt;br&gt;
Task type: Question and answer &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7nzpoh2l4aikgx240ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7nzpoh2l4aikgx240ok.png" alt=" " width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;Click &lt;strong&gt;Remove&lt;/strong&gt; on any extra metrics until only one remains. Set it to &lt;strong&gt;Accuracy&lt;/strong&gt;. This metric compares the model's response against your &lt;code&gt;referenceResponse&lt;/code&gt; and returns a score.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prompt Dataset
&lt;/h2&gt;

&lt;p&gt;Select &lt;strong&gt;Use your own prompt dataset&lt;/strong&gt; and enter your S3 path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://your-prompt-dataset-bucket-name/prompt_dataset.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation Results
&lt;/h2&gt;

&lt;p&gt;Point this to your output bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://your-output-bucket-name/evaluation-results/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  IAM Role
&lt;/h2&gt;

&lt;p&gt;Bedrock needs a role to access your S3 buckets on your behalf. Let's create one real quick.&lt;br&gt;
In a new browser tab, go to IAM → Roles → Create role and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Trusted entity&lt;br&gt;
Select AWS service, then under Use case search for and select Bedrock. Click Next.&lt;/li&gt;
&lt;li&gt;Permissions&lt;br&gt;
Attach these two policies:&lt;br&gt;
&lt;em&gt;AmazonBedrockFullAccess&lt;/em&gt;&lt;br&gt;
&lt;em&gt;AmazonS3FullAccess&lt;/em&gt;&lt;br&gt;
Click Next.&lt;/li&gt;
&lt;li&gt;Name and create&lt;br&gt;
Name the role something like bedrock-eval-role, then click Create role.&lt;br&gt;
Back on the Bedrock evaluation page, under Amazon Bedrock IAM role, select Use an existing role, click the dropdown, and pick bedrock-eval-role.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In production you'd scope the S3 policy down to only your two specific buckets — but for getting started, AmazonS3FullAccess does the job.&lt;/p&gt;
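&lt;p&gt;For reference, a scoped-down inline policy might look something like this. It's a sketch: the bucket names are the placeholders used above, and you may need additional actions depending on your setup.&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::bedrock-prompt-dataset-yourname-2026",
        "arn:aws:s3:::bedrock-prompt-dataset-yourname-2026/*",
        "arn:aws:s3:::bedrock-eval-output-yourname-2026",
        "arn:aws:s3:::bedrock-eval-output-yourname-2026/*"
      ]
    }
  ]
}
```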

&lt;p&gt;Click &lt;strong&gt;Create&lt;/strong&gt;, and watch the job appear with status &lt;strong&gt;In progress&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc0b8qg0euhhxadrcbt3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc0b8qg0euhhxadrcbt3.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Read the Results (This Is the Fun Part)
&lt;/h2&gt;

&lt;p&gt;Once the job completes, head back to S3 and look inside your output bucket under &lt;code&gt;evaluation-results/&lt;/code&gt;. You'll find a &lt;code&gt;.jsonl&lt;/code&gt; file with one result per prompt. Here's what the raw output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"automatedEvaluationResult"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"metricName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Builtin.Accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0625&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inputRecord"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chemistry"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modelResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The chemical symbol for gold is Au."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"modelIdentifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us.amazon.nova-micro-v1:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"end_turn"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy87zfii6ntxmtyxritx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy87zfii6ntxmtyxritx2.png" alt=" " width="800" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking Down the Accuracy Scores
&lt;/h3&gt;

&lt;p&gt;Looking at the three prompts from our run, the Chemistry question ("The chemical symbol for gold is") scored 0.0625, the Geography question ("The tallest mountain in the world is") came in slightly higher at 0.0870, and the Literature question ("The author of 'Great Expectations' is") landed at 0.0727. All three were answered correctly (Au, Mount Everest, and Charles Dickens respectively), yet the scores are nowhere near 1.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, These Scores Look Low. Is That Bad?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. The accuracy scores seem low because the scoring algorithm is doing &lt;strong&gt;token-level matching&lt;/strong&gt; between the model's verbose answer and the short reference response. &lt;/p&gt;

&lt;p&gt;The model answered correctly in all three cases: it said "Au", "Mount Everest", and "Charles Dickens". But it also said a &lt;em&gt;lot of other things&lt;/em&gt; (it explained its reasoning step-by-step). Those extra tokens pulled the accuracy score down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a critical lesson:&lt;/strong&gt; how you write your prompts and reference responses &lt;em&gt;dramatically&lt;/em&gt; affects your scores. If you want higher accuracy scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Instead&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;reference&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Try&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;instructing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;answer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;concisely&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;your&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prompt:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Answer in one word only. The chemical symbol for gold is:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"referenceResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Au"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the value of model evaluation — it surfaces these kinds of nuances before you go to production.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that you understand the pipeline, here's how to level it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Swap models&lt;/strong&gt; — Run the same dataset against &lt;code&gt;Nova Lite&lt;/code&gt;, &lt;code&gt;Nova Pro&lt;/code&gt;, or even Claude models to compare them head-to-head&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use real prompts&lt;/strong&gt; — Replace the sample dataset with 50-100 prompts from your actual use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate&lt;/strong&gt; — Trigger evaluation jobs via the AWS CLI or SDK as part of your CI/CD pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track over time&lt;/strong&gt; — Save scores to a database and chart model performance as you update prompts or switch models&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;Here's the full flow in one breath:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Upload a &lt;code&gt;.jsonl&lt;/code&gt; prompt dataset to S3&lt;/li&gt;
&lt;li&gt; Add a CORS config to the S3 bucket so Bedrock can read it&lt;/li&gt;
&lt;li&gt; Create a Bedrock model evaluation job pointing at Nova Micro&lt;/li&gt;
&lt;li&gt; Wait for it to run, then read the &lt;code&gt;.jsonl&lt;/code&gt; results in your output bucket&lt;/li&gt;
&lt;li&gt; Interpret the accuracy scores in context — verbose model answers score lower even when correct&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Amazon Bedrock's model evaluation feature removes one of the biggest unknowns in AI integration: &lt;em&gt;"Can this model actually answer my questions reliably?"&lt;/em&gt; Now you have a repeatable, automated answer.&lt;/p&gt;

&lt;p&gt;Go build something with confidence. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions or want to share your evaluation results? Drop them in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
