Forem: Sachin m

How to Build an AI-Powered Job Architecture System with crewAI and Amazon Bedrock

Sachin m — Sun, 22 Feb 2026 14:31:49 +0000

Most job descriptions are born from copy-paste. A recruiter grabs last year's JD, swaps out a few bullet points, adjusts the title, and posts it. The result: inconsistent leveling, missing skills, salary ranges pulled from gut feel, and qualifications lists that scare away half the qualified candidates.

I wanted to see what happens when you throw a team of specialized AI agents at this problem instead. Not one prompt — four agents, each focused on a different slice of job architecture, passing structured data to the next. Market research feeds into skill mapping, which feeds into competency frameworks, which all merge into a final JD that actually holds together.

By the end of this tutorial, you will have:

A single-prompt baseline that shows what a raw LLM produces for job descriptions
A 4-agent crewAI pipeline that chains market research, skill taxonomy, competency framework, and JD composition
Pydantic schemas that enforce structured, parseable output from every agent
A side-by-side comparison with real metrics — tokens, cost, latency

The whole thing runs on Amazon Bedrock with Nova Pro. Total cost for the full tutorial: under $0.15.

Prerequisites

AWS account with Bedrock access in us-east-1 (Nova Pro model enabled)
Python 3.12+
AWS credentials configured (aws configure or environment variables)
About $0.15 to run everything

Step 1 — Set Up crewAI with Amazon Bedrock

Install the dependencies:

pip install "crewai[tools]>=1.9.0" boto3 python-dotenv

Create a .env file in your project directory:

AWS_DEFAULT_REGION=us-east-1
MODEL=bedrock/amazon.nova-pro-v1:0

crewAI talks to Bedrock through litellm under the hood — the bedrock/ prefix in the model string handles the routing. No extra Bedrock SDK setup needed beyond having valid AWS credentials.

(If you're on an AISPL account like mine, stick with Nova models. Claude on Bedrock requires Marketplace billing sorted out first, and that's a separate adventure.)

Step 2 — Define the Output Schemas

This is the part that separates a demo from something you could actually plug into an ATS or HRIS. Instead of letting agents return free-form text, we define Pydantic models that force structured JSON output.

Create models.py:

"""Output schemas for the job architecture crew."""

from pydantic import BaseModel, Field


# --- Agent 1: Market Research ---

class SalaryRange(BaseModel):
    currency: str = "USD"
    min_annual: int
    max_annual: int
    median_annual: int

class MarketResearchOutput(BaseModel):
    """What the Market Research Analyst produces."""
    job_title: str
    alternative_titles: list[str] = Field(description="Common variations of this role title")
    salary_range: SalaryRange
    market_demand: str = Field(description="High / Medium / Low with brief explanation")
    industry_context: str = Field(description="Where this role is most common and why")
    typical_team_structure: str = Field(description="Who this role reports to and works alongside")
    remote_prevalence: str = Field(description="Remote/hybrid/onsite trends for this role")


# --- Agent 2: Skills Taxonomy ---

class Skill(BaseModel):
    name: str
    importance: str = Field(description="required / preferred / nice-to-have")
    context: str = Field(default="", description="Why this skill matters for the role")

class SkillTaxonomyOutput(BaseModel):
    """Structured skill breakdown from the Skills Analyst."""
    technical_skills: list[Skill]
    soft_skills: list[Skill]
    domain_skills: list[Skill] = Field(description="Industry or function-specific knowledge")
    tools_and_platforms: list[Skill] = Field(description="Specific tools, languages, or platforms")
    certifications: list[Skill] = Field(description="Relevant professional certifications")


# --- Agent 3: Competency Framework ---

class ProficiencyLevel(BaseModel):
    level: str = Field(description="e.g. Junior, Mid, Senior, Lead")
    description: str
    years_experience: str

class Competency(BaseModel):
    name: str
    description: str
    proficiency_levels: list[ProficiencyLevel]
    assessment_methods: list[str] = Field(description="How to evaluate this competency")

class CompetencyFrameworkOutput(BaseModel):
    """Competency framework from the Framework Designer."""
    role_level: str = Field(description="Target seniority for this framework")
    competencies: list[Competency]
    experience_requirements: str
    education_requirements: str


# --- Agent 4: Job Description ---

class JobDescriptionOutput(BaseModel):
    """Final structured JD from the Composer."""
    title: str
    department: str
    summary: str = Field(description="2-3 sentence role overview")
    responsibilities: list[str]
    required_qualifications: list[str]
    preferred_qualifications: list[str]
    competency_requirements: list[str] = Field(
        description="Key competencies with expected proficiency level"
    )
    salary_range: str
    benefits_highlights: list[str]
    growth_path: str = Field(description="Career progression from this role")
    dei_statement: str
    work_arrangement: str = Field(description="Remote / hybrid / onsite details")

The Field(description=...) annotations do double duty — they document the schema for us and they tell the LLM what to put in each field. crewAI serializes these descriptions into the prompt automatically.

The Skill model with its importance field is the one I keep coming back to. A flat list of "requirements" is useless for hiring. Tagging each skill as required, preferred, or nice-to-have forces the agent to make real decisions — and it gives recruiters something they can actually filter on.
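
As a rough illustration of that filtering, here is a minimal sketch that works on the plain dict you would get from calling model_dump() on a SkillTaxonomyOutput. The sample skills and the skills_by_importance helper are hypothetical and introduced only for this example; they are not output from the crew.

```python
"""Filter a skill taxonomy by importance tag (illustrative sketch)."""

# Hypothetical sample shaped like SkillTaxonomyOutput.model_dump();
# the entries are made up for illustration, not real crew output.
taxonomy = {
    "technical_skills": [
        {"name": "Python", "importance": "required", "context": ""},
        {"name": "Reinforcement Learning", "importance": "nice-to-have", "context": ""},
    ],
    "soft_skills": [
        {"name": "Communication", "importance": "required", "context": ""},
    ],
    "domain_skills": [],
    "tools_and_platforms": [
        {"name": "Apache Spark", "importance": "preferred", "context": ""},
    ],
    "certifications": [],
}


def skills_by_importance(taxonomy: dict, importance: str) -> list[str]:
    """Return skill names across all categories that carry the given tag."""
    wanted = importance.strip().lower()
    return [
        skill["name"]
        for skills in taxonomy.values()
        for skill in skills
        # importance is a free-form str in the schema, so normalize case
        if skill["importance"].strip().lower() == wanted
    ]


print(skills_by_importance(taxonomy, "required"))  # prints ['Python', 'Communication']
```

Because importance is declared as a plain str, the model could return "Required" or "REQUIRED"; the sketch normalizes case for that reason. Tightening the field to a fixed set of values in the schema would be another option.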

Step 3 — The Baseline: One Prompt, One JD

Before building the crew, we need a control. What does a single LLM call produce when you ask for a job description?

Create 01_baseline_single_prompt.py:

"""
Baseline: generate a job description with a single LLM call.
No agents, no structured output — just one prompt to Nova Pro.
"""

import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "amazon.nova-pro-v1:0"

role_title = "Senior ML Engineer"

prompt = f"""Create a complete job description for a {role_title} position.
Include: role summary, responsibilities, required qualifications,
preferred qualifications, salary range, and benefits."""

start = time.time()

resp = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.7}
)

elapsed = time.time() - start

output_text = resp["output"]["message"]["content"][0]["text"]
usage = resp["usage"]
input_tokens = usage["inputTokens"]
output_tokens = usage["outputTokens"]

# Nova Pro pricing: $0.0008/1K input, $0.0032/1K output
cost = (input_tokens / 1000 * 0.0008) + (output_tokens / 1000 * 0.0032)

print("=" * 60)
print(f"BASELINE — Single Prompt JD: {role_title}")
print("=" * 60)
print(output_text)
print("\n" + "-" * 60)
print(f"Latency:       {elapsed:.1f}s")
print(f"Input tokens:  {input_tokens}")
print(f"Output tokens: {output_tokens}")
print(f"Est. cost:     ${cost:.4f}")

Run it:

python 01_baseline_single_prompt.py

============================================================
BASELINE — Single Prompt JD: Senior ML Engineer
============================================================
### Job Description: Senior ML Engineer

#### Role Summary:
We are seeking a highly skilled and experienced Senior Machine Learning Engineer to join our dynamic team. The ideal candidate will have a strong background in machine learning, data science, and software engineering. The Senior ML Engineer will be responsible for developing, implementing, and maintaining machine learning models and algorithms to drive business solutions and improve product performance. This role requires a blend of technical expertise, leadership skills, and the ability to collaborate across cross-functional teams.

#### Responsibilities:
- **Model Development:** Design, develop, and deploy machine learning models and algorithms to solve complex business problems.
- **Data Analysis:** Perform exploratory data analysis to identify patterns, trends, and insights. Preprocess and clean data for model training.
- **Model Evaluation:** Evaluate the performance of machine learning models using appropriate metrics and techniques. Optimize models for accuracy, scalability, and efficiency.
- **Deployment:** Implement machine learning models into production environments, ensuring they are robust, scalable, and maintainable.
- **Collaboration:** Work closely with data scientists, software engineers, product managers, and other stakeholders to integrate machine learning solutions into existing systems.
- **Research:** Stay up-to-date with the latest advancements in machine learning and AI. Conduct research to identify new techniques and tools that can be applied to business problems.
- **Mentorship:** Provide guidance and mentorship to junior team members, fostering a culture of learning and innovation.
- **Documentation:** Create and maintain comprehensive documentation for machine learning models, algorithms, and processes.

#### Required Qualifications:
- **Education:** Master's or Ph.D. in Computer Science, Data Science, Statistics, or a related field.
- **Experience:** Minimum of 5 years of experience in machine learning engineering or a related role.
- **Technical Skills:**
  - Proficiency in programming languages such as Python, R, or Java.
  - Strong understanding of machine learning algorithms and techniques (e.g., supervised and unsupervised learning, deep learning).
  - Experience with machine learning frameworks and libraries (e.g., TensorFlow, PyTorch, Scikit-learn).
  - Solid understanding of data structures, algorithms, and software design principles.
  - Experience with cloud platforms (e.g., AWS, GCP, Azure) and containerization technologies (e.g., Docker, Kubernetes).
- **Data Handling:** Strong skills in data manipulation, preprocessing, and feature engineering.
- **Communication:** Excellent verbal and written communication skills.

#### Preferred Qualifications:
- **Industry Experience:** Experience in a relevant industry (e.g., finance, healthcare, e-commerce).
- **Advanced Degrees:** Ph.D. in a related field.
- **Publications:** Published research papers in machine learning or related conferences/journals.
- **Certifications:** Certifications in machine learning or data science (e.g., AWS Certified Machine Learning – Specialty).
- **Leadership Experience:** Previous experience in a leadership or mentorship role.

#### Salary Range:
- **Base Salary:** $120,000 - $160,000 per year, depending on experience and qualifications.
- **Bonus:** Performance-based annual bonus.
- **Stock Options:** Eligibility for company stock options.

#### Benefits:
- **Health Insurance:** Comprehensive medical, dental, and vision insurance plans.
- **Retirement:** 401(k) with company match.
- **Paid Time Off:** Generous paid time off, including vacation, sick leave, and holidays.
- **Professional Development:** Opportunities for continued learning and professional growth, including conference attendance and training programs.
- **Wellness Programs:** Access to wellness programs and resources.
- **Flexible Work Arrangements:** Options for remote work and flexible hours.
- **Employee Discounts:** Discounts on company products and services.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

------------------------------------------------------------
Latency:       5.8s
Input tokens:  32
Output tokens: 828
Est. cost:     $0.0027

The output is a wall of markdown. Technically it covers the basics: title, responsibilities, qualifications. But look at what's missing: no skill taxonomy, no competency levels, no market-backed salary data, and no importance tag on any individual skill. The salary range ($120K–$160K) is a guess with no market justification. The qualifications do split into required and preferred, but within each list everything looks equally important.

32 input tokens, 828 output tokens, $0.0027. Fast and cheap. And you get what you pay for.
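
The arithmetic behind that estimate, as a minimal sketch. The rates are the ones quoted in the script's comment, so check current Bedrock pricing before relying on them. The estimate_cost helper is a name introduced here for illustration.

```python
"""Estimate Nova Pro cost from token counts (rates as quoted in this tutorial)."""

INPUT_RATE_PER_1K = 0.0008   # USD per 1,000 input tokens
OUTPUT_RATE_PER_1K = 0.0032  # USD per 1,000 output tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one call or one crew run."""
    return (input_tokens / 1000 * INPUT_RATE_PER_1K) + (
        output_tokens / 1000 * OUTPUT_RATE_PER_1K
    )


# Baseline run from above: 32 input tokens, 828 output tokens
print(f"${estimate_cost(32, 828):.4f}")  # prints $0.0027
```

With a 32-token prompt, almost all of the baseline's cost comes from the output side.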

Step 4 — Build the 4-Agent Crew

Four agents, each with a specific role, wired in sequence so each one builds on what came before.

Save this as 02_job_architecture_crew.py:

"""
4-agent job architecture crew.

Takes a job title and produces a structured job description
backed by market research, a skill taxonomy, and a competency framework.
Each agent builds on the previous agent's output.
"""

import sys
import time
import json
from crewai import Agent, Task, Crew, Process, LLM
from models import (
    MarketResearchOutput,
    SkillTaxonomyOutput,
    CompetencyFrameworkOutput,
    JobDescriptionOutput,
)

# --- LLM setup ---

llm = LLM(model="bedrock/amazon.nova-pro-v1:0", temperature=0.7)

One LLM instance is shared across all four agents. Each agent gets its own role, goal, and backstory:

market_researcher = Agent(
    role="Market Research Analyst",
    goal="Research market positioning, salary benchmarks, and industry demand for a given job title",
    backstory=(
        "You spent a decade at Mercer and Radford running compensation surveys "
        "and labor market analyses for Fortune 500 companies. You know how to "
        "benchmark roles across industries and geographies."
    ),
    llm=llm,
    verbose=True,
)

skills_analyst = Agent(
    role="Skills Taxonomy Analyst",
    goal="Map all required skills into a structured taxonomy with importance levels",
    backstory=(
        "Former head of skills architecture at LinkedIn, you built the skill "
        "ontology that powers their talent intelligence platform. You think in "
        "taxonomies — technical, soft, domain, tools, certifications — and always "
        "tag each skill with how critical it really is."
    ),
    llm=llm,
    verbose=True,
)

competency_designer = Agent(
    role="Competency Framework Designer",
    goal="Define proficiency levels and assessment criteria for each core competency",
    backstory=(
        "You designed competency models for Deloitte's human capital practice. "
        "Your frameworks map every competency from junior to principal level with "
        "concrete behavioral indicators and practical assessment methods."
    ),
    llm=llm,
    verbose=True,
)

jd_composer = Agent(
    role="Job Description Composer",
    goal="Synthesize all research into a polished, structured, bias-aware job description",
    backstory=(
        "You write job descriptions for a living — currently at a top HR-tech "
        "startup. You know what makes candidates click Apply: clear language, "
        "no jargon walls, honest requirements, and inclusive framing. You always "
        "separate must-haves from nice-to-haves because inflated requirements "
        "drive away qualified candidates."
    ),
    llm=llm,
    verbose=True,
)

The backstories matter more than you'd think. crewAI injects them into the system prompt for each agent, and they shape how the LLM approaches the task. A "compensation survey veteran" writes differently than a generic assistant — more specific numbers, more market awareness, less hand-waving.

The task definitions are where context connects everything:

def build_tasks(role_title: str) -> list[Task]:
    t1 = Task(
        description=(
            f"Research the market landscape for the '{role_title}' role. "
            f"Provide salary benchmarks (USD), demand level, common alternative titles, "
            f"typical team structure, industry context, and remote work trends. "
            f"Base your analysis on current market conditions."
        ),
        expected_output="Market research report with salary data, demand analysis, and industry positioning",
        agent=market_researcher,
        output_pydantic=MarketResearchOutput,
    )

    t2 = Task(
        description=(
            f"Using the market research provided, build a complete skill taxonomy for "
            f"the '{role_title}' role. Categorize skills into: technical, soft, domain, "
            f"tools/platforms, and certifications. Mark each as required, preferred, "
            f"or nice-to-have. Add brief context for why each skill matters."
        ),
        expected_output="Structured skill taxonomy with importance levels and context",
        agent=skills_analyst,
        output_pydantic=SkillTaxonomyOutput,
        context=[t1],
    )

    t3 = Task(
        description=(
            f"Using the market research and skill taxonomy, design a competency "
            f"framework for the '{role_title}' role. Define 4-6 core competencies, "
            f"each with proficiency levels from junior to lead/principal. Include "
            f"concrete assessment methods for each competency."
        ),
        expected_output="Competency framework with proficiency levels and assessment criteria",
        agent=competency_designer,
        output_pydantic=CompetencyFrameworkOutput,
        context=[t1, t2],
    )

    t4 = Task(
        description=(
            f"Synthesize all the research, skills, and competency data into a final "
            f"job description for the '{role_title}' role. The JD must be:\n"
            f"- Clear and jargon-free\n"
            f"- Bias-aware (avoid gendered language, unnecessary requirements)\n"
            f"- Structured with separate required vs preferred qualifications\n"
            f"- Include salary range from the market research\n"
            f"- Include a growth path and DEI commitment statement"
        ),
        expected_output="Complete, structured job description ready for posting",
        agent=jd_composer,
        output_pydantic=JobDescriptionOutput,
        context=[t1, t2, t3],
    )

    return [t1, t2, t3, t4]

crewAI serializes each task's Pydantic output and feeds it into the next agent's prompt. By task 4, the Composer sees all three prior outputs — market data, skills, and competency frameworks.
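
To make the accumulation concrete, here is a simplified stand-in for how a task prompt could be assembled from earlier outputs. This is not crewAI's actual prompt template; build_prompt, the placeholder outputs, and the shortened descriptions are all assumptions for illustration.

```python
"""Illustrate how prompt size grows when each task receives prior outputs."""


def build_prompt(description: str, context_outputs: list[str]) -> str:
    """Join a task description with the serialized outputs of earlier tasks.

    Simplified stand-in; the real framework adds role, goal, backstory,
    and schema instructions on top of this.
    """
    if not context_outputs:
        return description
    context = "\n\n".join(context_outputs)
    return f"{description}\n\nContext from earlier tasks:\n{context}"


# Hypothetical placeholder outputs; real ones are the agents' full JSON strings.
outputs = ['{"job_title": "..."}', '{"technical_skills": []}', '{"competencies": []}']
descriptions = [
    "Research the market",
    "Build the taxonomy",
    "Design the framework",
    "Compose the JD",
]

# Task i sees the outputs of tasks 0..i-1, mirroring context=[t1], [t1, t2], ...
prompt_lengths = [
    len(build_prompt(desc, outputs[:i])) for i, desc in enumerate(descriptions)
]
print(prompt_lengths)
```

Each later task carries everything before it, which is one reason the crew's input-token count ends up far larger than the baseline's.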

output_pydantic on each task tells crewAI to parse the LLM's response into the specified model. If the JSON doesn't match the schema, crewAI retries automatically. During my runs, it never needed a retry — Nova Pro handled the structured output on the first attempt every time.
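
The validate-then-retry idea can be sketched with the standard library alone. This is a conceptual illustration under stated assumptions, not crewAI's implementation: parse_salary_range, generate_with_retry, and the stubbed responses are hypothetical, and the field names mirror SalaryRange from models.py.

```python
"""Validate-then-retry loop for structured LLM output (conceptual sketch)."""

import json
from typing import Callable

REQUIRED_KEYS = {"currency", "min_annual", "max_annual", "median_annual"}


def parse_salary_range(raw: str) -> dict:
    """Parse JSON and enforce the SalaryRange fields plus a sanity check."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not data["min_annual"] <= data["median_annual"] <= data["max_annual"]:
        raise ValueError("expected min <= median <= max")
    return data


def generate_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate` until its output validates or attempts run out."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return parse_salary_range(generate())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Hypothetical stand-in for an LLM call: fails once, then returns valid JSON.
responses = iter([
    "Sure! Here is the salary data you asked for.",
    '{"currency": "USD", "min_annual": 120000, "max_annual": 220000, "median_annual": 170000}',
])
result = generate_with_retry(lambda: next(responses))
print(result["median_annual"])  # prints 170000
```

The ordering check goes beyond what the type annotations in models.py enforce; it is the kind of rule you might add as a validator if the salary numbers feed a downstream system.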

The crew wires together with Process.sequential:

def run_crew(role_title: str):
    tasks = build_tasks(role_title)

    crew = Crew(
        agents=[market_researcher, skills_analyst, competency_designer, jd_composer],
        tasks=tasks,
        process=Process.sequential,
        verbose=True,
    )

    print(f"\n{'='*60}")
    print(f"  Job Architecture Crew — {role_title}")
    print(f"{'='*60}\n")

    start = time.time()
    result = crew.kickoff()
    elapsed = time.time() - start

    # Print structured output
    print(f"\n{'='*60}")
    print("  FINAL STRUCTURED OUTPUT")
    print(f"{'='*60}\n")

    if result.pydantic:
        print(json.dumps(result.pydantic.model_dump(), indent=2))
    else:
        print(result.raw)

    # Metrics
    usage = result.token_usage
    input_tokens = usage.prompt_tokens if usage else 0
    output_tokens = usage.completion_tokens if usage else 0
    total_tokens = usage.total_tokens if usage else 0

    cost = (input_tokens / 1000 * 0.0008) + (output_tokens / 1000 * 0.0032)

    print(f"\n{'-'*60}")
    print(f"Latency:        {elapsed:.1f}s")
    print(f"Input tokens:   {input_tokens:,}")
    print(f"Output tokens:  {output_tokens:,}")
    print(f"Total tokens:   {total_tokens:,}")
    print(f"Est. cost:      ${cost:.4f}")
    print(f"{'-'*60}")

    return result


if __name__ == "__main__":
    title = sys.argv[1] if len(sys.argv) > 1 else "Senior ML Engineer"
    run_crew(title)

Run it on the same role we used for the baseline:

python 02_job_architecture_crew.py "Senior ML Engineer"

Watch the verbose trace — you can see each agent pick up the prior agent's output and build on it. The prompt sizes grow visibly with each step as context accumulates. Here's the first agent kicking off:

============================================================
  Job Architecture Crew — Senior ML Engineer
============================================================

╭───────────────────────── 🚀 Crew Execution Started ──────────────────────────╮
│                                                                              │
│  Crew Execution Started                                                      │
│  Name:                                                                       │
│  crew                                                                        │
│  ID:                                                                         │
│  1acae4e9-a5e5-4bd7-bb28-c6c7e71b5949                                        │
│                                                                              │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭────────────────────────────── 📋 Task Started ───────────────────────────────╮
│                                                                              │
│  Task Started                                                                │
│  Name: Research the market landscape for the 'Senior ML Engineer' role.      │
│  Provide salary benchmarks (USD), demand level, common alternative titles,   │
│  typical team structure, industry context, and remote work trends. Base      │
│  your analysis on current market conditions.                                 │
│  ID: 5799e7d0-c79b-43a0-b40b-208ae84e1df0                                    │
│                                                                              │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭────────────────────────────── 🤖 Agent Started ──────────────────────────────╮
│                                                                              │
│  Agent: Market Research Analyst                                              │
│                                                                              │
│  Task: Research the market landscape for the 'Senior ML Engineer' role.      │
│  Provide salary benchmarks (USD), demand level, common alternative titles,   │
│  typical team structure, industry context, and remote work trends. Base      │
│  your analysis on current market conditions.                                 │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭─────────────────────────── ✅ Agent Final Answer ────────────────────────────╮
│                                                                              │
│  Agent: Market Research Analyst                                              │
│                                                                              │
│  Final Answer:                                                               │
│  {                                                                           │
│    "job_title": "Senior ML Engineer",                                        │
│    "alternative_titles": [                                                   │
│      "Principal Machine Learning Engineer",                                  │
│      "Lead Machine Learning Engineer",                                       │
│      "Staff Machine Learning Engineer",                                      │
│      "Machine Learning Scientist",                                           │
│      "Applied Machine Learning Engineer"                                     │
│    ],                                                                        │
│    "salary_range": {                                                         │
│      "currency": "USD",                                                      │
│      "min_annual": 120000,                                                   │
│      "max_annual": 220000,                                                   │
│      "median_annual": 170000                                                 │
│    },                                                                        │
│    "market_demand": "High. The demand for Senior ML Engineers is high        │
│  across various industries due to the increasing importance of data-driven   │
│  decision-making and automation. Companies are investing heavily in machine  │
│  learning capabilities to gain a competitive edge, driving up the demand     │
│  for skilled professionals in this field.",                                  │
│    ...                                                                       │
│  }                                                                           │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

The Skills Taxonomy Analyst, Competency Framework Designer, and Job Description Composer follow the same pattern — each picking up the previous agent's structured JSON and building on it. After all four finish, the final structured output:

============================================================
  FINAL STRUCTURED OUTPUT
============================================================

{
  "title": "Senior ML Engineer",
  "department": "Engineering",
  "summary": "We are seeking a Senior Machine Learning Engineer to join our team. You will design, implement, and optimize machine learning models to drive data-driven decision-making and automation across our organization. This role requires a blend of technical expertise, problem-solving skills, and the ability to collaborate with cross-functional teams.",
  "responsibilities": [
    "Design, implement, and optimize machine learning models.",
    "Perform statistical analysis to understand data distributions, evaluate models, and test hypotheses.",
    "Clean, transform, and prepare data for machine learning models.",
    "Assess the performance and generalizability of machine learning models.",
    "Collaborate with data scientists, software engineers, product managers, and domain experts.",
    "Explain complex machine learning concepts to non-technical stakeholders.",
    "Guide junior engineers and influence project direction.",
    "Stay updated with the latest machine learning technologies and methodologies."
  ],
  "required_qualifications": [
    "6-8 years of experience in machine learning engineering roles.",
    "Master's or Ph.D. in Computer Science, Data Science, or a related field.",
    "Proficiency in Machine Learning Algorithms.",
    "Strong skills in Statistical Analysis.",
    "Expertise in Data Preprocessing.",
    "Experience with Model Evaluation and Validation.",
    "Proficiency in Python, TensorFlow, PyTorch, and Scikit-learn.",
    "Excellent Problem-Solving skills.",
    "Strong Communication skills.",
    "Effective Collaboration skills."
  ],
  "preferred_qualifications": [
    "Experience with Deep Learning.",
    "Knowledge of Natural Language Processing.",
    "Familiarity with Reinforcement Learning.",
    "Leadership experience.",
    "Adaptability to new technologies and methodologies.",
    "Industry-Specific Knowledge.",
    "Business Acumen.",
    "Experience with Apache Spark.",
    "Experience with AWS/GCP/Azure.",
    "Certified Machine Learning Engineer (CMLEng).",
    "AWS Certified Machine Learning – Specialty.",
    "Google Professional Machine Learning Engineer."
  ],
  "competency_requirements": [
    "Expert in Machine Learning Algorithms.",
    "Expert in Statistical Analysis.",
    "Expert in Data Preprocessing.",
    "Expert in Model Evaluation and Validation.",
    "Expert Problem-Solver.",
    "Expert Communicator."
  ],
  "salary_range": "USD 120,000 - USD 220,000",
  "benefits_highlights": [
    "Comprehensive health, dental, and vision insurance.",
    "Retirement savings plan with company match.",
    "Paid time off and holidays.",
    "Professional development opportunities.",
    "Flexible work arrangement (remote/hybrid)."
  ],
  "growth_path": "This role offers a clear career progression path. Successful candidates can advance to Lead Machine Learning Engineer, Principal Machine Learning Engineer, or Director of Machine Learning, depending on their performance and career aspirations.",
  "dei_statement": "We are committed to creating a diverse environment and are proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.",
  "work_arrangement": "This position offers a flexible work arrangement, including remote and hybrid options. Some onsite presence may be required for certain projects or team meetings."
}

------------------------------------------------------------
Latency:        22.3s
Input tokens:   32,648
Output tokens:  14,328
Total tokens:   46,976
Est. cost:      $0.0720
------------------------------------------------------------

[Figure: the full pipeline, end to end]

Step 5 — Compare the Results

Same role, same LLM, wildly different output.

The baseline gave us a generic markdown JD — responsibilities, qualifications, a guessed salary range. Flat text, no structure, nothing a system could parse.

The crew produced structured JSON. Salary data came with a market-backed range ($120,000–$220,000, $170,000 median) instead of a guess. Skills were categorized — 7 technical, 5 soft, 2 domain, 6 tools/platforms, 3 certifications — each tagged as required, preferred, or nice-to-have. The competency framework included proficiency levels from junior to lead with concrete assessment methods.

Qualifications landed in separate required and preferred lists that a system can read directly, so candidates can self-assess honestly. A growth path showed up, which the baseline didn't attempt at all, and the DEI statement became its own field instead of a trailing paragraph.

To confirm this generalizes, I ran the crew on a completely different role:

python 02_job_architecture_crew.py "HR Business Partner"

Different universe. The salary range shifted to $70,000–$130,000. Technical skills gave way to employment law, HRIS platforms, and conflict resolution. Competencies like "change management" and "stakeholder management" replaced "deep learning" and "model evaluation." Same pipeline, completely different output shaped by the first agent's market research.

The numbers

Metric	Baseline	4-Agent Crew
Latency	5.8s	22.3s
Input tokens	32	32,648
Output tokens	828	14,328
Total tokens	860	46,976
Est. cost	$0.0027	$0.0720

The crew is 27x more expensive and 3.8x slower. For $0.07 and 22 seconds, you get a structured, market-informed job architecture instead of a generic text dump. In a real HR workflow — where a bad JD means months of wrong candidates — that tradeoff isn't even close.
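
For anyone who wants to check the multipliers, this small sketch recomputes them from the table above. The dictionary keys are names chosen here; the numbers are the ones reported in the table.

```python
"""Recompute the comparison ratios from the table above."""

baseline = {"latency_s": 5.8, "total_tokens": 860, "cost_usd": 0.0027}
crew = {"latency_s": 22.3, "total_tokens": 46_976, "cost_usd": 0.0720}

# Ratio of crew to baseline for each metric
ratios = {key: crew[key] / baseline[key] for key in baseline}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
# prints:
# latency_s: 3.8x
# total_tokens: 54.6x
# cost_usd: 26.7x
```

Tokens grow about 55x while cost grows about 27x. The gap comes from the token mix: most of the crew's tokens are input tokens, which are billed at a quarter of the output rate, whereas the baseline is almost entirely output tokens.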

Conclusion

Four agents, one pipeline, $0.07 per role. Market data grounds the whole thing. From there, skills get tagged into a real taxonomy instead of a flat list, and proficiency levels mean something during interviews. The final JD is structured JSON a system can parse — not just prose for a human to skim.

The whole tutorial ran for under $0.15 on Nova Pro, including the baseline and two crew runs.

Where to take this next:

Add a web search tool to the Market Research agent so salary data comes from live sources instead of the LLM's training data
Wire the output into an ATS API (Greenhouse, Lever) to post JDs directly
Batch-process an entire department — feed in 10 role titles, get back a consistent job architecture with aligned leveling

All the code is on GitHub.

How to Build a Serverless AI Agent with Amazon Bedrock and Lambda

Sachin m — Sat, 21 Feb 2026 19:53:55 +0000

Last month I needed an internal tool that could answer HR questions — leave balances, policy lookups, team schedules. The obvious approach was a chatbot, but a plain LLM just hallucinates answers it doesn't have. It needs access to actual data.

Amazon Bedrock Agents solve this by letting an LLM call backend functions through what AWS calls "action groups." The LLM reads function descriptions, decides which one matches the user's question, extracts the right parameters from natural language, and calls the function. The entire thing runs serverless — no EC2, no containers, no servers to babysit.

By the end of this tutorial, you will have a working Bedrock Agent backed by a Lambda function that handles four HR operations: checking leave balances, submitting time-off requests, looking up company policies, and viewing team calendars. Total AWS cost to follow along: under $1.

Prerequisites

An AWS account with Amazon Bedrock access enabled
AWS CLI v2 installed and configured (aws configure)
Python 3.12+
boto3 1.35+ (pip install boto3)
Amazon Nova Pro model access enabled in the Bedrock console (us-east-1)

Step 1 — Create the IAM Roles

You need two roles: one for Lambda (so it can write logs) and one for the Bedrock Agent (so it can invoke the LLM).

First, create the Lambda trust policy. Save this as lambda-trust-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}

Create the role:

aws iam create-role \
  --role-name hr-leave-agent-lambda-role \
  --assume-role-policy-document file://lambda-trust-policy.json

{
    "Role": {
        "RoleName": "hr-leave-agent-lambda-role",
        "Arn": "arn:aws:iam::074095961149:role/hr-leave-agent-lambda-role",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "lambda.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
    }
}

Attach the basic execution policy so Lambda can write to CloudWatch:

aws iam attach-role-policy \
  --role-name hr-leave-agent-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

No output means it worked.

Now the Bedrock Agent role. This trust policy is slightly more involved — it restricts trust to Bedrock agents in your specific account. Save as agent-trust-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "aws:SourceAccount": "YOUR_ACCOUNT_ID" },
        "ArnLike": { "aws:SourceArn": "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:agent/*" }
      }
    }
  ]
}

Replace YOUR_ACCOUNT_ID with your 12-digit AWS account ID, then create the role:

aws iam create-role \
  --role-name hr-leave-agent-bedrock-role \
  --assume-role-policy-document file://agent-trust-policy.json

{
    "Role": {
        "RoleName": "hr-leave-agent-bedrock-role",
        "Arn": "arn:aws:iam::074095961149:role/hr-leave-agent-bedrock-role",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "bedrock.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole",
                    "Condition": {
                        "StringEquals": {
                            "aws:SourceAccount": "074095961149"
                        },
                        "ArnLike": {
                            "aws:SourceArn": "arn:aws:bedrock:us-east-1:074095961149:agent/*"
                        }
                    }
                }
            ]
        }
    }
}
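If you would rather not hand-edit the account ID, the agent trust policy can also be generated. This is a minimal sketch: build_agent_trust_policy is a hypothetical helper, and the account ID below is a placeholder.

```python
import json

def build_agent_trust_policy(account_id: str, region: str = "us-east-1") -> dict:
    """Return a Bedrock Agent trust policy scoped to one account and region."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {
                        "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:agent/*"
                    },
                },
            }
        ],
    }

# "123456789012" is a placeholder account ID, not a real one
policy = build_agent_trust_policy("123456789012")
print(json.dumps(policy, indent=2))
```

Writing the printed JSON to agent-trust-policy.json gives you the file that the create-role command expects.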

The agent needs permission to invoke foundation models. Save this as invoke-model-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }
  ]
}

Attach it as an inline policy:

aws iam put-role-policy \
  --role-name hr-leave-agent-bedrock-role \
  --policy-name BedrockInvokeModelPolicy \
  --policy-document file://invoke-model-policy.json

Two roles, two policies. That's the IAM overhead for this entire project.

Step 2 — Write the Lambda Function

This is the backend the agent will call. In production you'd query a database — here we use mock data with five employees, leave balances, team calendars, and company policies.

Save as lambda_function.py:

import json
from datetime import datetime

EMPLOYEES = {
    "EMP001": {"name": "Priya Sharma", "team": "engineering", "pto_remaining": 12, "sick_remaining": 5, "role": "Senior Developer"},
    "EMP002": {"name": "James Chen", "team": "engineering", "pto_remaining": 3, "sick_remaining": 5, "role": "DevOps Engineer"},
    "EMP003": {"name": "Sarah Johnson", "team": "marketing", "pto_remaining": 8, "sick_remaining": 4, "role": "Content Manager"},
    "EMP004": {"name": "Raj Patel", "team": "engineering", "pto_remaining": 0, "sick_remaining": 2, "role": "Junior Developer"},
    "EMP005": {"name": "Maria Garcia", "team": "sales", "pto_remaining": 15, "sick_remaining": 5, "role": "Account Executive"},
}

TEAM_CALENDAR = {
    "engineering": {
        "2026-02": [
            {"employee_id": "EMP002", "name": "James Chen", "dates": "Feb 23-25", "type": "PTO"},
        ],
        "2026-03": [
            {"employee_id": "EMP001", "name": "Priya Sharma", "dates": "Mar 9-13", "type": "PTO"},
            {"employee_id": "EMP004", "name": "Raj Patel", "dates": "Mar 16", "type": "Sick"},
        ],
    },
    "marketing": {
        "2026-03": [
            {"employee_id": "EMP003", "name": "Sarah Johnson", "dates": "Mar 2-6", "type": "PTO"},
        ],
    },
    "sales": {
        "2026-03": [
            {"employee_id": "EMP005", "name": "Maria Garcia", "dates": "Mar 10-14", "type": "PTO"},
        ],
    },
}

POLICIES = {
    "pto": (
        "Annual PTO allowance: 20 days for full-time employees, accrued at 1.67 days/month. "
        "Requests of 1-2 days need 3 business days notice. Requests of 3+ days need 2 weeks notice. "
        "Unused PTO carries over up to 5 days into the next calendar year. "
        "No more than 10 consecutive business days without VP approval."
    ),
    "sick_leave": (
        "5 sick days per year, no advance notice needed but notify your manager by 9 AM. "
        "Doctor's note required if you're out 3+ consecutive days. Sick days don't carry over."
    ),
    "remote_work": (
        "Up to 2 days/week remote with manager approval. Core hours 10 AM - 4 PM ET. "
        "VPN required for all remote access. Full-time remote needs VP sign-off."
    ),
    "bereavement": (
        "5 paid days for immediate family (spouse, parent, child, sibling). "
        "3 paid days for extended family. Does not count against PTO."
    ),
    "parental": (
        "16 weeks fully paid for primary caregivers, 6 weeks for secondary. "
        "Notify HR at least 30 days before expected start date."
    ),
}


def check_leave_balance(employee_id):
    emp = EMPLOYEES.get(employee_id)
    if not emp:
        return {"error": f"No employee found with ID {employee_id}"}
    return {
        "employee_id": employee_id,
        "name": emp["name"],
        "pto_remaining": emp["pto_remaining"],
        "sick_remaining": emp["sick_remaining"],
        "pto_annual_total": 20,
        "sick_annual_total": 5,
    }


def submit_leave_request(employee_id, start_date, end_date, leave_type):
    emp = EMPLOYEES.get(employee_id)
    if not emp:
        return {"error": f"No employee found with ID {employee_id}"}

    leave_type = leave_type.lower()

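    try:
        s = datetime.strptime(start_date, "%Y-%m-%d")
        e = datetime.strptime(end_date, "%Y-%m-%d")
    except ValueError:
        # Reject malformed dates instead of silently treating them as a 1-day request
        return {"error": "Dates must be in YYYY-MM-DD format.", "employee_id": employee_id}
    if e < s:
        return {"error": "end_date is before start_date.", "employee_id": employee_id}
    days = (e - s).days + 1  # inclusive day count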

    if leave_type in ("pto", "vacation") and emp["pto_remaining"] < days:
        return {
            "status": "denied",
            "reason": f"Not enough PTO. Requested {days} days but only {emp['pto_remaining']} remaining.",
            "employee_id": employee_id,
        }
    if leave_type in ("sick", "sick_leave") and emp["sick_remaining"] < days:
        return {
            "status": "denied",
            "reason": f"Not enough sick leave. Requested {days} days but only {emp['sick_remaining']} remaining.",
            "employee_id": employee_id,
        }

    req_id = f"LR-2026-{employee_id[-3:]}-{start_date.replace('-', '')}"
    return {
        "status": "submitted",
        "request_id": req_id,
        "employee_id": employee_id,
        "name": emp["name"],
        "leave_type": leave_type,
        "start_date": start_date,
        "end_date": end_date,
        "days_requested": days,
        "message": f"Leave request {req_id} submitted for manager approval.",
    }


def get_company_policy(topic):
    topic_clean = topic.lower().strip().replace(" ", "_")
    for key, text in POLICIES.items():
        # Guard against an empty topic, which would otherwise match every key
        if topic_clean and (key in topic_clean or topic_clean in key):
            return {"topic": key, "policy": text}
    return {"error": f"No policy found for '{topic}'. Available: {', '.join(POLICIES.keys())}"}


def get_team_calendar(team_name, month):
    team = team_name.lower().strip()
    if team not in TEAM_CALENDAR:
        return {"error": f"Unknown team '{team_name}'. Available: {', '.join(TEAM_CALENDAR.keys())}"}

    cal = TEAM_CALENDAR[team]

    month_map = {
        "january": "01", "february": "02", "march": "03", "april": "04",
        "may": "05", "june": "06", "july": "07", "august": "08",
        "september": "09", "october": "10", "november": "11", "december": "12",
    }
    ml = month.lower().strip()

    for period, entries in cal.items():
        if ml in period or period in ml:
            return {"team": team_name, "month": period, "out_of_office": entries}
        for name, num in month_map.items():
            # Compare the month part only; a plain substring test finds "02" inside "2026"
            if name in ml and period.endswith(f"-{num}"):
                return {"team": team_name, "month": period, "out_of_office": entries}

    return {"team": team_name, "month": month, "out_of_office": [], "note": "Nobody scheduled off."}


FUNCTION_MAP = {
    "check_leave_balance": lambda p: check_leave_balance(p.get("employee_id", "")),
    "submit_leave_request": lambda p: submit_leave_request(
        p.get("employee_id", ""), p.get("start_date", ""),
        p.get("end_date", ""), p.get("leave_type", "pto"),
    ),
    "get_company_policy": lambda p: get_company_policy(p.get("topic", "")),
    "get_team_calendar": lambda p: get_team_calendar(p.get("team_name", ""), p.get("month", "")),
}


def lambda_handler(event, context):
    fn = event.get("function", "")
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    handler = FUNCTION_MAP.get(fn)
    result = handler(params) if handler else {"error": f"Unknown function: {fn}"}

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup", ""),
            "function": fn,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }

One thing about the request format: Bedrock Agents send parameters as a list of {name, value} pairs instead of a plain dictionary. The lambda_handler at the bottom flattens that into a dict and dispatches to the right function.
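To make the flattening concrete, here is a small sketch that runs without AWS. The event is hypothetical and trimmed to the fields the handler reads; it is modeled on the test payload used later in this step.

```python
# Hypothetical event, trimmed to the fields lambda_handler actually reads
event = {
    "actionGroup": "HRActions",
    "function": "submit_leave_request",
    "parameters": [
        {"name": "employee_id", "value": "EMP003"},
        {"name": "start_date", "value": "2026-03-20"},
        {"name": "end_date", "value": "2026-03-24"},
        {"name": "leave_type", "value": "pto"},
    ],
}

# Flatten the list of {name, value} pairs into a plain dict
params = {p["name"]: p["value"] for p in event.get("parameters", [])}
print(params)
```

The values here are all strings, which is why the Lambda parses dates itself. Using event.get("parameters", []) keeps the handler from failing when a call arrives without a parameters list.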

Package and deploy:

zip lambda_function.zip lambda_function.py

aws lambda create-function \
  --function-name hr-leave-agent \
  --runtime python3.12 \
  --role arn:aws:iam::YOUR_ACCOUNT_ID:role/hr-leave-agent-lambda-role \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://lambda_function.zip \
  --timeout 30 \
  --memory-size 128

{
    "FunctionName": "hr-leave-agent",
    "FunctionArn": "arn:aws:lambda:us-east-1:074095961149:function:hr-leave-agent",
    "Runtime": "python3.12",
    "Handler": "lambda_function.lambda_handler",
    "CodeSize": 2576,
    "Timeout": 30,
    "MemorySize": 128,
    "State": "Pending",
    "StateReason": "The function is being created."
}

Now grant Bedrock permission to invoke this function:

aws lambda add-permission \
  --function-name hr-leave-agent \
  --statement-id AllowBedrockInvoke \
  --action lambda:InvokeFunction \
  --principal bedrock.amazonaws.com \
  --source-account YOUR_ACCOUNT_ID

Quick sanity check — invoke the function directly to confirm it works before wiring up the agent:

aws lambda invoke \
  --function-name hr-leave-agent \
  --cli-binary-format raw-in-base64-out \
  --payload '{"function":"check_leave_balance","parameters":[{"name":"employee_id","value":"EMP001"}],"actionGroup":"HRActions"}' \
  /tmp/test-output.json

{
    "StatusCode": 200,
    "ExecutedVersion": "$LATEST"
}

Check the response payload written to /tmp/test-output.json:

{
    "messageVersion": "1.0",
    "response": {
        "actionGroup": "HRActions",
        "function": "check_leave_balance",
        "functionResponse": {
            "responseBody": {
                "TEXT": {
                    "body": "{\"employee_id\": \"EMP001\", \"name\": \"Priya Sharma\", \"pto_remaining\": 12, \"sick_remaining\": 5, \"pto_annual_total\": 20, \"sick_annual_total\": 5}"
                }
            }
        }
    }
}

Correct data, correct format. Lambda is ready.
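Note that the body field is a JSON string nested inside the JSON payload, so reading the result in a script takes two decoding steps. A minimal sketch, using the payload shown above as an in-memory dict:

```python
import json

# The response payload from the sanity check, as a Python dict
payload = {
    "messageVersion": "1.0",
    "response": {
        "actionGroup": "HRActions",
        "function": "check_leave_balance",
        "functionResponse": {
            "responseBody": {
                "TEXT": {
                    "body": '{"employee_id": "EMP001", "name": "Priya Sharma", '
                            '"pto_remaining": 12, "sick_remaining": 5, '
                            '"pto_annual_total": 20, "sick_annual_total": 5}'
                }
            }
        },
    },
}

body = payload["response"]["functionResponse"]["responseBody"]["TEXT"]["body"]
result = json.loads(body)  # body is itself a JSON string, so decode it again
print(result["name"], result["pto_remaining"])  # Priya Sharma 12
```

When reading from the file instead, json.load on /tmp/test-output.json gives you the outer dict and the second json.loads step stays the same.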

Step 3 — Create the Bedrock Agent

The agent needs three things: an LLM, instructions, and an action group that maps to the Lambda function.

aws bedrock-agent create-agent \
  --agent-name hr-leave-agent \
  --agent-resource-role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/hr-leave-agent-bedrock-role \
  --foundation-model amazon.nova-pro-v1:0 \
  --instruction "You are an HR assistant for a mid-size tech company. You help employees check their leave balances, submit time-off requests, look up company policies, and view team calendars. Be concise and helpful. When an employee asks about taking time off, check their balance first before submitting."

{
    "agent": {
        "agentId": "SUEO6W3BDO",
        "agentName": "hr-leave-agent",
        "agentStatus": "CREATING",
        "foundationModel": "amazon.nova-pro-v1:0"
    }
}

Save the agentId — you need it for every subsequent command. Mine is SUEO6W3BDO.

Now the action group. The function schema here is the critical piece — the LLM reads these descriptions to decide which function to call, so vague descriptions mean wrong routing.

aws bedrock-agent create-agent-action-group \
  --agent-id YOUR_AGENT_ID \
  --agent-version DRAFT \
  --action-group-name HRActions \
  --action-group-executor '{"lambda":"arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:hr-leave-agent"}' \
  --function-schema '{
    "functions": [
      {
        "name": "check_leave_balance",
        "description": "Look up how many PTO and sick days an employee has remaining.",
        "parameters": {
          "employee_id": {"type": "string", "description": "Employee ID, e.g. EMP001", "required": true}
        }
      },
      {
        "name": "submit_leave_request",
        "description": "Submit a new leave request. Returns confirmation or denial based on available balance.",
        "parameters": {
          "employee_id": {"type": "string", "description": "Employee ID", "required": true},
          "start_date": {"type": "string", "description": "Start date YYYY-MM-DD", "required": true},
          "end_date": {"type": "string", "description": "End date YYYY-MM-DD", "required": true},
          "leave_type": {"type": "string", "description": "pto, sick, bereavement, or parental", "required": true}
        }
      },
      {
        "name": "get_company_policy",
        "description": "Retrieve company policy for a topic: pto, sick_leave, remote_work, bereavement, or parental.",
        "parameters": {
          "topic": {"type": "string", "description": "Policy topic", "required": true}
        }
      },
      {
        "name": "get_team_calendar",
        "description": "Check who on a team is out of office during a given month.",
        "parameters": {
          "team_name": {"type": "string", "description": "engineering, marketing, or sales", "required": true},
          "month": {"type": "string", "description": "Month, e.g. March or 2026-03", "required": true}
        }
      }
    ]
  }'

{
    "agentActionGroup": {
        "actionGroupName": "HRActions",
        "actionGroupId": "VR1HJRPO25",
        "actionGroupState": "ENABLED",
        "functionSchema": {
            "functions": [
                {"name": "check_leave_balance", "requireConfirmation": "DISABLED"},
                {"name": "submit_leave_request", "requireConfirmation": "DISABLED"},
                {"name": "get_company_policy", "requireConfirmation": "DISABLED"},
                {"name": "get_team_calendar", "requireConfirmation": "DISABLED"}
            ]
        }
    }
}
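Because the agent routes by function name, the names in the schema have to match the keys of FUNCTION_MAP in the Lambda. A mismatch ends up in the handler's "Unknown function" branch. A check along these lines could catch drift before deployment; the two collections are copied by hand for illustration rather than loaded from the real files.

```python
# Function names and parameters as registered in the action group schema
SCHEMA_FUNCTIONS = {
    "check_leave_balance": ["employee_id"],
    "submit_leave_request": ["employee_id", "start_date", "end_date", "leave_type"],
    "get_company_policy": ["topic"],
    "get_team_calendar": ["team_name", "month"],
}

# Keys of FUNCTION_MAP in lambda_function.py, copied here to keep the sketch standalone
LAMBDA_FUNCTIONS = {
    "check_leave_balance",
    "submit_leave_request",
    "get_company_policy",
    "get_team_calendar",
}

missing_handlers = set(SCHEMA_FUNCTIONS) - LAMBDA_FUNCTIONS
unregistered = LAMBDA_FUNCTIONS - set(SCHEMA_FUNCTIONS)

print("Schema functions without a handler:", sorted(missing_handlers))
print("Handlers missing from the schema:", sorted(unregistered))
```

Both lists come back empty for the code in this tutorial.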

Four functions registered. Now prepare the agent — this compiles everything and makes it invocable:

aws bedrock-agent prepare-agent --agent-id YOUR_AGENT_ID

{
    "agentId": "SUEO6W3BDO",
    "agentStatus": "PREPARING"
}

Give it 10-15 seconds, then check the status:

aws bedrock-agent get-agent --agent-id YOUR_AGENT_ID \
  --query "agent.agentStatus" --output text

PREPARED

The agent is live.
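In a script, the wait-and-check step can be replaced by a polling loop. This is a sketch under stated assumptions: wait_until_prepared is a hypothetical helper, and the status source is a stand-in so the example runs without AWS access.

```python
import time
from typing import Callable

def wait_until_prepared(get_status: Callable[[], str],
                        timeout: float = 60.0,
                        interval: float = 5.0) -> str:
    """Poll get_status() until it returns PREPARED, reports FAILED, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "PREPARED":
            return status
        if status == "FAILED":
            raise RuntimeError("Agent preparation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Agent still {status} after {timeout}s")
        time.sleep(interval)

# Stand-in status source; a real one would query the agent's current status
fake_statuses = iter(["PREPARING", "PREPARING", "PREPARED"])
final = wait_until_prepared(lambda: next(fake_statuses), interval=0.01)
print(final)  # PREPARED
```

With boto3, the callable could wrap the bedrock-agent client's get_agent call and return response["agent"]["agentStatus"], which is the same field the CLI query above reads.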

Step 4 — Test the Agent

Save this as invoke_agent.py:

import boto3
import uuid

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "TSTALIASID"  # built-in test alias

def invoke(prompt, session_id=None):
    if not session_id:
        session_id = str(uuid.uuid4())

    resp = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=prompt,
    )

    answer = ""
    for event in resp["completion"]:
        if "chunk" in event:
            answer += event["chunk"]["bytes"].decode("utf-8")

    print(f"\nUser: {prompt}")
    print(f"Agent: {answer}")
    return session_id

# Test 1: Leave balance lookup
invoke("How many PTO days does employee EMP001 have left?")

# Test 2: Policy lookup
invoke("What's the remote work policy?")

# Test 3: Team calendar
invoke("Who on the engineering team is out in March?")

# Test 4: Leave request submission
invoke("Submit a PTO request for EMP003 from March 20 to March 24, 2026.")

# Test 5: Multi-turn — agent remembers context
sid = invoke("I'm employee EMP004. How much PTO do I have?")
invoke("Can I take March 10 off as PTO?", session_id=sid)

Replace YOUR_AGENT_ID with your agent ID, then run it:

python invoke_agent.py

User: How many PTO days does employee EMP001 have left?
Agent: Employee EMP001 has 12 PTO days left.

User: What's the remote work policy?
Agent: The remote work policy allows up to 2 days per week of remote work with manager
approval. Core hours are 10 AM - 4 PM ET. VPN is required for all remote access.
Full-time remote work needs VP sign-off.

User: Who on the engineering team is out in March?
Agent: In March, Priya Sharma (EMP001) is out from Mar 9-13 for PTO and Raj Patel
(EMP004) is out on Mar 16 for Sick leave.

User: Submit a PTO request for EMP003 from March 20 to March 24, 2026.
Agent: Your PTO request from March 20 to March 24, 2026, has been successfully submitted
for manager approval. You will receive a confirmation email within 24 hours.

User: I'm employee EMP004. How much PTO do I have?
Agent: Raj Patel, you currently have 0 days of PTO remaining this year. Your annual PTO
total is 20 days. If you need to take time off, you may want to consider using your
2 remaining sick days or discuss other options with your manager.

User: Can I take March 10 off as PTO?
Agent: Unfortunately, you don't have any PTO days remaining to take March 10 off.
You could use one of your 2 remaining sick days if applicable, or speak with your
manager about other arrangements.

The agent didn't just return raw data — it interpreted it. When EMP004 asked about PTO with a balance of zero, the agent suggested alternatives on its own (sick days, talking to the manager). Nobody told it to do that. It inferred the right behavior from the instruction "be helpful" combined with the data it got back.

The multi-turn test (Test 5) shows session memory working. The second message ("Can I take March 10 off?") doesn't mention EMP004 — the agent remembered from the first message in the same session.

And for the denial scenario — I ran one more test separately to make sure the validation logic works end to end:

User: Submit a PTO request for EMP004 from March 10 to March 12, 2026.
Agent: Sorry, Raj Patel (EMP004) does not have any PTO days remaining for the current
year. Therefore, the PTO request from March 10 to March 12, 2026, cannot be submitted.
Please consider using sick leave if applicable or discuss with your manager about taking
unpaid leave as an alternative.

The Lambda function caught the zero balance and returned a denial. The agent turned that into a clear explanation with alternatives — no special error-handling logic needed on our side.
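The denial comes down to one comparison. Here is a sketch of the arithmetic for this request, using the same inclusive day count the Lambda applies to valid date ranges (requested_days is an illustrative name, not part of the deployed code):

```python
from datetime import datetime

def requested_days(start_date: str, end_date: str) -> int:
    """Inclusive calendar-day count between two YYYY-MM-DD dates."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    return (end - start).days + 1

days = requested_days("2026-03-10", "2026-03-12")
pto_remaining = 0  # EMP004's balance in the mock data

print(days)                  # 3
print(pto_remaining < days)  # True, so the request is denied
```

This counts calendar days, not business days. The earlier EMP003 request from March 20 to March 24, 2026 runs Friday to Tuesday and counts as 5 days, weekend included. A production version would likely skip weekends and holidays.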

Step 5 — Clean Up

To avoid any charges, delete everything in reverse order:

# Delete the agent (this removes action groups too)
aws bedrock-agent delete-agent --agent-id YOUR_AGENT_ID

# Delete the Lambda function
aws lambda delete-function --function-name hr-leave-agent

# Delete the Bedrock role (remove inline policy first)
aws iam delete-role-policy --role-name hr-leave-agent-bedrock-role \
  --policy-name BedrockInvokeModelPolicy
aws iam delete-role --role-name hr-leave-agent-bedrock-role

# Delete the Lambda role (detach managed policy first)
aws iam detach-role-policy --role-name hr-leave-agent-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name hr-leave-agent-lambda-role

What You Could Build Next

This tutorial used mock data, but the pattern doesn't change when you swap in real backends:

Point the Lambda at a DynamoDB table instead of Python dictionaries and you've got a real HR bot
Swap the backend for an external API and the same agent handles order status lookups
Wire up a SQL query layer for a data analyst agent (this one's on my list to build next)

You could also add a second action group to the same agent — for example, an "ITSupport" group that handles password resets and laptop requests alongside the HR functions. One agent, multiple domains.

For the next level, look into the memory feature of Bedrock Agents (so context persists across sessions, not just within one) and guardrails (so the agent stays within bounds on sensitive topics like salary data).

The companion code for this tutorial is on GitHub.

How to Implement Prompt Caching on Amazon Bedrock and Cut Inference Costs in Half

Sachin m — Fri, 20 Feb 2026 06:49:00 +0000

Introduction

You're running a multi-turn support agent on Amazon Bedrock. Every API call sends a ~2,100-token system prompt — your agent's persona, rules, and the product documentation — along with the growing conversation history. The model doesn't remember any of this between calls. It reprocesses those tokens fresh every single turn, and you pay for every one of them.

For a single five-turn conversation on Nova Pro, that adds up to 12,834 input tokens. Over 80% of that is the static system prompt, repeated identically across all five turns. Scale to 1,000 conversations a day and your monthly bill hits $384. Most of that is money spent processing the same static text, over and over.

Amazon Bedrock's prompt caching fixes this. You mark a cache point in your prompt where the static content ends. Bedrock stores everything before that marker. On subsequent calls within the cache window, it reads from cache instead of reprocessing. Cache reads cost 90% less than regular input tokens.

I ran benchmarks across three Amazon Nova models to measure the real impact. Adding a single cachePoint to the system block cut Nova Pro's monthly projection roughly in half. And when I combined prompt caching with switching from Nova Pro to Nova Micro, the total reduction hit 97%. From $384 a month to under $10.

By the end of this tutorial, you will have:

Built a multi-turn customer support agent on Bedrock
Measured what it actually costs per conversation — baseline numbers
Added prompt caching (one line of code) and seen the difference
Run the same benchmark across Nova Pro, Lite, and Micro to compare
Set up CloudWatch monitoring so you know caching is actually working

All code is available in the companion repository: bedrock-prompt-caching-distillation-tutorial.

Without caching, every API call reprocesses the full system prompt at full price. With caching, the first call stores it, and subsequent calls read it back at a 90% discount.

Prerequisites

An AWS account with Amazon Bedrock access enabled in us-east-1
Model access granted for Amazon Nova Pro, Nova Lite, and Nova Micro (enabled by default for new accounts). Open the Bedrock Model Catalog to confirm they're listed.

Python 3.11+ with boto3 >= 1.35.76 (prompt caching support requires recent versions)
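AWS CLI v2 configured with credentials that allow bedrock:InvokeModel and bedrock:ListFoundationModels (IAM actions for runtime calls use the bedrock: prefix; bedrock-runtime is the endpoint and client name, not an IAM prefix)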
Estimated cost to complete this tutorial: under $0.15

python3 --version   # 3.11+
pip show boto3 | grep Version   # 1.35.76+
aws bedrock list-foundation-models \
  --query "modelSummaries[?modelId=='amazon.nova-pro-v1:0'].modelId" \
  --output text
# Expected: amazon.nova-pro-v1:0

Step 1 — Building a Realistic Baseline

Before optimizing anything, you need to know what you're spending. We'll build a customer support agent that answers questions using product documentation — basically what every Bedrock-powered support bot looks like under the hood.

The scenario

Your agent has a system prompt (~200 tokens) defining its persona and rules, plus product documentation (~1,900 tokens) pasted right into the system block — features, pricing, troubleshooting, the works. Then there's the conversation history, which grows with every turn.

The combined system content is 2,130 tokens. Every API call resends all of it, plus the growing conversation history. By Turn 5, the model is processing 2,860 input tokens per call — and the 2,130-token prefix hasn't changed since Turn 1.

The baseline code

The product documentation is an ~1,900-token spec for a fictional SaaS product called SmartWidget Pro — features, pricing, API docs, migration guides, troubleshooting. The full text is in product_docs.txt in the companion repo.

# 01_baseline_no_cache.py

import boto3
import time
from pathlib import Path

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "amazon.nova-pro-v1:0"

SYSTEM_PROMPT = """You are a senior customer support agent for SmartWidget, a SaaS company.

Rules:
- Always be polite, professional, and concise
- Reference the product documentation when answering
- If the answer is not in the documentation, say so honestly
- Format responses with bullet points for clarity
- Never make up product features or pricing that isn't documented
- Keep responses under 150 words unless the question requires more detail
"""

# Load product docs (~1,900 tokens)
PRODUCT_DOCS = Path("product_docs.txt").read_text()
FULL_SYSTEM = SYSTEM_PROMPT + "\n\n--- PRODUCT DOCUMENTATION ---\n\n" + PRODUCT_DOCS

# Realistic customer support conversation
QUESTIONS = [
    "What are the main features of SmartWidget Pro?",
    "How do I configure the API integration? Give me a quick start guide.",
    "What's the pricing for enterprise customers?",
    "My API is returning 429 errors. How do I fix this?",
    "How do I migrate from v3.x to v4.2? What are the breaking changes?",
]

def ask_question(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": FULL_SYSTEM}],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | Output: {usage['outputTokens']}")
    return response, elapsed

# Run 5-turn conversation
history = []
for i, question in enumerate(QUESTIONS):
    print(f"Turn {i+1}: {question}")
    response, latency = ask_question(question, history)
    history.append({"role": "user", "content": [{"text": question}]})
    history.append({"role": "assistant", "content": response["output"]["message"]["content"]})

Real output

Here's what a five-turn conversation looks like without caching. These are real numbers from our AWS account.

Total across five turns: 12,834 input tokens and 796 output tokens in 8.78 seconds.

Notice how input tokens grow each turn — from 2,140 to 2,860. That's the conversation history accumulating. But the first ~2,130 tokens in every call are the same system prompt and product docs, unchanged from Turn 1.

What this costs at scale

Nova Pro charges $0.80 per million input tokens and $3.20 per million output tokens. For this single conversation:

Input cost: 12,834 tokens × $0.80/M = $0.0103
Output cost: 796 tokens × $3.20/M = $0.0025
Total: $0.0128 per conversation

At 1,000 conversations per day, 30 days a month: $384 per month. For one model running one workflow.
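The projection is simple enough to reproduce, which helps when you plug in your own token counts. A short sketch using the rates and totals quoted above:

```python
INPUT_RATE = 0.80   # USD per million input tokens (Nova Pro, as quoted above)
OUTPUT_RATE = 3.20  # USD per million output tokens

input_tokens, output_tokens = 12_834, 796

input_cost = input_tokens * INPUT_RATE / 1_000_000
output_cost = output_tokens * OUTPUT_RATE / 1_000_000
per_conversation = input_cost + output_cost

# 1,000 conversations per day for 30 days
monthly = per_conversation * 1_000 * 30

# Share of input tokens taken up by the 2,130-token static prefix over five turns
static_share = 5 * 2_130 / input_tokens

print(f"Per conversation: ${per_conversation:.4f}")  # $0.0128
print(f"Monthly:          ${monthly:.0f}")           # $384
print(f"Static prefix:    {static_share:.0%}")       # 83%
```

The last figure is the one that matters for caching: roughly five out of every six input tokens in this conversation are the unchanged prefix.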

Step 2 — Adding Prompt Caching

The change is small — one additional element in the system array. Everything before the cachePoint marker gets stored. Subsequent calls that send the same prefix read from cache instead of reprocessing it.

If you want to try this visually first, the Bedrock Chat Playground has a prompt caching toggle built in. Select your model, scroll down in the left panel, and flip the Prompt caching switch.

Good for quick experiments, but you'll want this in code for production.

The code change

Compare this to the baseline. The only difference in the request is one extra element in the system parameter; the print statement also reports the two new cache counters:

# 02_with_prompt_caching.py
# Setup (imports, bedrock client, MODEL_ID, FULL_SYSTEM) is unchanged from 01_baseline_no_cache.py

def ask_question_cached(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": FULL_SYSTEM},
            {"cachePoint": {"type": "default"}},   # <-- this is the only change
        ],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | "
          f"Output: {usage['outputTokens']} | Cache read: {cache_read} | Cache write: {cache_write}")

    return response, elapsed

That {"cachePoint": {"type": "default"}} tells Bedrock: everything above this marker is static content. Cache it.

Real output with caching

Look at Turn 1 first. Cache WRITE tokens: 2130 and Cache READ tokens: 0 — the prefix has to be stored before it can be reused. This first call is the setup cost.

From Turn 2 onwards, every call shows Cache READ tokens: ~2,145 and Cache WRITE tokens: 0. The system prefix is being read from cache at 90% discount instead of reprocessed.

The input token counts tell the same story. Turn 1 reports only 10 input tokens (just the user question) instead of 2,140 — the other 2,130 show up in cacheWriteInputTokens. Turns 2–5 only count the conversation history and new question as input. The prefix tokens move to cacheReadInputTokens.

Cost breakdown

With caching, the billing splits into four components:

Component	Tokens	Rate (per 1M)	Cost
Non-cached input	1,738	$0.80	$0.001390
Cache read	8,581	$0.08 (90% off)	$0.000686
Cache write	2,130	$1.00 (25% premium)	$0.002130
Output	588	$3.20	$0.001882
Total			$0.006088

Cache reads are billed at $0.08 per million — one-tenth of the regular $0.80 input price. Cache writes carry a 25% premium at $1.00 per million, so Turn 1 actually costs slightly more than a non-cached call. But by Turn 2, the write premium is already paid off. Each subsequent cache read saves over $0.001 compared to reprocessing those tokens.
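The four-component billing above is easy to reproduce. The following is a minimal sketch, assuming the Nova Pro rates and token totals from the cost breakdown table; the function name `conversation_cost` is hypothetical and not part of the tutorial's repository.

```python
# Nova Pro rates in USD per 1M tokens, copied from the cost breakdown table.
RATES = {"input": 0.80, "cache_read": 0.08, "cache_write": 1.00, "output": 3.20}


def conversation_cost(input_tokens, cache_read, cache_write, output_tokens, rates=RATES):
    """Return the USD cost of one conversation from its four token totals."""
    micro_dollars = (
        input_tokens * rates["input"]
        + cache_read * rates["cache_read"]
        + cache_write * rates["cache_write"]
        + output_tokens * rates["output"]
    )
    return micro_dollars / 1_000_000


# Token totals from the cached five-turn run in the table.
cached = conversation_cost(1_738, 8_581, 2_130, 588)
monthly = cached * 1_000 * 30  # 1,000 conversations/day for 30 days

print(f"Cached conversation: ${cached:.6f}")
print(f"Monthly at 1,000 conversations/day: ${monthly:,.2f}")
```

Swapping in your own token totals from the `usage` block of each response gives a quick estimate before you commit to a model or a prompt layout.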

Side-by-side comparison

Metric	Baseline	Cached	Change
Non-cached input tokens	12,834	1,738	-86%
Cost per conversation	$0.0128	$0.0061	-52%
Monthly cost (1K convos/day)	$384	$183	-52%
Monthly savings	—	$201/month

Don't expect a latency win — caching saves money, not time. The model still processes user messages and generates responses at the same speed. Turn 1 actually runs a bit slower due to the cache write overhead. The win is purely in token billing: those 2,130 tokens of system content go from $0.80/M to $0.08/M on every cache hit.

Key rule: The content before the cache point must be byte-for-byte identical across requests. If you change even one character in the system prompt, it's a cache miss and a new cache write. Keep your cached prefix genuinely static.
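One way to catch accidental prefix drift on the client side is to fingerprint the exact text you send before the cache point and log it with each request. This is a sketch under assumptions: `prefix_fingerprint` is a hypothetical helper, and it says nothing about how Bedrock keys its cache internally; it only tells you whether your own bytes changed.

```python
import hashlib


def prefix_fingerprint(system_text: str) -> str:
    """Return a short SHA-256 fingerprint of the exact bytes sent as the cached prefix."""
    return hashlib.sha256(system_text.encode("utf-8")).hexdigest()[:12]


# Placeholder prefix for illustration; in the tutorial this would be FULL_SYSTEM.
static_prefix = "You are a support agent.\n\n--- PRODUCT DOCUMENTATION ---\n\n(docs here)"
drifting_prefix = static_prefix + " "  # a single trailing space changes the bytes

same = prefix_fingerprint(static_prefix) == prefix_fingerprint(static_prefix)
changed = prefix_fingerprint(static_prefix) != prefix_fingerprint(drifting_prefix)

print(f"Identical prefix keeps its fingerprint: {same}")
print(f"One extra space changes the fingerprint: {changed}")
```

If the fingerprint in your logs changes between deployments, expect a fresh cache write on the next call.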

Step 3 — Comparing Across Model Tiers

Prompt caching works on all Amazon Nova models. But per-token pricing is wildly different between tiers, and so are the absolute savings. I ran the same five-turn benchmark on Nova Pro, Nova Lite, and Nova Micro.

Full comparison table

These numbers are from a single benchmark run across all three models, using identical system prompts and questions:

Model	Baseline Monthly	Cached Monthly	Cost Reduction	Input Price
Nova Pro	$334.61	$169.99	49%	$0.80/M
Nova Lite	$30.33	$18.41	39%	$0.06/M
Nova Micro	$16.99	$9.47	44%	$0.035/M

Caching saves 39–49% across all three tiers. You might notice the Nova Pro number here (49%) differs slightly from the 52% in Step 2; these are separate benchmark runs with slightly different response lengths. The pattern holds: caching cuts costs by roughly half on Nova Pro and by about 40–45% on the cheaper tiers. The spread is not a pricing effect, since every tier applies the same 90% read discount and 25% write premium relative to its own input price (see the Cost Reference appendix). It most likely comes from each model producing responses of different lengths, which changes the share of the bill that caching cannot touch.

The real optimization: caching + model selection

Here's the number that matters for production. If your baseline uses Nova Pro without caching, and you switch to Nova Micro with caching, the combined savings are:

$334.61/month → $9.47/month — a 97% reduction.

That's not a typo. Most of the savings come from the model switch — Nova Micro's input tokens cost 23x less than Nova Pro's. Caching cuts the remaining bill roughly in half on top of that. For many customer support and document Q&A workloads, Nova Micro with caching delivers sufficient quality at a fraction of the cost.

Obviously, model quality matters. You shouldn't blindly switch from Pro to Micro. Test your specific use case, measure output quality, pick the cheapest model that meets your bar. But model selection and prompt caching together get you far bigger savings than either one alone.
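To see how much of the 97% comes from each lever, the percentages can be recomputed from the comparison table. A minimal sketch, assuming only the monthly figures shown in the table; `reduction_pct` is a hypothetical helper name.

```python
# Monthly costs in USD from the comparison table: (baseline, cached).
MONTHLY = {
    "Nova Pro": (334.61, 169.99),
    "Nova Lite": (30.33, 18.41),
    "Nova Micro": (16.99, 9.47),
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when the bill drops from `before` to `after`."""
    return (before - after) / before * 100


for model, (baseline, cached) in MONTHLY.items():
    print(f"{model}: {reduction_pct(baseline, cached):.0f}% from caching alone")

pro_baseline = MONTHLY["Nova Pro"][0]
micro_baseline, micro_cached = MONTHLY["Nova Micro"]

model_switch_only = reduction_pct(pro_baseline, micro_baseline)
combined = reduction_pct(pro_baseline, micro_cached)

print(f"Pro -> Micro, no caching: {model_switch_only:.0f}%")
print(f"Pro -> Micro with caching: {combined:.0f}%")
```

The model switch does most of the work, and caching trims what is left, which matches the description above.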

Step 4 — Caching Tool Definitions for Agentic Workflows

If your agent uses tools, those JSON schema definitions are resent with every API call — just like the system prompt. Cache them too.

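You can place up to four cache points per request. A common pattern is two: one after the system content, one after the tool definitions. Support for cache points inside toolConfig has differed between model families, so confirm in the Bedrock prompt caching documentation that your model accepts them there before relying on this pattern.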

# Cache both system content and tool definitions

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",
    system=[
        {"text": FULL_SYSTEM},
        {"cachePoint": {"type": "default"}},      # Cache point 1: system + docs
    ],
    messages=messages,
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "lookup_order",
                    "description": "Look up a customer order by order ID",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "order_id": {
                                    "type": "string",
                                    "description": "The order ID to look up"
                                }
                            },
                            "required": ["order_id"]
                        }
                    }
                }
            },
            {
                "toolSpec": {
                    "name": "check_inventory",
                    "description": "Check inventory for a product SKU",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "sku": {"type": "string", "description": "Product SKU"}
                            },
                            "required": ["sku"]
                        }
                    }
                }
            },
            {"cachePoint": {"type": "default"}},   # Cache point 2: tool definitions
        ]
    },
    inferenceConfig={"maxTokens": 512},
)

Cache point rules to know

The content before a cache point needs at least 1,024 tokens for Nova and Claude Sonnet, or 2,048 tokens for Haiku. Below that, the cache point is silently ignored — no error, just no caching. I missed this initially and spent 10 minutes wondering why my short test prompt wasn't caching.

You can place up to four cache points per request — system, tools, and up to two more in your messages. The default TTL is 5 minutes; if no calls hit the same prefix in that window, the entry expires.

One detail that matters at scale: cached tokens don't count against your rate limits. They bypass the per-model token-per-minute throttle, which is useful if you're running near your provisioned limit.
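Because a cache point below the minimum is ignored without an error, the only signal is in the response's usage block. The sketch below classifies a single call from the same `cacheReadInputTokens` and `cacheWriteInputTokens` fields used earlier; the function name and the three labels are assumptions made for illustration.

```python
def cache_status(usage: dict) -> str:
    """Classify one Converse response by what happened to the cached prefix."""
    read = usage.get("cacheReadInputTokens", 0)
    write = usage.get("cacheWriteInputTokens", 0)
    if read > 0:
        return "hit"         # prefix was served from cache
    if write > 0:
        return "write"       # prefix was stored on this call
    return "not_cached"      # cache point ignored, unsupported, or absent


# Example usage dictionaries shaped like response["usage"].
print(cache_status({"inputTokens": 10, "outputTokens": 133, "cacheWriteInputTokens": 2130}))
print(cache_status({"inputTokens": 163, "outputTokens": 89, "cacheReadInputTokens": 2145}))
print(cache_status({"inputTokens": 420, "outputTokens": 60}))
```

A first call that reports "not_cached" is the cue to check the prefix length against the minimum for your model.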

Step 5 — Monitoring Cache Performance in Production

In production, you need to track whether caching is actually working. A misconfigured prompt that changes the prefix on every call will silently miss the cache, and you'll pay full price without realizing it.

Publishing cache metrics to CloudWatch

# 08_observability.py

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_cache_metrics(usage, model_id):
    """Publish cache performance metrics after each Bedrock call."""
    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    total_input = usage["inputTokens"] + cache_read + cache_write

    hit_rate = (cache_read / total_input * 100) if total_input > 0 else 0

    cloudwatch.put_metric_data(
        Namespace="BedrockLLMOps",
        MetricData=[
            {
                "MetricName": "CacheHitTokens",
                "Value": cache_read,
                "Unit": "Count",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
            {
                "MetricName": "CacheHitRate",
                "Value": hit_rate,
                "Unit": "Percent",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
        ],
    )

Setting up alerts

A sudden drop in cache hit rate usually means something changed in your prompt prefix — maybe a deployment modified the system prompt, or a dynamic value crept into what should be static content.

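cloudwatch.put_metric_alarm(
    AlarmName="LowCacheHitRate",
    MetricName="CacheHitRate",
    Namespace="BedrockLLMOps",
    # Must match the dimensions used in put_metric_data; an alarm without them
    # watches a different (empty) metric and never leaves INSUFFICIENT_DATA.
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-pro-v1:0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:llmops-alerts"],
)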

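This fires if the average cache hit rate drops below 70% for three consecutive five-minute windows. In a healthy multi-turn system, the only complete misses should be Turn 1 of each conversation (the initial cache write). Treat 70% as a starting point rather than a universal target: every conversation includes that one write, and the uncached conversation history grows with each turn, so short conversations like the five-turn benchmark in Step 2 can sit close to 70% even when caching works perfectly. Measure your own steady-state rate and set the threshold just below it.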
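Before picking a threshold, it helps to know what hit rate a fully working cache produces for your traffic. The sketch below applies the same formula as `publish_cache_metrics` to the token totals from the Step 2 cost breakdown; `token_weighted_hit_rate` is a hypothetical name.

```python
def token_weighted_hit_rate(input_tokens: int, cache_read: int, cache_write: int) -> float:
    """Share of all input tokens that were served from cache, as a percentage."""
    total = input_tokens + cache_read + cache_write
    return cache_read / total * 100 if total else 0.0


# Totals for the cached five-turn conversation in Step 2:
# non-cached input, cache read, cache write.
step2_rate = token_weighted_hit_rate(1_738, 8_581, 2_130)
print(f"Step 2 conversation: {step2_rate:.1f}% of input tokens served from cache")
```

Running this against a day of real usage data gives a baseline to compare the alarm threshold against.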

Conclusion

One line of code, roughly half off. That's the short version. Adding a cachePoint to our system block cut Nova Pro's monthly bill from $384 to $183, no infrastructure changes, no output quality difference. Across all three Nova tiers, savings landed between 39% and 52%.

The bigger win is combining caching with model selection. Nova Pro without caching to Nova Micro with caching: $335 down to under $10 in the Step 3 benchmark. A 97% reduction that mostly comes from the model switch, with caching shaving off the rest.

A few gotchas before you ship this:

Cache writes cost 25% more than regular input. Caching starts paying for itself on the second call that reuses the same prefix, so for single-turn workflows with no reuse, skip it.

The default TTL is 5 minutes. Works fine for steady traffic. Bursty conversations with long gaps between turns will see more cache misses. Claude models on Bedrock support a 1-hour TTL if you need it.

Monitor your cache hit rate. A misconfigured prompt that changes the prefix between calls will miss the cache every time, silently. The CloudWatch alert from Step 5 catches this.
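The break-even point in the first gotcha can be checked with a few lines of arithmetic. This is a sketch assuming Nova Pro rates, a 2,130-token prefix, and that every call after the first lands inside the TTL window; `prefix_cost` is a hypothetical helper.

```python
def prefix_cost(calls, prefix_tokens, input_rate=0.80, read_rate=0.08, write_rate=1.00):
    """Return (uncached, cached) USD cost of sending the same prefix `calls` times.

    Assumes calls >= 1, one cache write on the first call, and cache reads afterwards.
    Rates are USD per 1M tokens.
    """
    uncached = calls * prefix_tokens * input_rate / 1_000_000
    cached = (prefix_tokens * write_rate + (calls - 1) * prefix_tokens * read_rate) / 1_000_000
    return uncached, cached


for calls in (1, 2, 5):
    uncached, cached = prefix_cost(calls, 2_130)
    verdict = "caching wins" if cached < uncached else "caching costs more"
    print(f"{calls} call(s): uncached ${uncached:.6f} vs cached ${cached:.6f} -> {verdict}")
```

If the gap between calls regularly exceeds the TTL, each expired entry triggers another write, so the break-even moves further out.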

Where to go next

If you want to push the cost savings further, model distillation on Bedrock lets you train a smaller student model on a larger teacher's outputs — purpose-built for your specific task. I'll cover that in a follow-up tutorial.

For production hardening, look into Bedrock Guardrails for content policy enforcement and LLM-as-a-Judge pipelines to continuously validate output quality as you swap models and tune prompts.

The complete code for this tutorial is available at: bedrock-prompt-caching-distillation-tutorial

Appendix: IAM Policy for This Tutorial

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:PutMetricAlarm"
            ],
            "Resource": "*"
        }
    ]
}

Appendix: Cost Reference

All pricing is per 1 million tokens as of February 2026. Verify current pricing at the Amazon Bedrock pricing page.

Model	Input	Output	Cache Read (90% off)	Cache Write (25% premium)
Nova Pro	$0.80	$3.20	$0.08	$1.00
Nova Lite	$0.06	$0.24	$0.006	$0.075
Nova Micro	$0.035	$0.14	$0.0035	$0.044

How to Implement Prompt Caching on Amazon Bedrock and Cut Inference Costs in Half

Sachin m — Thu, 19 Feb 2026 09:40:08 +0000

You're running a multi-turn support agent on Amazon Bedrock. Every API call sends a 2,069-token system prompt — your agent's persona, rules, and the product documentation — along with the growing conversation history. The model doesn't remember any of this between calls. It reprocesses those 2,069 tokens fresh every single turn, and you pay for every one of them.

For a single five-turn conversation on Nova Pro, that adds up to 11,814 input tokens. Nearly 90% of that (5 × 2,069 = 10,345 tokens) is the static system prompt, repeated identically across all five turns. Scale to 1,000 conversations a day and your monthly bill hits about $337. The system prompt alone accounts for roughly $250 of that: money spent processing the same static text, over and over.

By the end of this tutorial, you will have:

Built a multi-turn customer support agent on Bedrock
Measured what it actually costs per conversation — baseline numbers
Added prompt caching (one line of code) and seen the difference
Run the same benchmark across Nova Pro, Lite, and Micro to compare
Set up CloudWatch monitoring so you know caching is actually working

sachinm207 / bedrock-prompt-caching-distillation-tutorial

Tutorial: Cut LLM inference costs by up to 90% using Amazon Bedrock prompt caching. Companion code for blog post.

Bedrock Prompt Caching Tutorial

Cut LLM inference costs by up to 90% using Amazon Bedrock prompt caching and model tier selection.

What You'll Build

A production-optimized customer support agent that uses:

Prompt caching to eliminate redundant token processing (39–54% cost reduction)
Cross-model benchmarking across Nova Pro, Lite, and Micro
CloudWatch observability to monitor cache hit rates in production

Project Structure

.
├── 01_baseline_no_cache.py          # Baseline: no optimization
├── 02_with_prompt_caching.py        # Add prompt caching (Converse API)
├── run_all_benchmarks.py            # Run all models × baseline/cached
├── product_docs.txt                 # Sample product documentation (~2,069 tokens)
├── results_01_baseline.json         # Nova Pro baseline results
├── results_02_cached.json           # Nova Pro cached results
├── results_full_nova.json           # All 3 Nova models comparison
└── requirements.txt                 # Python dependencies

Prerequisites

AWS account with Bedrock access enabled in us-east-1
Model access for Amazon Nova Pro, Nova Lite, and Nova Micro (enabled by default)
Python 3.11+ with boto3 >= 1.35.76
AWS CLI v2…

View on GitHub

Without caching, every API call reprocesses the full system prompt at full price. With caching, the first call stores it, and subsequent calls read it back at a 90% discount.

Prerequisites

An AWS account with Amazon Bedrock access enabled in us-east-1
Model access granted for Amazon Nova Pro, Nova Lite, and Nova Micro (enabled by default for new accounts). Open the Bedrock Model Catalog to confirm they're listed.

Python 3.11+ with boto3 >= 1.35.76 (prompt caching support requires recent versions)
AWS CLI v2 configured with credentials that allow bedrock:InvokeModel on the Nova models (the IAM policy in the appendix covers everything this tutorial needs)
Estimated cost to complete this tutorial: under $0.15

python3 --version   # 3.11+
pip show boto3 | grep Version   # 1.35.76+
aws bedrock list-foundation-models \
  --query "modelSummaries[?modelId=='amazon.nova-pro-v1:0'].modelId" \
  --output text
# Expected: amazon.nova-pro-v1:0
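If you prefer to guard the version requirement inside your script, a numeric comparison is safer than comparing version strings. The helper below is a minimal sketch; `meets_minimum` is a hypothetical name, and it assumes plain dotted numeric versions such as the ones boto3 publishes.

```python
def meets_minimum(installed: str, minimum: str = "1.35.76") -> bool:
    """Compare dotted version strings numerically, component by component."""

    def parse(version: str) -> tuple:
        # Only the first three numeric components matter for this check.
        return tuple(int(part) for part in version.split(".")[:3])

    return parse(installed) >= parse(minimum)


# A plain string comparison would rank "1.35.9" above "1.35.76"; the numeric one does not.
for candidate in ("1.35.9", "1.35.76", "1.36.0"):
    print(candidate, "OK" if meets_minimum(candidate) else "too old for prompt caching")
```

In a real script you would pass `boto3.__version__` as the first argument and fail fast when the check returns False.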

Step 1 — Building a Realistic Baseline

The scenario

The combined system content is 2,069 tokens. Every API call resends all of it, plus the growing conversation history. By Turn 5, the model is processing 2,597 input tokens per call — and the 2,069-token prefix hasn't changed since Turn 1.

The baseline code

# 01_baseline_no_cache.py

import boto3
import time
from pathlib import Path

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "amazon.nova-pro-v1:0"

SYSTEM_PROMPT = """You are a senior customer support agent for SmartWidget, a SaaS company.

Rules:
- Always be polite, professional, and concise
- Reference the product documentation when answering
- If the answer is not in the documentation, say so honestly
- Format responses with bullet points for clarity
- Never make up product features or pricing that isn't documented
- Keep responses under 150 words unless the question requires more detail
"""

# Load product docs (~1,900 tokens)
PRODUCT_DOCS = Path("product_docs.txt").read_text()
FULL_SYSTEM = SYSTEM_PROMPT + "\n\n--- PRODUCT DOCUMENTATION ---\n\n" + PRODUCT_DOCS

# Realistic customer support conversation
QUESTIONS = [
    "What are the main features of SmartWidget Pro?",
    "How do I configure the API integration? Give me a quick start guide.",
    "What's the pricing for enterprise customers?",
    "My API is returning 429 errors. How do I fix this?",
    "How do I migrate from v3.x to v4.2? What are the breaking changes?",
]

def ask_question(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": FULL_SYSTEM}],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | Output: {usage['outputTokens']}")
    return response, elapsed

# Run 5-turn conversation
history = []
for i, question in enumerate(QUESTIONS):
    print(f"Turn {i+1}: {question}")
    response, latency = ask_question(question, history)
    history.append({"role": "user", "content": [{"text": question}]})
    history.append({"role": "assistant", "content": response["output"]["message"]["content"]})

Real output

Here's what a five-turn conversation looks like without caching. These are real numbers from our AWS account:

Turn 1: What are the main features of SmartWidget Pro?
  Latency: 2.60s | Input: 2,079 | Output: 137

Turn 2: How do I configure the API integration?
  Latency: 1.61s | Input: 2,235 | Output: 162

Turn 3: What's the pricing for enterprise customers?
  Latency: 1.08s | Input: 2,410 | Output: 63

Turn 4: My API is returning 429 errors. How do I fix this?
  Latency: 1.35s | Input: 2,493 | Output: 79

Turn 5: How do I migrate from v3.x to v4.2?
  Latency: 1.25s | Input: 2,597 | Output: 121

Total across five turns: 11,814 input tokens and 562 output tokens in 7.89 seconds.

Notice how input tokens grow each turn — from 2,079 to 2,597. That's the conversation history accumulating. But the first 2,069 tokens in every call are the same system prompt and product docs, unchanged from Turn 1.

What this costs at scale

Nova Pro charges $0.80 per million input tokens and $3.20 per million output tokens. For this single conversation:

Input cost: 11,814 tokens × $0.80/M = $0.0095
Output cost: 562 tokens × $3.20/M = $0.0018
Total: $0.0113 per conversation

At 1,000 conversations per day, 30 days a month: $337 per month. For one model running one workflow.
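The arithmetic above can be reproduced from the per-turn numbers in the baseline output. This is a small sketch using only those figures and the Nova Pro rates; the variable names are illustrative. Summing before rounding gives a per-conversation cost a hair under the rounded $0.0113, which is why the monthly figure comes out at $337 rather than $339.

```python
INPUT_RATE = 0.80    # USD per 1M input tokens (Nova Pro)
OUTPUT_RATE = 3.20   # USD per 1M output tokens (Nova Pro)
PREFIX_TOKENS = 2_069  # static system prompt + product docs

# Per-turn token counts from the baseline run.
turn_inputs = [2_079, 2_235, 2_410, 2_493, 2_597]
turn_outputs = [137, 162, 63, 79, 121]

total_in = sum(turn_inputs)
total_out = sum(turn_outputs)

cost = (total_in * INPUT_RATE + total_out * OUTPUT_RATE) / 1_000_000
monthly = cost * 1_000 * 30  # 1,000 conversations/day for 30 days

# Fraction of all input tokens that is the same prefix resent every turn.
prefix_share = PREFIX_TOKENS * len(turn_inputs) / total_in

print(f"Input tokens: {total_in:,} | Output tokens: {total_out:,}")
print(f"Cost per conversation: ${cost:.6f}")
print(f"Monthly cost: ${monthly:,.2f}")
print(f"Static prefix share of input: {prefix_share:.1%}")
```

The last line is the number that motivates caching: most of the input bill is the same text sent again.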

Step 2 — Adding Prompt Caching

If you want to try this visually first, the Bedrock Chat Playground has a prompt caching toggle built in. Select your model, scroll down in the left panel, and flip the Prompt caching switch.

Good for quick experiments, but you'll want this in code for production.

The code change

Compare this to the baseline: the only difference is one added cachePoint element in the system parameter.

# 02_with_prompt_caching.py

def ask_question_cached(question, conversation_history):
    messages = conversation_history + [
        {"role": "user", "content": [{"text": question}]}
    ]

    start_time = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": FULL_SYSTEM},
            {"cachePoint": {"type": "default"}},   # <-- this is the only change
        ],
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    elapsed = time.time() - start_time
    usage = response["usage"]

    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    print(f"  Latency: {elapsed:.2f}s | Input: {usage['inputTokens']} | "
          f"Output: {usage['outputTokens']} | Cache read: {cache_read} | Cache write: {cache_write}")

    return response, elapsed

That {"cachePoint": {"type": "default"}} tells Bedrock: everything above this marker is static content. Cache it.

Real output with caching

Turn 1: What are the main features of SmartWidget Pro?
  Latency: 2.04s | Input: 10 | Output: 133 | Cache read: 0 | Cache write: 2,069

Turn 2: How do I configure the API integration?
  Latency: 1.13s | Input: 163 | Output: 89 | Cache read: 2,068 | Cache write: 0

Turn 3: What's the pricing for enterprise customers?
  Latency: 0.96s | Input: 266 | Output: 49 | Cache read: 2,067 | Cache write: 0

Turn 4: My API is returning 429 errors. How do I fix this?
  Latency: 1.29s | Input: 334 | Output: 65 | Cache read: 2,068 | Cache write: 0

Turn 5: How do I migrate from v3.x to v4.2?
  Latency: 1.37s | Input: 425 | Output: 127 | Cache read: 2,067 | Cache write: 0

Look at Turn 1 first. Cache write: 2,069 and Cache read: 0 — the prefix has to be stored before it can be reused. This first call is the setup cost.

From Turn 2 onwards, every call shows Cache read: ~2,068 and Cache write: 0. The system prefix is being read from cache at 90% discount instead of reprocessed.

The input token counts tell the same story. Turn 1 reports only 10 input tokens (just the user question) instead of 2,079 — the other 2,069 show up in cacheWriteInputTokens. Turns 2–5 only count the conversation history and new question as input. The prefix tokens move to cacheReadInputTokens.

Cost breakdown

With caching, the billing splits into four components:

Component	Tokens	Rate (per 1M)	Cost
Non-cached input	1,198	$0.80	$0.000958
Cache read	8,270	$0.08 (90% off)	$0.000662
Cache write	2,069	$1.00 (25% premium)	$0.002069
Output	463	$3.20	$0.001482
Total			$0.005171

Side-by-side comparison

Metric	Baseline	Cached	Change
Non-cached input tokens	11,814	1,198	-90%
Total latency (5 turns)	7.89s	6.78s	-14%
Cost per conversation	$0.0113	$0.0052	-54%
Monthly cost (1K convos/day)	$337	$155	-54%
Monthly savings	—	$182/month

The latency improvement is modest at 14%. Caching saves money, not time — the model still processes user messages and generates responses at the same speed. The win is purely in token billing: those 2,069 tokens of system content go from $0.80/M to $0.08/M on every cache hit.

Key rule: The content before the cache point must be byte-for-byte identical across requests. If you change even one character in the system prompt, it's a cache miss and a new cache write. Keep your cached prefix genuinely static.
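A common way to break this rule is to interpolate a timestamp or user name into the system prompt. One option is to keep the system block fixed and move per-request values into the user turn, after the cache point. The sketch below only builds the request arguments; `STATIC_SYSTEM` and `build_converse_kwargs` are hypothetical names, and the timestamp is just an example of a dynamic value.

```python
from datetime import datetime, timezone

# Placeholder for the tutorial's FULL_SYSTEM; it must not contain per-request values.
STATIC_SYSTEM = "You are a senior customer support agent for SmartWidget.\n(rules and docs here)"


def build_converse_kwargs(question, history, now=None):
    """Build system/messages arguments with all dynamic values after the cache point."""
    now = now or datetime.now(timezone.utc)
    # Dynamic context travels with the user message, not the cached prefix.
    user_text = f"[Request time: {now:%Y-%m-%d %H:%M} UTC]\n{question}"
    return {
        "system": [
            {"text": STATIC_SYSTEM},
            {"cachePoint": {"type": "default"}},
        ],
        "messages": history + [{"role": "user", "content": [{"text": user_text}]}],
    }


first = build_converse_kwargs("What is the pricing?", [], now=datetime(2026, 1, 1, tzinfo=timezone.utc))
second = build_converse_kwargs("What is the pricing?", [], now=datetime(2026, 1, 2, tzinfo=timezone.utc))

print("System block identical across requests:", first["system"] == second["system"])
print("User message differs:", first["messages"] != second["messages"])
```

The returned dictionary can be unpacked into `bedrock.converse(modelId=..., **kwargs)` alongside the inference settings used elsewhere in this tutorial.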

Step 3 — Comparing Across Model Tiers

Full comparison table

These numbers are from a single benchmark run across all three models, using identical system prompts and questions:

Model	Baseline Monthly	Cached Monthly	Cost Reduction	Input Price
Nova Pro	$334.61	$169.99	49%	$0.80/M
Nova Lite	$30.33	$18.41	39%	$0.06/M
Nova Micro	$16.99	$9.47	44%	$0.035/M

Caching saves 39–49% across all three tiers. You might notice the Nova Pro number here (49%) differs slightly from the 54% in Step 2; these are separate benchmark runs with slightly different response lengths. The pattern holds: caching cuts costs by roughly half on Nova Pro and by about 40–45% on the cheaper tiers. The spread is not a pricing effect, since every tier applies the same 90% read discount and 25% write premium relative to its own input price (see the Cost Reference appendix). It most likely comes from each model producing responses of different lengths, which changes the share of the bill that caching cannot touch.

The real optimization: caching + model selection

Here's the number that matters for production. If your baseline uses Nova Pro without caching, and you switch to Nova Micro with caching, the combined savings are:

$334.61/month → $9.47/month — a 97% reduction.

Step 4 — Caching Tool Definitions for Agentic Workflows

If your agent uses tools, those JSON schema definitions are resent with every API call — just like the system prompt. Cache them too.

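You can place up to four cache points per request. A common pattern is two: one after the system content, one after the tool definitions. Support for cache points inside toolConfig has differed between model families, so confirm in the Bedrock prompt caching documentation that your model accepts them there before relying on this pattern.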

# Cache both system content and tool definitions

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",
    system=[
        {"text": FULL_SYSTEM},
        {"cachePoint": {"type": "default"}},      # Cache point 1: system + docs
    ],
    messages=messages,
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "lookup_order",
                    "description": "Look up a customer order by order ID",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "order_id": {
                                    "type": "string",
                                    "description": "The order ID to look up"
                                }
                            },
                            "required": ["order_id"]
                        }
                    }
                }
            },
            {
                "toolSpec": {
                    "name": "check_inventory",
                    "description": "Check inventory for a product SKU",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "sku": {"type": "string", "description": "Product SKU"}
                            },
                            "required": ["sku"]
                        }
                    }
                }
            },
            {"cachePoint": {"type": "default"}},   # Cache point 2: tool definitions
        ]
    },
    inferenceConfig={"maxTokens": 512},
)

Cache point rules to know

Step 5 — Monitoring Cache Performance in Production

Publishing cache metrics to CloudWatch

# 08_observability.py

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_cache_metrics(usage, model_id):
    """Publish cache performance metrics after each Bedrock call."""
    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    total_input = usage["inputTokens"] + cache_read + cache_write

    hit_rate = (cache_read / total_input * 100) if total_input > 0 else 0

    cloudwatch.put_metric_data(
        Namespace="BedrockLLMOps",
        MetricData=[
            {
                "MetricName": "CacheHitTokens",
                "Value": cache_read,
                "Unit": "Count",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
            {
                "MetricName": "CacheHitRate",
                "Value": hit_rate,
                "Unit": "Percent",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            },
        ],
    )

Setting up alerts

A sudden drop in cache hit rate usually means something changed in your prompt prefix — maybe a deployment modified the system prompt, or a dynamic value crept into what should be static content.

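cloudwatch.put_metric_alarm(
    AlarmName="LowCacheHitRate",
    MetricName="CacheHitRate",
    Namespace="BedrockLLMOps",
    # Must match the dimensions used in put_metric_data; an alarm without them
    # watches a different (empty) metric and never leaves INSUFFICIENT_DATA.
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-pro-v1:0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:llmops-alerts"],
)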
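To sanity-check the 70% threshold, it is worth computing what the alarm would have seen during the Step 2 run. The sketch below applies the same per-call formula as `publish_cache_metrics` to the five usage lines printed earlier and averages them, which is what the alarm's Average statistic does; the names `TURNS` and `hit_rate` are illustrative.

```python
# Usage values from the cached five-turn run in Step 2.
TURNS = [
    {"inputTokens": 10, "cacheReadInputTokens": 0, "cacheWriteInputTokens": 2069},
    {"inputTokens": 163, "cacheReadInputTokens": 2068, "cacheWriteInputTokens": 0},
    {"inputTokens": 266, "cacheReadInputTokens": 2067, "cacheWriteInputTokens": 0},
    {"inputTokens": 334, "cacheReadInputTokens": 2068, "cacheWriteInputTokens": 0},
    {"inputTokens": 425, "cacheReadInputTokens": 2067, "cacheWriteInputTokens": 0},
]


def hit_rate(usage: dict) -> float:
    """Per-call cache hit rate in percent, same formula as publish_cache_metrics."""
    read = usage.get("cacheReadInputTokens", 0)
    write = usage.get("cacheWriteInputTokens", 0)
    total = usage["inputTokens"] + read + write
    return read / total * 100 if total else 0.0


rates = [hit_rate(u) for u in TURNS]
average = sum(rates) / len(rates)

for turn, rate in enumerate(rates, start=1):
    print(f"Turn {turn}: {rate:.1f}%")
print(f"Average across the conversation: {average:.1f}%")
```

With every conversation contributing one 0% write call, a perfectly cached five-turn conversation averages right around the threshold, so consider tuning it to your own conversation lengths.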

Conclusion

One line of code, 49% off. That's the short version. Adding a cachePoint to our system block cut Nova Pro's monthly bill from $335 to $170, no infrastructure changes, no output quality difference. Across all three Nova tiers, savings landed between 39% and 49%.

The bigger win is combining caching with model selection. Nova Pro without caching to Nova Micro with caching: $335 down to under $10. A 97% reduction that mostly comes from the model switch, with caching shaving off the rest.

A few gotchas before you ship this:

Cache writes cost 25% more than regular input. Caching starts paying for itself on the second call that reuses the same prefix, so for single-turn workflows with no reuse, skip it.

Monitor your cache hit rate. A misconfigured prompt that changes the prefix between calls will miss the cache every time, silently. The CloudWatch alert from Step 5 catches this.

Where to go next

For production hardening, look into Bedrock Guardrails for content policy enforcement and LLM-as-a-Judge pipelines to continuously validate output quality as you swap models and tune prompts.

Appendix: IAM Policy for This Tutorial

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:PutMetricAlarm"
            ],
            "Resource": "*"
        }
    ]
}

Appendix: Cost Reference

All pricing is per 1 million tokens as of February 2026. Verify current pricing at the Amazon Bedrock pricing page.

Model	Input	Output	Cache Read (90% off)	Cache Write (25% premium)
Nova Pro	$0.80	$3.20	$0.08	$1.00
Nova Lite	$0.06	$0.24	$0.006	$0.075
Nova Micro	$0.035	$0.14	$0.0035	$0.044