Forem: SURYANSH GUPTA

What Is an Agent Harness? And Why Every AI Agent Needs One

SURYANSH GUPTA — Sat, 09 May 2026 07:28:09 +0000

If you've spent any time building with AI lately, you've probably heard the word "agent" thrown around a lot. But here's something that doesn't get talked about nearly as much: before you can have a real AI agent, you need a harness.

I know that term might sound unfamiliar or even a little abstract. When I first came across it, I had the same reaction. But once it clicked, I couldn't unsee it — and I genuinely think it's one of the most important concepts to understand if you want to go beyond just calling an LLM API and actually building something that does things autonomously.

Let's break it all down from scratch.

The Problem With "Just Using a Model"

Picture this: you've got API access to a powerful model like Claude or GPT-4. You send it a prompt, it sends back a response. That's great for chatbots and one-shot completions — but what if you want the model to:

Browse the web and pull real-time data?
Execute Python code to analyze that data?
Remember what you told it last week?
Coordinate across multiple steps — each one depending on the last?
Call your internal APIs or tools?

A raw model, on its own, can't do any of that. It can talk about doing those things, but it has no way to actually carry them out. It's like hiring a brilliant analyst who has no laptop, no internet, and can only communicate by passing notes. The intelligence is there — the infrastructure is not.

That missing infrastructure is the harness.

So, What Exactly Is an Agent Harness?

An agent harness is everything you build around a model to transform it from a text-generator into an agent that can act in the real world.

The cleanest formula I've come across is:

Agent = Model + Harness

Anything in your agent that isn't the model itself — is part of the harness.

In concrete terms, the harness typically includes:

The orchestration loop — the logic that takes a user message, asks the model what to do, runs that action, feeds the result back, and repeats until the task is complete.
Tool connections — the plumbing that lets the model call a browser, run code, query a database, or hit an external API.
Memory — short-term context within a session AND long-term memory that persists across sessions.
Context management — deciding what information goes into the prompt at each step (you can't just keep appending forever — models have token limits).
Compute and sandboxing — somewhere safe for the agent to run code without blowing up your system.
Authentication — so your agent can securely call external APIs without leaking credentials.
Observability — logs, traces, and debugging tools so you know what happened when things go sideways at 2 AM.
Session management — the ability for users to pause and resume, pick up right where they left off.

Look at any AI-powered product you use today — Claude Code, GitHub Copilot, Cursor, Perplexity — and behind the scenes, there's a harness doing all of this work. The model is just one piece of a much larger machine.

Why Harness Building Has Been So Painful

Here's the honest reality: building a harness from scratch is hard and time-consuming.

If you've done it before, you know the drill. You pick a framework — LangGraph, LlamaIndex, CrewAI, Strands Agents — and start writing code. You wire up your tools. You manage your prompt structure carefully so the model doesn't get confused. You add error handling for when tool calls fail. You build retry logic. You handle streaming output. You set up logging. You package everything into a container, provision some compute, and deploy it.

And then you realize you need session persistence. So you add a database. And then you realize you need the agent to authenticate with an external API. So you set up credential management. And now you need to understand why the agent went down a weird reasoning path, so you add tracing.

For a straightforward use case, this might take a few days. For a complex one, it could take weeks — and a whole team.

This is the real barrier to building with AI agents. Not the model. The harness.

Enter Managed Harnesses: The Agent Factory Model

Tooling has finally started catching up to this problem. The idea behind a managed harness is simple: instead of writing all that orchestration and infrastructure code yourself, you declare what your agent needs as configuration, and the service builds the harness for you.

Think of it like the difference between setting up your own server (writing harness code from scratch) versus using a managed cloud service (declaring config and letting the platform handle the rest).

Amazon Bedrock AgentCore is one of the services taking this approach. With AgentCore's harness feature, you define your agent in a JSON config file — model, system prompt, tools, memory settings — and the platform compiles that into a fully running agent, handling all the infrastructure underneath.

Under the hood, AgentCore harness uses Strands Agents (AWS's open-source agent SDK) to assemble the orchestration loop, tool execution, memory management, context handling, and streaming. Then it runs the whole thing inside an isolated microVM — its own dedicated CPU, memory, and filesystem — without you provisioning a single server.

Let's Build Something: An AI Trends Analyst in Minutes

To make this concrete, here's how you'd go from zero to a working AI agent using AgentCore harness — and yes, this genuinely takes about 5 minutes.

The Goal

Build an agent that browses HackerNews and dev.to, pulls today's top AI and developer tools posts, clusters them by topic, and produces a ranked summary with a chart — all autonomously.

Step 1: Install the CLI

npm install -g @aws/agentcore@preview

Step 2: Create Your Agent Config Interactively

agentcore create

This command walks you through a set of prompts — which model to use, which tools to enable, authentication type, and so on. At the end, it generates a config file like this:

{
  "name": "TrendsAgentHarness",
  "model": {
    "provider": "bedrock",
    "modelId": "global.anthropic.claude-sonnet-4-6"
  },
  "tools": [
    {
      "type": "agentcore_browser",
      "name": "browser"
    },
    {
      "type": "agentcore_code_interpreter",
      "name": "code-interpreter"
    }
  ],
  "skills": [],
  "authorizerType": "AWS_IAM"
}

That's it. The browser tool lets the agent navigate real websites. The code interpreter gives it a Python sandbox to crunch data and generate charts.

Step 3: Write Your System Prompt

Edit the system-prompt.md file that was created alongside the config:

Your job is to keep a pulse on what the AI and dev community is buzzing 
about right now. Every session, head over to HackerNews and dev.to, 
scrape today's hottest posts, then use the code interpreter to make sense 
of it all — cluster the topics, rank them by how often they show up, and 
summarize the top 5 in plain language. Throw in a bar chart. No fluff.

The system prompt is your agent's personality and operating instructions. This is where you define what it does, how it thinks, and what output you expect from it.

Step 4: Deploy It

agentcore deploy

Behind the scenes, this takes your config and system prompt, assembles a Strands Agents program, and deploys it into a managed microVM environment. No Dockerfile, no Kubernetes, no EC2 instance. Just one command.

Step 5: Invoke It

agentcore invoke --harness TrendsAgentHarness \
  --session-id "$(uuidgen)" \
  "What's trending in IT today?"

What happens when you run this:

The agent opens a browser and navigates to HackerNews.
It scrolls through and reads the top posts.
It does the same on dev.to.
It pulls all the results into the code interpreter.
It runs Python to cluster topics, calculate frequency, and build a bar chart.
It streams a formatted summary back to your terminal.

All of this runs in an isolated microVM that spins up for this session and tears down when it's done. No cross-session data leakage, no noisy neighbors.

What Comes Built In

Here's a breakdown of what AgentCore harness gives you without any extra setup:

Capability	What It Actually Means For You
Isolated microVM per session	Your agent gets its own CPU, memory, and filesystem. Sessions are completely isolated from each other.
Shell access	The agent can run shell commands directly without going through the model's reasoning loop — faster and cheaper.
Persistent filesystem	Mid-session, the agent can save files, pause, and resume exactly where it left off.
Model-agnostic routing	Switch between Bedrock, OpenAI, and Google Gemini. You can even change providers mid-session and the conversation context stays intact.
Built-in browser tool	Powered by AgentCore Browser — the agent can navigate real websites, not just search APIs.
Built-in code interpreter	A full Python sandbox. The agent can write and execute code, generate charts, process files, and more.
MCP server support	Connect to any MCP-compatible tool server — Slack, Notion, GitHub, whatever your workflow needs.
AgentCore Gateway	Connect to APIs you've registered centrally, so credentials are managed outside the agent.
Custom tool definitions	Define your own inline function tools for the agent to call.
Skills	Package domain knowledge as markdown + scripts and give your agent expert-level context on demand.
Full observability	Every action is auto-traced via AgentCore Observability, so you can debug and audit everything that happened.

Agent Skills: Teaching Your Agent Domain Expertise

One feature worth calling out specifically is skills. An agent skill is a bundle of markdown instructions and (optionally) scripts that gives your agent deep knowledge about a specific domain or workflow.

Think of it this way: you can train a general model on your specific context. For example:

A skill that teaches the agent how to work with your internal data format.
A skill that walks the agent through your company's API conventions.
A skill that gives the agent step-by-step knowledge of how to process Excel reports your way.

You package the skill into the agent's environment, point the harness at it, and the agent picks it up and uses it automatically when relevant. No fine-tuning. No custom model training. Just structured knowledge the agent can reference.

The Escape Hatch: When You Outgrow Config

One question you might be asking: "What happens when my use case gets complex enough that a config file isn't enough?"

That's a fair and important question. Maybe you need:

Custom multi-agent orchestration where agents hand off tasks to each other.
Specialized routing logic based on the content of a message.
A fully custom memory layer with your own vector database.
Integration with internal infrastructure that doesn't fit a standard pattern.

AgentCore harness has an answer for this: export to code.

When you need full control, you can export your harness configuration to Strands Agents code. You get the equivalent Python program that AgentCore was running for you — fully readable, fully editable — and you can extend it however you need. You stay on the same platform, just with more control.

This is a smart design. You start with the fast path (config), and you graduate to the custom path (code) only when you actually need it. You're not locked into one or the other.

Common Questions Answered

Do I need to build a harness if I'm just using Claude.ai or ChatGPT?

No. Those are consumer products where someone else already built the harness for you. You need to build your own when you're creating custom agents — ones that call your specific tools, connect to your internal systems, maintain state, or run autonomously over multiple steps.

Is a harness the same as an agent framework?

Not quite. A framework (like Strands Agents, LangGraph, or CrewAI) gives you the building blocks — tool interfaces, loop patterns, model connectors. A harness is the fully assembled, running system: framework code plus compute, sandboxing, memory, auth, and observability. You use a framework to build a harness, or you use a managed service to build one for you.

Can I build a harness without a framework?

Technically yes, but you'd be writing the entire orchestration loop, tool dispatch, error recovery, and context management from scratch. Frameworks exist precisely so you don't have to. It's a bit like writing raw socket code instead of using Express.js — possible, but almost never the right call.

Is the browser tool expensive on tokens?

Yes, it does consume more tokens than simpler tools since it's processing full web pages. For the trends analyst use case, it's absolutely worth it. For agents that need lighter-weight data fetching, you might want to explore API-based tools or MCP servers that return structured data instead.

Why This Matters for the Community

For a long time, building a production-grade AI agent required deep expertise across model APIs, orchestration frameworks, cloud infrastructure, and security. That's a lot of disciplines to combine, and it's been a genuine barrier for developers who want to experiment and build.

Managed harness services like AgentCore change that equation. The gap between "I have an idea for an agent" and "I have a running agent" is now measured in minutes for straightforward use cases. That's genuinely exciting.

It also means the interesting work shifts. Instead of spending your energy on infrastructure plumbing, you can focus on:

What should your agent actually do?
What domain knowledge does it need?
What tools should it have access to?
How should it reason and communicate?

Those are the questions worth spending your time on.

Where to Go From Here

AgentCore harness is currently in public preview in four AWS regions: US West (Oregon), US East (N. Virginia), Europe (Frankfurt), and Asia Pacific (Sydney).

Here are the resources to get started:

The trends analyst agent described in this post — browsing HackerNews, clustering AI topics, generating a chart — took about 5 minutes from idea to first working invocation. The JSON config is 15 lines. The system prompt is 5 lines.

What would you build with 5 minutes and a config file? I'd love to see what the community comes up with. Drop your ideas or experiments in the comments.

If this post helped you understand agent harnesses better, consider sharing it with someone who's been struggling to wrap their head around the agent architecture puzzle. And if you're already building harnesses the hard way, maybe it's time to let the factory do some of that work for you.

Cut Amazon Bedrock Costs with a 3-Layer Caching Pipeline on AWS Lambda + ElastiCache

SURYANSH GUPTA — Tue, 05 May 2026 03:46:22 +0000

If you're building AI-powered apps on AWS, you've probably felt the sting of Bedrock inference costs. Every token counts — and when users hammer your app with similar or identical questions, you're paying for the same answer over and over again.

In this post I'll walk through a three-layer caching and optimization pipeline I built inside a single Lambda function backed by ElastiCache (Redis). By the end, you'll have a pattern that can dramatically reduce Bedrock calls in any support chatbot, internal knowledge assistant, or document Q&A tool you're shipping.

Here's what we're building:

User prompt → Hash Check → Semantic Check → Prompt Compression → Bedrock → Cache Write
                  ↓               ↓
             hash_hit        semantic_hit

Architecture at a Glance

Component	Role
AWS Lambda (Python)	Caching logic, embedding, compression
Amazon ElastiCache (Redis 7.1)	Persistent shared memory across invocations
Amazon Bedrock (Nova Micro)	Foundation model, only called on a true miss
Titan Embeddings v2	Converts prompts to semantic vectors

Because Lambda is stateless, every invocation starts fresh with zero memory of prior calls. ElastiCache fills that gap — it's the shared brain that persists across invocations and across different users hitting your function simultaneously.

Layer 1 — Hash-Based Caching: The Fastest Win

Before anything touches Bedrock, we check whether we've already answered this exact question.

The trick is normalizing the prompt first — lowercase, collapse whitespace — so " What is MACHINE LEARNING? " and "what is machine learning?" produce the same SHA-256 fingerprint and share one cache entry.

def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().split())

def compute_hash(prompt: str) -> str:
    return hashlib.sha256(normalize(prompt).encode()).hexdigest()

On every invocation, we check Redis with the hash: prefix before doing anything else:

hash_key = HASH_PREFIX + compute_hash(prompt)
cached = redis_client.get(hash_key)
if cached:
    return {"response": cached, "cache": "hash_hit"}

A hash hit costs you a single Redis GET — no embedding call, no Bedrock invocation, no tokens burned. This is the fastest and cheapest path through the entire pipeline.

When does this shine? Any FAQ-style workload where users repeatedly ask the same questions. Support bots. Help center chatbots. Internal HR assistants.

Layer 2 — Semantic Similarity Caching: Catching Paraphrases

Hash-based caching misses paraphrases. "What is machine learning?" and "How would you define machine learning?" are semantically identical but produce completely different hashes.

Semantic caching solves this with vector embeddings. We convert every prompt to a list of floats that encodes its meaning, then compare incoming prompts to stored vectors using cosine similarity.

def embed(prompt: str) -> np.ndarray:
    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
    response = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": prompt}),
        contentType="application/json",
        accept="application/json"
    )
    body = json.loads(response["body"].read())
    return np.array(body["embedding"], dtype=np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

Since Redis stores bytes, not arrays, we serialize the vector with struct.pack before writing and unpack it on read:

def serialize_embedding(vector: np.ndarray) -> bytes:
    return struct.pack(f"{len(vector)}f", *vector)

def deserialize_embedding(data: bytes) -> np.ndarray:
    n = len(data) // 4
    return np.array(struct.unpack(f"{n}f", data), dtype=np.float32)

In the handler, after a hash miss we embed the incoming prompt and scan stored vectors:

query_vector = embed(prompt)
stored = load_embeddings()
best_score, best_response = 0.0, None

for _, vector, response in stored:
    score = cosine_similarity(query_vector, vector)
    if score > best_score:
        best_score = score
        best_response = response

if best_score >= SIMILARITY_THRESHOLD:
    return {"response": best_response, "cache": "semantic_hit", "score": round(best_score, 4)}

The SIMILARITY_THRESHOLD environment variable (default 0.90) is your dial for how aggressive the matching should be. Lower it to 0.80 and you'll catch more paraphrases at the risk of serving a slightly off response. Tune it against your own traffic.

💡 In practice, I've seen semantic_hit catch prompts like "Explain ML to me" against a cached answer for "What is machine learning?" with a score around 0.94 — well above threshold, and a completely avoided Bedrock call.

Layer 3 — Prompt Compression: Saving Tokens on Every Miss

Even with two cache layers, some prompts will always be new. Prompt compression squeezes cost out of every genuine cache miss by stripping filler language before the prompt reaches Bedrock.

Filler phrases like "Could you please", "I was wondering if", or "As an AI language model" consume tokens without improving the model's response. We maintain a simple list and strip them at runtime:

FILLER_PHRASES = [
    "please could you",
    "i was wondering if",
    "could you please",
    "i would like you to",
    "as an ai",
    "can you please",
    # ... extend this list based on your traffic patterns
]

def compress(prompt: str) -> str:
    compressed = prompt.lower()
    for phrase in FILLER_PHRASES:
        compressed = compressed.replace(phrase, "")
    compressed = " ".join(compressed.split())

    original_tokens = len(prompt.split())
    compressed_tokens = len(compressed.split())
    print(f"[compression] original: {original_tokens} tokens, "
          f"compressed: {compressed_tokens} tokens, "
          f"saved: {original_tokens - compressed_tokens}")

    return compressed

The CloudWatch log line gives you a measurable view of the savings on every miss — you can query these logs over time to identify your most common filler patterns and keep optimizing the list.

One critical design decision: compression runs after both cache checks, not before.

If you compressed first, you'd alter the prompt before hashing it — so "Could you please explain ML?" and "Explain ML" would hash to the same key on the second call but different keys on the first, breaking cache consistency. The original prompt is always used for cache lookups; compression is purely a token cost optimization that only fires when a Bedrock call is actually going to happen.

The Full Pipeline in `lambda_handler`

Putting it all together, the handler becomes a clean sequential pipeline:

def lambda_handler(event, context):
    prompt = event.get("prompt", "").strip()

    # Layer 1: Exact hash match — fastest path, zero AI calls
    hash_key = HASH_PREFIX + compute_hash(prompt)
    cached = redis_client.get(hash_key)
    if cached:
        return {"response": cached.decode(), "cache": "hash_hit"}

    # Layer 2: Semantic similarity — catches paraphrases
    query_vector = embed(prompt)
    stored = load_embeddings()
    best_score, best_response = 0.0, None
    for _, vector, response in stored:
        score = cosine_similarity(query_vector, vector)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= SIMILARITY_THRESHOLD:
        return {"response": best_response, "cache": "semantic_hit", "score": round(best_score, 4)}

    # Layer 3: Compress before sending to Bedrock
    compressed_prompt = compress(prompt)
    response_text = call_bedrock(compressed_prompt)

    # Write back both hash and embedding for future hits
    redis_client.set(hash_key, response_text.encode(), ex=CACHE_TTL_SECONDS)
    embed_key = EMBED_PREFIX + compute_hash(prompt)
    store_embedding(embed_key, query_vector, response_text)

    return {"response": response_text, "cache": "miss"}

Key Observations from Testing

Prompt	Cache Result	Bedrock Call?
`"What is machine learning?"` (1st call)	`miss`	✅ Yes
`"What is machine learning?"` (2nd call)	`hash_hit`	❌ No
`" What is MACHINE LEARNING? "`	`hash_hit`	❌ No
`"How would you define machine learning?"`	`semantic_hit (0.94)`	❌ No
`"Could you please explain what machine learning is?"`	`miss` → compressed	✅ Yes (fewer tokens)

When to Use This Pattern

This three-layer pipeline is most valuable when:

Query volume is high — the cost savings on cache hits compound quickly at scale
Users tend to ask similar questions — support bots, knowledge bases, FAQ tools
Prompts are verbose — compression delivers more savings when users write long-winded queries
Latency matters — a Redis GET is orders of magnitude faster than a Bedrock roundtrip

It's less impactful for highly creative or unique queries (content generation, code synthesis) where every prompt is genuinely different and semantic similarity won't trigger often.

What I'd Do Differently in Production

A few things worth considering as you take this pattern to prod:

Replace the linear embedding scan with a proper vector search (Redis Stack's HNSW index, or OpenSearch with k-NN). Scanning every stored embedding is fine at low volume but doesn't scale.
Instrument cache hit rates with CloudWatch metrics so you can track ROI over time and justify the ElastiCache spend.
Tune SIMILARITY_THRESHOLD per use case. A support bot can be aggressive (0.85); a medical or legal assistant should be conservative (0.95+).
Analyze your CloudWatch compression logs weekly and update FILLER_PHRASES based on real traffic patterns.
Add a warm-up step for known common queries — pre-populate the cache on deploy so the very first user gets a cache hit.

Wrapping Up

Three layers, one Lambda function, one ElastiCache cluster. Together they cover the most common sources of Bedrock cost:

Layer	What it eliminates
Hash caching	Exact duplicate calls
Semantic caching	Paraphrased duplicate calls
Prompt compression	Excess tokens on every genuine miss

The pattern is modular — you can adopt any one layer independently, and each one pays for itself at a different traffic threshold. Start with hash caching (zero additional AWS cost beyond ElastiCache), add semantic caching once you see recurring paraphrases in your logs, and layer in prompt compression as your prompt corpus grows longer.

If you're building on Amazon Bedrock, this is one of the highest-ROI architectural patterns you can drop into an existing Lambda-based backend with minimal rework.

Built and tested as part of an AWS hands-on lab. All code runs on Python 3.12, Redis 7.1, and Amazon Bedrock Nova Micro via a cross-region inference profile.

Have questions or want to share your own caching numbers? Drop them in the comments below 👇