Forem: Shilpa Mitra

How Claude Code Achieves a 92% Cache Hit Rate: A Deep Dive Into Prompt Caching for AI Agents

Shilpa Mitra — Sun, 24 May 2026 17:08:31 +0000

If you're running AI agents in production, there's a cost you're probably not thinking about.

Every turn in an agentic conversation sends the full prompt to the model. That includes the system instructions, all the tool definitions, any project context that was loaded earlier, and the entire conversation history. The model processes all of it. From the top. Every single time.

For a quick two-turn interaction, this doesn't matter much. But for a 50-turn coding session where the system prompt alone is 20,000 tokens? That's 1 million tokens of repeated computation across the session, all billed at full input price, all producing zero new insight. The model already processed that system prompt 49 turns ago. It's just doing it again because nothing told it not to.

This is the problem prompt caching solves. And Claude Code is probably the best case study of how to do it right.

The Two Parts of Every Prompt

The first thing to understand is that not all tokens in a prompt are created equal.

Look at any agentic API call and you'll see two distinct layers:

The foundation. This is everything that stays the same from turn to turn. System instructions, tool schemas, project-level context like a CLAUDE.md file, behavioral rules. If you looked at turn 1 and turn 47 side by side, this part would be identical.

The conversation. This is everything that's different each turn. The user's latest message, tool call results, file contents that were just read, terminal output. This grows with every interaction and is genuinely new information the model needs to process.

The entire trick behind prompt caching is recognizing that the foundation doesn't need to be reprocessed. You compute it once, store the result, and reuse it on every subsequent turn. The model only does fresh work on the conversation layer.

What's Actually Being Cached: The Transformer Angle

This isn't just skipping a string comparison. To understand why caching cuts costs so dramatically, you need to know what the model does when it reads a prompt.

LLM inference has two stages:

Prefill: the model takes your entire input and runs it through dense matrix multiplications, token by token, building an internal representation. This is computationally expensive and it's where most of the time and cost goes.
Decode: the model generates its response one token at a time, mostly just reading from the state it already built during prefill.

During prefill, the model computes three vectors for every token: Query, Key, and Value. These are the building blocks of the attention mechanism, how the model figures out which parts of the input matter for which other parts.

The important property: Key and Value vectors for any given token only depend on the tokens before it. They're deterministic. If the input is the same, the output is the same.

So once you've computed the Key-Value pairs for a 20,000-token system prompt, you can store them. Next time a request comes in with that same prefix, you skip the entire prefill computation for those 20,000 tokens and go straight to processing the new content.

Anthropic's infrastructure does this by hashing the input prefix. Same hash, same cached tensors, no recomputation. Different hash (even one byte different), full recomputation.

The Economics

Here's where this gets concrete. Anthropic's caching pricing has three tiers:

Operation	Multiplier	What it means
Cache reads	0.1x base input price	90% discount on every cached token
5-minute cache writes	1.25x base input price	Small premium to store the KV tensors
1-hour cache writes	2x base input price	Extended TTL for longer sessions

For Claude Sonnet 4.6 ($3/MTok base input), here's what that looks like in practice:

Standard input:     $3.00 / MTok
Cache read:         $0.30 / MTok   (90% savings)
5-min cache write:  $3.75 / MTok   (25% premium, one-time)
1-hour cache write: $6.00 / MTok   (2x premium, one-time)

A cache hit costs 10% of standard input. That means caching pays for itself after just one subsequent read for the 5-minute duration. For a 50-turn session reusing a 20,000-token prefix, the savings compound on every single turn.

Tracking a Real Claude Code Session

Theory is nice. Let's trace the actual token economics of a single debugging session to see where the money goes.

You open Claude Code in a Next.js project. The moment the session starts, it loads the system prompt, all available tool definitions (file read, file write, bash, grep, glob, and others), and your project's CLAUDE.md. That initial payload lands somewhere around 20,000 tokens. Every single one of those tokens is processed fresh. This is the only time you pay full price for them.

You type:

"There's a race condition in the checkout flow. Orders are occasionally duplicating when users double-click the submit button."

Claude Code doesn't just start editing files. First, it spins up an Explore subagent to understand the codebase. That subagent reads your API routes, checks your database schema, looks at your order processing logic, and examines the frontend form handler. All of those file reads and grep results get appended to the growing conversation as tool outputs.

Here's the key: none of that new content touches the 20,000-token prefix. The system prompt, the tool definitions, the CLAUDE.md, all of that is still sitting in cache from turn one. Every subsequent API call reads those 20,000 tokens at $0.30/MTok instead of $3.00/MTok. You're only paying full price for the new stuff: your message and the tool outputs.

The Explore subagent finishes and hands its findings back to the main agent. But it doesn't dump 15,000 tokens of raw file contents into the conversation. It passes a condensed summary: which files are relevant, what the current logic does, where the race condition likely lives. This is a deliberate design choice. Keeping the dynamic tail compact means the cache ratio stays high.

Now the Plan subagent kicks in. It takes the summary, reasons through the fix (idempotency key on the frontend, deduplication check on the API, database unique constraint as a safety net), and produces a step-by-step implementation plan. You approve it. Claude Code starts writing code.

Over the next 15 minutes, you go back and forth. It writes the idempotency logic, you ask it to also handle the case where the page refreshes mid-checkout, it adjusts. Each of these turns adds new content to the dynamic tail. But the foundation, those 20,000 tokens, is read from cache every single time. Each cache hit also resets the TTL, so the cache never expires as long as you keep working.

By the end of the session, you've gone through maybe 25 turns. The total tokens processed easily exceeds 1.5 million. But if you run /cost, the bill tells a very different story than 1.5M tokens at full price. The vast majority were cache reads at a 90% discount.

That's the difference between a $4.50 session and a $0.90 session. For one debugging task.

The Production Numbers

This isn't theoretical. Claude Code's production metrics:

Metric	Value
Cache hit rate	92%
Cost reduction	81%
First-token latency reduction	79%

In active sessions, 95%+ of input tokens are typically cache hits, billed at 0.1x the base price. Out of 400K tokens in a session, maybe 20K to 40K are billed at full price.

Without prompt caching, a long Opus coding session (100 turns with compaction cycles) can cost $50 to $100 in input tokens. With it, $10 to $19.

The One Thing That Will Tank Your Cache Hit Rate

Prompt caching has a gotcha that trips up almost everyone the first time.

The cache key is a hash of the exact byte sequence of your prompt prefix. Not the meaning. Not the content. The exact bytes, in the exact order. If you rearrange two paragraphs in your system prompt, the hash changes. Full cache miss. Everything recomputed at full price.

This has three practical consequences:

1. Don't change your tool set mid-session

Tool definitions are part of the cached prefix. If you add a tool on turn 12 that wasn't there on turn 1, every token after the change point is a cache miss. Load everything you might need at the start.

2. Don't switch models mid-conversation

Each model has its own cache. Moving from Opus to Sonnet to save money on a later turn means rebuilding the cache from zero for the new model. You'll spend more on the rebuild than you saved on the cheaper rate.

3. Don't edit the system prompt to update state

If your agent needs to track something (like "user is now authenticated"), don't inject that into the system prompt. Append it as a note in the next user message instead. The system prompt stays byte-identical, the cache stays valid.

Claude Code follows all three of these rules religiously. That's how it maintains a 92% hit rate across millions of sessions.

Applying This to Your Own Agents

If you're building on the Anthropic API, the same principles apply. Here's the practical playbook.

Prompt structure matters

Put the most stable content at the top:

1. System instructions and rules        (most stable, cached first)
2. Tool definitions                      (stable for session duration)
3. Reference documents / retrieved context
4. Conversation history + tool outputs   (dynamic, grows each turn)

The cache works from the top down. Everything above the first change point stays cached. Everything below it gets recomputed.

Use auto-caching

Anthropic's API now supports automatic cache management. You add a single cache_control field to your request and the system handles breakpoint placement for you:

{
  "model": "claude-sonnet-4-6-20260514",
  "max_tokens": 1024,
  "cache_control": { "type": "ephemeral" },
  "system": "Your system prompt here...",
  "messages": [...]
}

It moves the cache boundary forward as the conversation grows and more content becomes stable. Before this existed, you had to manually calculate token boundaries. Getting it wrong meant missing the cache entirely.

Compact without breaking the cache

When your conversation hits the context limit and you need to summarize it down, keep the system prompt and tool definitions identical. Add the compaction instruction as a new user message. The cached prefix stays valid. You only pay fresh tokens for the compaction prompt itself.

Monitor your hit rate

Every API response includes three fields you should be tracking:

{
  "usage": {
    "cache_creation_input_tokens": 15200,
    "cache_read_input_tokens": 184800,
    "input_tokens": 3400
  }
}

cache_creation_input_tokens: tokens written to cache (first time processing)
cache_read_input_tokens: tokens read from cache (the cheap ones)
input_tokens: tokens processed at full price (no cache available)

The ratio of cache_read_input_tokens to total input tokens is your cache efficiency score. Track it like you'd track uptime. A sudden drop means something in your prompt structure changed and invalidated the cache.

Key Takeaways

Prompt caching isn't a setting you flip on and forget about. It's an architectural pattern that has to be baked into how your agent constructs its prompts, manages its tools, and handles long conversations.

Claude Code shows what this looks like when it's done well: 92% cache hit rate, 81% cost reduction, built on stable prefixes, subagent summarization, and cache-aware context management.

If you're building agents and not thinking about your cache architecture, you're leaving most of your budget on the table.

We break down AI infrastructure and tooling like this regularly at Web After AI. Practical, no hype, explained so it actually makes sense.

The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team

Shilpa Mitra — Fri, 22 May 2026 11:56:54 +0000

Most people set up an AI agent and immediately start thinking about multi-agent architectures. Orchestrators, specialist swarms, automated pipelines. That's Level 4 thinking applied to a Level 1 setup, and it's how you end up with a fleet of agents shipping garbage at scale.

Hermes Agent by Nous Research (160K+ stars, fastest-growing open-source agent of 2026) is built for exactly this kind of progressive scaling. It's self-hosted, self-improving, stores everything locally in SQLite, and supports multi-agent orchestration out of the box as of v0.6.0.

But the framework below isn't Hermes-specific. It applies to any agent system. The tool doesn't matter as much as the progression.

Here are the four levels, what each one looks like in practice, and how to know when you're actually ready to move up.

First: What Hermes Agent Is

Hermes is an autonomous AI agent that runs on your machine or VPS. It takes a goal, breaks it into steps, picks from 47 built-in tools to execute, and iterates until the task is done. Everything stays local.

What sets it apart: after each task, Hermes writes a structured record of what worked and what didn't into episodic memory. On future tasks with similar patterns, it retrieves those records and adjusts its approach before starting. It also creates reusable "skills" from experience, essentially building procedural memory that improves over time.

It connects to 20+ messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, and more), supports MCP servers, and runs across 6 terminal backends (local, Docker, SSH, Daytona, Singularity, Modal).

Install:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Or via pip:

pip install hermes-agent
hermes postinstall

Then configure:

hermes doctor      # check your environment
hermes model       # pick a model
hermes config set  # add API keys
hermes             # start the agent

Takes about 60 seconds on Linux, macOS, or WSL2.

Level 1: The Main Agent

You → Your Soul Hermes Agent

This is where everyone starts, and where most people should stay for weeks, not days.

Your single Hermes instance is your prototype area. You test workflows here. You refine prompts. You figure out which tasks the agent handles well and which ones it fumbles. You build up its memory and skills on your specific work.

At this level, Hermes doubles as your orchestrator by default. You give it a complex task, it breaks it down, it executes. The self-improving loop is already running: every completed task makes it slightly better at similar tasks next time.

What to do at Level 1

Run real work through it daily. Not toy examples. Actual tasks from your workflow. The memory system only gets useful with real data.
Manage its memory actively. Use /recall to search what it remembers and /remember to manually save important context. Correct it when it gets things wrong.
Install skills or let it create them. Skills are procedural memory. Hermes can build them from experience, or you can install community-contributed ones from the Skills Hub.
Connect one messaging platform. Telegram is the easiest. Run hermes gateway setup to get always-on access from your phone. This changes the dynamic from "sitting at my terminal to use AI" to "texting my agent whenever I need something."

When to move on

When you have at least 2-3 workflows that are consistently producing good output. Not acceptable output. Not "close enough." Good output that you'd be comfortable shipping without heavy editing.

This is the most important checkpoint in the entire framework. Everything that comes after multiplies the quality you establish here.

Level 2: Specialized Agents

You → SEO Agent
You → Content Pipeline Agent
You → DevOps Agent

Once a workflow is solid and repeatable, break it out into its own Hermes instance with its own credentials, memory, and scope.

Why separate instances?

Context pollution. An agent that handles your SEO research, your email drafting, and your code reviews is juggling three different domains in one memory space. Its SEO skills get diluted by code review patterns. Its writing voice gets contaminated by technical documentation habits.

Specialized agents have cleaner memory, more focused skills, and better output because they only learn from one domain.

How to do this practically

Each Hermes instance runs independently. Use different configuration profiles, or spin each one up in its own Docker container or VPS.

# Different profiles for different agents
HERMES_PROFILE=seo hermes
HERMES_PROFILE=contentpipeline hermes
HERMES_PROFILE=devops hermes

Each profile gets its own SQLite database, its own memory, its own skill library. You talk to each one directly. You're still the orchestrator at this stage, manually deciding which agent handles which task.

What to do at Level 2

Write a scope document for each agent. What it does, what it doesn't do, what tools it has access to. This isn't bureaucracy. It's how you prevent scope creep across agents.
Let each agent build its own skill library within its domain. The SEO agent's skills should be about keyword research and competitor analysis, not email copywriting.
Keep the count low. 2-3 specialists is plenty to start. The temptation to spin up a new agent for every task is strong. Resist it.

When to move on

When you're spending more time routing tasks between agents than actually reviewing their output.

Level 3: Orchestrated Team

You → Orchestrator Agent
           ↓
     Your Specialized Agents

Now you bring the orchestrator agent back. But this time it's not your prototype agent wearing multiple hats. It's a dedicated Hermes instance whose only job is routing tasks to your specialists and synthesizing their outputs.

Hermes v0.6.0 added multi-agent orchestration. The orchestrator analyzes a complex task, identifies the optimal work breakdown, and spawns specialist worker agents with tailored context. Each worker gets its own scope and tools, returns a verifiable artifact, and records the handoff.

Example workflow

You tell the orchestrator: "Research competitors in the CRM space and draft a blog post about our differentiators."

The orchestrator:

Routes the research task to your Research agent
Takes the research output and routes the writing task to your Content agent
Synthesizes the outputs into a final deliverable
Returns it to you for review

You still review the final output. You're not out of the loop. You're just not manually routing between agents anymore.

What to do at Level 3

Set up task tracking. Kanban-style works well. You need visibility into what each agent is working on, what's queued, and what's done.
Define handoff protocols. What does the research agent pass to the content agent? What format? What level of detail? Ambiguous handoffs create ambiguous output.
Review regularly. Quality issues compound fast in multi-agent setups. A small drift in the research agent's output becomes a big problem by the time it's been through two more agents.

When to move on

When the orchestrator's routing decisions are consistently correct and the specialist outputs consistently meet your quality bar without heavy editing.

Level 4: Automated Team

Cron Job / Trigger Events → Orchestrator Agent
                       ↓
                 Full Agent Team

This is where you step out of the loop for routine work. Cron jobs and event triggers fire tasks into the orchestrator. The orchestrator routes them to the team. The team handles the work asynchronously.

What this looks like in practice

Every Monday at 8am, the orchestrator triggers your SEO agent to pull keyword rankings, your content agent to draft the weekly newsletter outline, and your ops agent to generate a metrics report.
When a new competitor blog post is published (event trigger), the research agent analyzes it and the content agent drafts a response piece.
When a support ticket hits a specific tag, the ops agent drafts a response for your review queue.

The task bus handles queuing and routing. Agents pick up work, complete it, and log results. You check in when you want to, not because you have to.

What to do at Level 4

Start with one automated workflow, not ten. Get one cron job running reliably before adding more. Debugging a broken automation is harder when you have twelve of them running simultaneously.
Build in quality gates. Not every output needs your review, but have the orchestrator flag anything that falls below a confidence threshold for human review.
Monitor closely at first. The trust you build here is earned, not assumed. Look at outputs daily for the first two weeks, then taper to spot-checks.

The Part That Matters More Than Any of This

Take small steps. You do NOT want to automate slop.

If your output at Level 1 is mediocre, you are about to scale mediocrity. 20 agents shipping low-quality work at speed is worse than 3 shipping great work slowly. Every level multiplies whatever quality you've established at the level before it.

I'd rather run fewer agents with better output than max the agent count and spit out more of the same.

The progression isn't about moving fast. It's about moving when you're ready. Level 1 might take you a month. Level 2 might take another month. That's fine. The agents aren't going anywhere. Your quality bar is what matters.

Resources

I write about practical AI agent workflows, open-source tools, and the infrastructure behind them at Web After AI. No hype, just stuff you can actually use.

4 GitHub Repos That Prove AI Agents Aren't Just for Coding Anymore

Shilpa Mitra — Thu, 21 May 2026 17:08:15 +0000

Six months ago, "AI agent" basically meant "coding assistant." Claude Code, Copilot, Cursor. All doing the same thing: helping you write code.

That's changing. The most interesting open-source projects right now aren't building yet another coding agent. They're building agents that specialize: agents that trade stocks, agents that run your entire content marketing operation, agents that make your coding agent actually follow engineering discipline.

The model is the same underneath. The harness around it is what makes it useful for a specific job.

Here are four repos that show where this is heading, with setup instructions for each.

1. mattpocock/skills (91.7K stars) — Make Your Coding Agent an Actual Engineer

Repo: github.com/mattpocock/skills

Matt Pocock (the TypeScript educator behind Total TypeScript) open-sourced his personal .claude directory. It's a collection of skills that fix the most common failure modes of AI coding agents: building the wrong thing, skipping tests, producing code that works but is impossible to maintain, and declaring "done" when nothing actually compiles.

Most people treat their coding agent like an intern with no process. Matt's skills give it the process.

The standout: `/grill-me`

This skill forces the agent to interrogate you about what you actually want before writing a single line of code. It's a structured interview that catches misalignment before it becomes a wasted hour. There's also /grill-with-docs, which does the same thing but additionally builds a shared vocabulary between you and the agent in a CONTEXT.md file.

The CONTEXT.md approach is quietly brilliant. Instead of the agent using 20 words to describe something, you teach it your project's jargon. Over time, the agent's outputs get shorter, more precise, and the variables and functions it creates use consistent naming. It also reduces token usage, because concise terminology means shorter prompts and responses.

Other skills worth knowing

/tdd — Test-driven development with red-green-refactor. The agent writes a failing test first, then fixes it. Far better code quality than "write the feature, then maybe add tests."
/diagnose — Disciplined debugging loop: reproduce, minimise, hypothesise, instrument, fix, regression-test.
/improve-codebase-architecture — Finds structural improvements using your project's domain language from CONTEXT.md.
/handoff — Compacts the current conversation into a handoff document so another agent (or a new session) can continue the work without losing context.
/caveman — Ultra-compressed communication mode. Cuts token usage by roughly 75% while keeping full technical accuracy. Useful when you're burning through credits.

Setup

npx skills@latest add mattpocock/skills

Pick the skills you want and which coding agents to install them on. Make sure you select /setup-matt-pocock-skills during install. Then run that command in your agent, and it'll configure your issue tracker (GitHub, Linear, or local files), triage labels, and docs location. Works with Claude Code, Cursor, Codex, and others.

How it compares to Addy Osmani's agent-skills

If you've seen addyosmani/agent-skills, you might wonder how these differ. Addy's skills focus on the full development lifecycle with slash commands like /spec, /plan, /build, /ship. Matt's skills focus more on engineering fundamentals: alignment, testing discipline, debugging, and architecture quality. They're complementary, not competing. You can run both in the same project.

2. AI-Trader (13.7K stars) — Let AI Agents Trade for You

Repo: github.com/HKUDS/AI-Trader

AI-Trader is an agent-native trading platform built by researchers at the University of Hong Kong. The core idea: just like humans have their trading platforms, AI agents need their own.

You connect your AI agent (Claude Code, Cursor, OpenClaw, Codex, whatever), and it can publish trading signals, copy trades from top-performing agents, participate in strategy discussions, and access real-time market data across stocks, crypto, forex, options, and futures.

Why it's interesting

This isn't just one agent making trades. It's a platform where multiple agents collaborate, debate strategies, and learn from each other. They call it "collective intelligence trading."

Agents publish three types of signals:

Strategies — for discussion and analysis
Operations — for direct copy trading
Discussions — for collaborative reasoning

There's a reward system where agents earn points for successful predictions, and a $100K paper trading mode so you can test without risk.

Setup

The simplest way to connect an agent:

Read https://ai4trade.ai/SKILL.md and register.

Send that message to your AI agent. It reads the integration guide, installs the necessary components, and registers itself on the platform. For human traders, visit ai4trade.ai and sign up directly.

For developers who want to self-host:

git clone https://github.com/HKUDS/AI-Trader.git
cd AI-Trader
npm install

The backend is FastAPI (Python), frontend is React. Full OpenAPI docs are in docs/api/openapi.yaml.

A word of caution

Automated trading carries real financial risk. AI-Trader includes paper trading mode for a reason. Start there. The fact that it comes from a university research group rather than a fintech startup trying to sell you something is a point in its favor, but treat any trading system with healthy skepticism.

3. AiToEarn (12.2K stars) — AI Agent for Content Marketing Across 14 Platforms

Repo: github.com/yikart/AiToEarn

AiToEarn is an open-source content marketing platform with an AI agent built in. You create content once, and it publishes across 14 platforms simultaneously: TikTok, YouTube, Instagram, Twitter/X, LinkedIn, Pinterest, Facebook, Threads, plus Chinese platforms like Douyin, Xiaohongshu (Rednote), Bilibili, WeChat, and Kuaishou.

The "All In Agent"

This is the interesting part. It's an AI agent that can automatically generate content, publish it, and manage your accounts across all platforms. Beyond publishing, the platform includes:

Trend radar — what's going viral right now across platforms
Case library — how posts with 10K+ likes were structured
Smart comment search — finds high-conversion signals like "link please" or "how to buy" across your accounts
Cross-platform analytics — unified dashboard for all your channels

The comment search feature is particularly useful for anyone doing content-driven sales. It surfaces purchase-intent comments so you can reply fast and convert.

Setup

Docker (recommended):

git clone https://github.com/yikart/AiToEarn.git
cd AiToEarn
docker compose up -d

This starts the frontend, backend, MongoDB, and Redis in one command. Access the web interface at http://localhost:8080. There's also an Electron desktop app available from the GitHub releases page.

Note on documentation

The project originated in China. The English README and Docker deployment guide are solid, but some deeper configuration docs are still in Chinese. AI video model integrations (Kling, Sora, Runway, etc.) are listed as coming soon.

4. DeepSeek-TUI (Trending) — Claude Code, but for DeepSeek

Repo: github.com/Hmbown/DeepSeek-TUI

A terminal-based coding agent built specifically for DeepSeek models. If you've used Claude Code, the experience is similar: you type prompts in your terminal, the agent reads your files, edits code, runs shell commands, does git operations, and browses the web. The difference is it's built from the ground up for DeepSeek's API, which is significantly cheaper than Claude or GPT-4.

Three modes

Mode	What it does
Plan	Review a plan before the agent makes changes
Agent	Default interactive mode with multi-step tool use
YOLO	Auto-approve everything in a trusted workspace

Tab to cycle between them. It also supports MCP servers, session resume, and can run as an HTTP/SSE API server.

Built in Rust, so it's fast and lightweight.

Setup

npm install -g deepseek-tui
deepseek-tui

On first launch it'll ask for your DeepSeek API key. You can also set it beforehand:

deepseek-tui login

Or via environment variable:

DEEPSEEK_API_KEY="your-key" deepseek-tui

Configuration lives in ~/.deepseek/config.toml. Useful commands: deepseek-tui doctor (check setup), deepseek-tui models (list available models).

Also available via Rust:

cargo install deepseek-tui --locked

The Pattern

What connects all four of these: the model isn't the product anymore. The harness is.

Matt Pocock's skills don't change what Claude can do. They change how disciplined it is. AI-Trader doesn't invent a new trading algorithm. It builds a platform where existing agents collaborate. AiToEarn doesn't create a new content AI. It builds distribution infrastructure around existing ones. DeepSeek-TUI takes the Claude Code interaction pattern and wraps it around a different, cheaper model.

Every one of these is the same insight applied to a different domain: wrap the right structure around a capable model, and you get something genuinely useful. The structure is where the value is.

This is what the industry is starting to call harness engineering, the practice of building the environment, constraints, and feedback loops around an AI agent so it produces reliable results. It's not prompt engineering. It's not fine-tuning. It's designing the system the model operates inside.

If you want to go deeper on this and see how to actually chain free tools into a working setup, I wrote a step-by-step breakdown of building a zero-cost AI coding stack (9router + agentmemory + agent-skills) in my newsletter: Web After AI.

What specialized AI agents are you seeing in your domain? Drop a comment. I'm collecting examples for a follow-up piece.

Forem: Shilpa Mitra

How Claude Code Achieves a 92% Cache Hit Rate: A Deep Dive Into Prompt Caching for AI Agents

The Two Parts of Every Prompt

What's Actually Being Cached: The Transformer Angle

The Economics

Tracking a Real Claude Code Session

The Production Numbers

The One Thing That Will Tank Your Cache Hit Rate

1. Don't change your tool set mid-session

2. Don't switch models mid-conversation

3. Don't edit the system prompt to update state

Applying This to Your Own Agents

Prompt structure matters

Use auto-caching

Compact without breaking the cache

Monitor your hit rate

Key Takeaways

The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team

First: What Hermes Agent Is

Level 1: The Main Agent

What to do at Level 1

When to move on

Level 2: Specialized Agents

Why separate instances?

How to do this practically

What to do at Level 2

When to move on

Level 3: Orchestrated Team

Example workflow

What to do at Level 3

When to move on

Level 4: Automated Team

What this looks like in practice

What to do at Level 4

The Part That Matters More Than Any of This

Resources

4 GitHub Repos That Prove AI Agents Aren't Just for Coding Anymore

The standout: /grill-me

Other skills worth knowing

Setup

How it compares to Addy Osmani's agent-skills

2. AI-Trader (13.7K stars) — Let AI Agents Trade for You

Why it's interesting

Setup

A word of caution

3. AiToEarn (12.2K stars) — AI Agent for Content Marketing Across 14 Platforms

The "All In Agent"

Setup

Note on documentation

4. DeepSeek-TUI (Trending) — Claude Code, but for DeepSeek

Three modes

Setup

The Pattern

The standout: `/grill-me`