<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mukunda Rao Katta</title>
    <description>The latest articles on Forem by Mukunda Rao Katta (@mukundakatta).</description>
    <link>https://forem.com/mukundakatta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3915555%2F5abcf328-9b6f-487e-a7ed-d61ebab7e3b1.jpeg</url>
      <title>Forem: Mukunda Rao Katta</title>
      <link>https://forem.com/mukundakatta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mukundakatta"/>
    <language>en</language>
    <item>
      <title>Six Reliability Primitives for LLM Agents</title>
      <dc:creator>Mukunda Rao Katta</dc:creator>
      <pubDate>Fri, 08 May 2026 08:31:49 +0000</pubDate>
      <link>https://forem.com/mukundakatta/six-reliability-primitives-for-llm-agents-m13</link>
      <guid>https://forem.com/mukundakatta/six-reliability-primitives-for-llm-agents-m13</guid>
      <description>&lt;p&gt;Reliability concerns for LLM agents are typically bundled into one heavy framework that asks you to adopt prompting, tool routing, and runtime governance as a single dependency. Production teams want them à la carte. They want small primitives they can drop in around existing tool calls without buying into a new programming model.&lt;/p&gt;

&lt;p&gt;That observation is the design centre of &lt;strong&gt;agent-stack&lt;/strong&gt;: six small, single-concern reliability libraries published independently to &lt;strong&gt;npm&lt;/strong&gt;, &lt;strong&gt;PyPI&lt;/strong&gt;, and the &lt;strong&gt;Model Context Protocol&lt;/strong&gt; registry. Each library is zero-dependency, under 500 lines of code, and addresses one specific failure mode that production agent teams have to handle.&lt;/p&gt;

&lt;p&gt;This post is a tour of the six primitives, the cross-cutting invariants they enforce, and the trade-offs of "composable by inclusion" instead of "composable by framework."&lt;/p&gt;

&lt;h2&gt;
  
  
  The six primitives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;What it does about it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentFit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context-window fitting&lt;/td&gt;
&lt;td&gt;Token-aware truncation. Pluggable tokenizers for OpenAI / Anthropic / open models.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentGuard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network egress allowlist&lt;/td&gt;
&lt;td&gt;Blocks "agent suddenly POSTs PHI/secrets to attacker.com" at the network layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentSnap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snapshot tests for tool calls&lt;/td&gt;
&lt;td&gt;Catches silent regressions when a model's tool-call shape changes after a deploy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentVet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool-arg validation&lt;/td&gt;
&lt;td&gt;Throws &lt;code&gt;ToolArgError&lt;/code&gt; with an LLM-friendly retry hint, so the next turn can self-correct.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentCast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured-output retry&lt;/td&gt;
&lt;td&gt;BYO-LLM JSON validator + retry loop for malformed responses during brown-outs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentBudget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Token + dollar caps&lt;/td&gt;
&lt;td&gt;Prevents runaway loops billing $1000 on a single user query.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
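&lt;p&gt;To make the AgentVet row concrete, here is a minimal sketch of the validate-then-retry-hint pattern in plain JavaScript. This illustrates the pattern, not the library's published API; the &lt;code&gt;vetArgs&lt;/code&gt; helper and the error fields are assumptions.&lt;/p&gt;

```javascript
// Sketch of the AgentVet idea: validate tool args before execution and,
// on failure, throw a typed error carrying a hint the LLM can act on.
// Illustrative only -- the real library's surface may differ.
class ToolArgError extends Error {
  constructor(message, retryHint) {
    super(message);
    this.name = 'ToolArgError';
    this.retryHint = retryHint; // fed back to the model on the next turn
  }
}

function vetArgs(schema, args) {
  for (const [key, check] of Object.entries(schema)) {
    if (!check(args[key])) {
      throw new ToolArgError(
        `invalid value for "${key}"`,
        `Call the tool again with a valid "${key}".`
      );
    }
  }
  return args;
}

// Usage: reject a malformed patient id before the tool ever runs.
const schema = {
  patient_id: (v) => typeof v === 'string' && /^[A-Za-z0-9\-.]{1,64}$/.test(v),
};
try {
  vetArgs(schema, { patient_id: '???' });
} catch (e) {
  console.log(e.name, '-', e.retryHint);
}
```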

&lt;h2&gt;
  
  
  Three runtimes per primitive
&lt;/h2&gt;

&lt;p&gt;Every library ships in three runtime forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# TypeScript (npm)&lt;/span&gt;
npm i @mukundakatta/agentvet @mukundakatta/agentguard @mukundakatta/agentbudget

&lt;span class="c"&gt;# Python (PyPI)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;agentvet agentguard agentbudget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(Claude&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Desktop&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;config)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"agentvet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@mukundakatta/agentvet-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"agentguard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@mukundakatta/agentguard-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python and TypeScript surfaces are 1:1 by design — a Python team and a TypeScript team can interoperate on the same primitives. The MCP variant lets a remote LLM use the primitive as a tool: ask AgentVet to validate a tool call before executing it, ask AgentGuard to check an outbound URL before sending the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composable by inclusion, not by framework
&lt;/h2&gt;

&lt;p&gt;Every primitive has the same shape: a single class or function as the public surface, a typed error carrying retry-friendly context, and an opt-in automatic adapter for popular provider response shapes. Nothing depends on anything else. You can use AgentBudget without AgentFit. You can use AgentVet's MCP variant without ever touching the npm package.&lt;/p&gt;
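&lt;p&gt;That common shape is easy to picture with a budget cap as the example: one class, one typed error, nothing else. A sketch of the idea behind AgentBudget; the class and field names here are illustrative assumptions, not the published API.&lt;/p&gt;

```javascript
// Minimal sketch of the "single class + typed error" shape, using a
// token + dollar cap as the example. Hypothetical API, not the real AgentBudget.
class BudgetExceededError extends Error {
  constructor(kind, used, cap) {
    super(`${kind} budget exceeded: ${used} > ${cap}`);
    this.name = 'BudgetExceededError';
    this.kind = kind;
  }
}

class Budget {
  constructor({ maxTokens, maxDollars }) {
    this.maxTokens = maxTokens;
    this.maxDollars = maxDollars;
    this.tokens = 0;
    this.dollars = 0;
  }
  // Record usage after each model call; throw once a cap is crossed.
  record({ tokens = 0, dollars = 0 }) {
    this.tokens += tokens;
    this.dollars += dollars;
    if (this.tokens > this.maxTokens) {
      throw new BudgetExceededError('token', this.tokens, this.maxTokens);
    }
    if (this.dollars > this.maxDollars) {
      throw new BudgetExceededError('dollar', this.dollars, this.maxDollars);
    }
  }
}

const budget = new Budget({ maxTokens: 10000, maxDollars: 1.5 });
budget.record({ tokens: 4000, dollars: 0.4 }); // under both caps
```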

&lt;p&gt;Compare with framework-style libraries that bundle reliability concerns: you get them all or none, you adopt their orchestration model, you migrate to their abstractions. agent-stack inverts that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters in practice
&lt;/h2&gt;

&lt;p&gt;Healthcare amplifies every reliability concern. A FHIR-querying agent has to be defensive about: only calling sanctioned endpoints, never leaking PHI in logs, abstaining when the right tool is not on the list, validating that a &lt;code&gt;patient_id&lt;/code&gt; looks like a real FHIR id before it hits the tool, never exceeding a budget on a single patient query, and producing structured outputs the downstream system can actually parse. Existing agent frameworks address one or two of those concerns. agent-stack gives you all six in libraries you adopt independently.&lt;/p&gt;
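&lt;p&gt;The sanctioned-endpoints half of that list fits in a dozen lines. A sketch in the spirit of AgentGuard, with a hypothetical allowlist and helper name:&lt;/p&gt;

```javascript
// Sketch of an egress allowlist: refuse any outbound request whose host
// is not explicitly sanctioned. The hosts and the checkEgress name are
// illustrative assumptions, not the library's real API.
const SANCTIONED_HOSTS = new Set([
  'fhir.example-hospital.org',
  'terminology.hl7.org',
]);

function checkEgress(url) {
  const host = new URL(url).hostname;
  if (!SANCTIONED_HOSTS.has(host)) {
    throw new Error(`egress blocked: ${host} is not on the allowlist`);
  }
  return url;
}

checkEgress('https://fhir.example-hospital.org/Patient/123'); // allowed
```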

&lt;p&gt;The same pattern applies to any production setting where the cost of an agent silently failing is higher than the cost of a slightly-too-defensive primitive: financial services, internal corporate tools, customer support automation, anything billable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Artifact paper
&lt;/h2&gt;

&lt;p&gt;The full design rationale, the cross-cutting invariants, and the operational questions that emerge when reliability is split across many small dependencies are documented in a peer-reviewable artifact paper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zenodo DOI:&lt;/strong&gt; &lt;a href="https://doi.org/10.5281/zenodo.20074702" rel="noopener noreferrer"&gt;10.5281/zenodo.20074702&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Hub:&lt;/strong&gt; &lt;a href="https://huggingface.co/mukunda1729/agent-stack" rel="noopener noreferrer"&gt;huggingface.co/mukunda1729/agent-stack&lt;/a&gt; (HF DOI &lt;a href="https://doi.org/10.57967/hf/8720" rel="noopener noreferrer"&gt;10.57967/hf/8720&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://github.com/MukundaKatta/agent-stack" rel="noopener noreferrer"&gt;github.com/MukundaKatta/agent-stack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paper is currently under review at the &lt;strong&gt;ASE 2026 Tools track&lt;/strong&gt;. Source for every library is archived in &lt;strong&gt;Software Heritage&lt;/strong&gt;. Every package has CI on GitHub Actions, snapshot tests, and a &lt;code&gt;CITATION.cff&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The natural next primitive is &lt;strong&gt;AgentTrace&lt;/strong&gt; for cost and latency telemetry. After that, a combined &lt;code&gt;@mukundakatta/agent-stack&lt;/code&gt; meta-package that imports all six with sensible defaults for production agents. The healthcare-aware AgentGuard preset (curated allowlist of FHIR endpoints, PHI redaction policies) is an obvious specialization.&lt;/p&gt;

&lt;p&gt;If you ship LLM agents in production and you want one of the primitives without buying into a framework, install only the library you need. Each one stands alone.&lt;/p&gt;

&lt;p&gt;Source on GitHub. Paper on Zenodo. DOI on HuggingFace. Six primitives, three runtimes, one unified surface for production reliability.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Choosing Gemma 4 for Local AI Apps: A Builder's Field Guide</title>
      <dc:creator>Mukunda Rao Katta</dc:creator>
      <pubDate>Thu, 07 May 2026 23:34:58 +0000</pubDate>
      <link>https://forem.com/mukundakatta/choosing-gemma-4-for-local-ai-apps-a-builders-field-guide-5105</link>
      <guid>https://forem.com/mukundakatta/choosing-gemma-4-for-local-ai-apps-a-builders-field-guide-5105</guid>
      <description>&lt;p&gt;This is my submission for the &lt;strong&gt;Write About Gemma 4&lt;/strong&gt; prompt in the DEV Gemma 4 Challenge.&lt;/p&gt;

&lt;p&gt;Gemma 4 is exciting because it gives builders a real model family to reason about, not just a single model name to drop into a README. The useful question is not "Can I call an AI model?" The useful question is "Which Gemma 4 model belongs at the center of this workflow, and why?"&lt;/p&gt;

&lt;p&gt;Here is the field guide I wish I had before building with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With the Job
&lt;/h2&gt;

&lt;p&gt;Before picking a model size, write down the job in plain language.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize long, messy project notes&lt;/li&gt;
&lt;li&gt;classify a short message on-device&lt;/li&gt;
&lt;li&gt;reason across several files&lt;/li&gt;
&lt;li&gt;read screenshots or mixed media&lt;/li&gt;
&lt;li&gt;generate a structured plan from incomplete context&lt;/li&gt;
&lt;li&gt;help a user make a decision without exposing private data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That first sentence matters. If the job is tiny, you probably do not need the biggest model. If the job involves long context, ambiguity, tradeoffs, and user trust, a larger reasoning model can be worth the extra cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  When I Would Reach for Gemma 4 31B
&lt;/h2&gt;

&lt;p&gt;I would start with Gemma 4 31B when the task needs richer reasoning.&lt;/p&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long notes or documents&lt;/li&gt;
&lt;li&gt;planning and critique&lt;/li&gt;
&lt;li&gt;multi-step analysis&lt;/li&gt;
&lt;li&gt;structured outputs with explanations&lt;/li&gt;
&lt;li&gt;product workflows where the model must surface assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is that 31B should be doing more than autocomplete. It should help the user see something more clearly: a risk, a tradeoff, a missing fact, or a next step.&lt;/p&gt;

&lt;p&gt;In my BriefBench project, I used 31B as the default reasoning path because the app asks Gemma 4 to turn messy context into a decision brief. The model has to preserve constraints, explain a model choice, identify risks, and return sections the UI can render.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Smaller Gemma 4 Models Make More Sense
&lt;/h2&gt;

&lt;p&gt;Smaller Gemma 4 models are not lesser choices. They are different product choices.&lt;/p&gt;

&lt;p&gt;I would look at smaller or edge-friendly Gemma 4 options when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency matters more than perfect prose&lt;/li&gt;
&lt;li&gt;the task is repeated many times&lt;/li&gt;
&lt;li&gt;the user is on modest hardware&lt;/li&gt;
&lt;li&gt;privacy requires local execution&lt;/li&gt;
&lt;li&gt;the output is narrow and easy to validate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tagging support tickets&lt;/li&gt;
&lt;li&gt;drafting short replies&lt;/li&gt;
&lt;li&gt;extracting fields from predictable text&lt;/li&gt;
&lt;li&gt;running an offline assistant on a small device&lt;/li&gt;
&lt;li&gt;doing first-pass filtering before a larger model reviews only the hard cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where local AI becomes more than a slogan. A smaller local model can be the right answer when the app should feel instant, private, and cheap to run.&lt;/p&gt;
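&lt;p&gt;The first-pass-filtering item above can be sketched with two stub classifiers standing in for real Gemma 4 calls; the threshold and the stub logic are placeholders:&lt;/p&gt;

```javascript
// Cascade sketch: a small local model handles the easy cases, and only
// low-confidence items escalate to the larger model. Both classify
// functions are stubs standing in for real Gemma 4 calls.
function smallModelClassify(text) {
  // Stand-in: pretend short messages are easy and classified confidently.
  return text.length < 40
    ? { label: 'routine', confidence: 0.95 }
    : { label: 'unknown', confidence: 0.3 };
}

function largeModelClassify(text) {
  // Stand-in for the larger reasoning model reviewing the hard cases.
  return { label: 'needs-review', confidence: 0.9 };
}

function classify(text, threshold = 0.8) {
  const first = smallModelClassify(text);
  return first.confidence >= threshold ? first : largeModelClassify(text);
}
```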

&lt;h2&gt;
  
  
  Design the UI Around Model Uncertainty
&lt;/h2&gt;

&lt;p&gt;A lot of AI apps make the same mistake: they treat model output like a final answer.&lt;/p&gt;

&lt;p&gt;For practical tools, I prefer interfaces that expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assumptions&lt;/li&gt;
&lt;li&gt;confidence boundaries&lt;/li&gt;
&lt;li&gt;missing inputs&lt;/li&gt;
&lt;li&gt;risks&lt;/li&gt;
&lt;li&gt;alternate options&lt;/li&gt;
&lt;li&gt;human review moments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the model more useful because the user can inspect the reasoning instead of simply trusting the final paragraph.&lt;/p&gt;

&lt;p&gt;Gemma 4 is especially useful in workflows where the output becomes structured UI. Instead of asking for one long answer, ask for JSON sections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modelRationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"buildPlan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assumptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That small shift changes the product. The model is no longer just writing. It is powering an interface.&lt;/p&gt;
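&lt;p&gt;If the UI is going to trust those sections, it helps to check the shape before rendering. A minimal sketch using the field names from the JSON above:&lt;/p&gt;

```javascript
// Validate the brief JSON before rendering it as UI sections.
// Returns the parsed object, or null if the shape is wrong.
function parseBrief(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON never reaches the renderer
  }
  const stringFields = ['summary', 'modelRationale'];
  const arrayFields = ['risks', 'buildPlan', 'assumptions'];
  if (!stringFields.every((f) => typeof data[f] === 'string')) return null;
  if (!arrayFields.every((f) => Array.isArray(data[f]))) return null;
  return data;
}
```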

&lt;h2&gt;
  
  
  Hosted Prototype, Local Production
&lt;/h2&gt;

&lt;p&gt;One practical pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;prototype with a hosted API&lt;/li&gt;
&lt;li&gt;learn the exact prompts and output schema&lt;/li&gt;
&lt;li&gt;move privacy-sensitive workflows toward local inference&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps early development fast while still respecting the reason many people care about Gemma: the ability to build useful AI without sending every sensitive note, image, or document to a remote service.&lt;/p&gt;

&lt;p&gt;The important thing is to be honest in the product. If the demo uses a hosted API, say so. If the production vision is local-first, explain which data should stay on-device and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Selection Checklist
&lt;/h2&gt;

&lt;p&gt;When choosing a Gemma 4 model, I would ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How long is the input?&lt;/li&gt;
&lt;li&gt;How complex is the reasoning?&lt;/li&gt;
&lt;li&gt;Does the user need the answer instantly?&lt;/li&gt;
&lt;li&gt;Can the output be automatically checked?&lt;/li&gt;
&lt;li&gt;Is the data sensitive?&lt;/li&gt;
&lt;li&gt;Will this run once, or thousands of times?&lt;/li&gt;
&lt;li&gt;Is the model generating content, acting as a decision aid, or controlling a workflow?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer involves long context, messy reasoning, and explanation, start larger.&lt;/p&gt;

&lt;p&gt;If the answer involves fast repeated classification, extraction, or local privacy, start smaller.&lt;/p&gt;

&lt;p&gt;If the task is safety-critical, design the product so Gemma 4 assists a human rather than silently deciding for one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Good Gemma 4 Project
&lt;/h2&gt;

&lt;p&gt;A strong Gemma 4 project should make the model choice visible.&lt;/p&gt;

&lt;p&gt;That means the README, demo, or article should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does Gemma 4 do that is central to the product?&lt;/li&gt;
&lt;li&gt;Which model did you choose?&lt;/li&gt;
&lt;li&gt;What tradeoff did you accept?&lt;/li&gt;
&lt;li&gt;What would you change for production?&lt;/li&gt;
&lt;li&gt;How does the user inspect or correct the output?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions are more interesting than a generic "AI-powered" label. They show that the builder understood the model as part of the system, not as decoration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;The best Gemma 4 apps will not all use the largest model. They will use the right model for the job.&lt;/p&gt;

&lt;p&gt;Sometimes that means 31B for deeper reasoning. Sometimes it means a smaller local model because privacy, latency, and cost are the real product requirements.&lt;/p&gt;

&lt;p&gt;That is what makes Gemma 4 fun to build with: it invites us to think like product engineers again.&lt;/p&gt;


</description>
      <category>gemma</category>
      <category>gemmachallenge</category>
      <category>devchallenge</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built BriefBench: A Gemma 4 Tool That Turns Messy Notes Into Model Decisions</title>
      <dc:creator>Mukunda Rao Katta</dc:creator>
      <pubDate>Thu, 07 May 2026 23:10:00 +0000</pubDate>
      <link>https://forem.com/mukundakatta/i-built-briefbench-a-gemma-4-tool-that-turns-messy-notes-into-model-decisions-3m5p</link>
      <guid>https://forem.com/mukundakatta/i-built-briefbench-a-gemma-4-tool-that-turns-messy-notes-into-model-decisions-3m5p</guid>
      <description>&lt;p&gt;This is my submission for the &lt;strong&gt;Build With Gemma 4&lt;/strong&gt; prompt in the DEV Gemma 4 Challenge.&lt;/p&gt;

&lt;p&gt;BriefBench is a small local-first web app that helps builders turn rough project notes into a structured decision brief. The goal is not to hide Gemma 4 behind a generic chat interface. The goal is to make the model's work visible: read context, explain the model choice, surface risks, propose a build plan, and suggest a story angle for a technical post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;You paste messy notes about a project: users, constraints, privacy concerns, available hardware, and what you are trying to decide. BriefBench asks Gemma 4 to return strict JSON with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summary&lt;/li&gt;
&lt;li&gt;model rationale&lt;/li&gt;
&lt;li&gt;risks&lt;/li&gt;
&lt;li&gt;build plan&lt;/li&gt;
&lt;li&gt;user experience notes&lt;/li&gt;
&lt;li&gt;DEV post angle&lt;/li&gt;
&lt;li&gt;assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The UI then renders those sections as a scannable brief and lets you copy the result as Markdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Try the app here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gemma-4-briefbench.vercel.app" rel="noopener noreferrer"&gt;https://gemma-4-briefbench.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The public demo runs in demo mode by default so anyone can test the full workflow immediately. If you run it locally with a Gemma 4 API key, the same interface can call Gemma 4 through Google AI Studio or OpenRouter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/MukundaKatta/gemma-4-briefbench" rel="noopener noreferrer"&gt;https://github.com/MukundaKatta/gemma-4-briefbench&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;The challenge page asks builders to make a clear case for model selection, so I designed the app around that idea.&lt;/p&gt;

&lt;p&gt;Gemma 4 is a good fit because the work is context-heavy but not just raw generation. The model has to read incomplete notes, preserve constraints, reason about tradeoffs, and produce structured output that a user can inspect.&lt;/p&gt;

&lt;p&gt;For the prototype, I defaulted to &lt;code&gt;gemma-4-31b-it&lt;/code&gt; because the 31B dense model is the best fit for richer reasoning over messy planning notes. For production edge scenarios, the app also talks about when Gemma 4 E2B or E4B would be more appropriate, especially for privacy-sensitive workflows running on phones, small devices, or a Raspberry Pi.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The project is intentionally simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a no-framework Node server&lt;/li&gt;
&lt;li&gt;a static HTML/CSS/JS frontend&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;/api/analyze&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;provider options for demo mode, Google AI Studio, and OpenRouter&lt;/li&gt;
&lt;li&gt;API keys stay on the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt asks Gemma 4 to behave as a local-first project brief analyzer and return strict JSON. The frontend turns that JSON into sections instead of displaying a wall of text.&lt;/p&gt;

&lt;p&gt;Demo mode exists so anyone can review the app immediately, even without an API key. With a key configured, the same interface can call Gemma 4 through Google AI Studio or OpenRouter.&lt;/p&gt;
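&lt;p&gt;A sketch of how that provider switch might look; &lt;code&gt;GEMMA_PROVIDER&lt;/code&gt; and the three provider names come from the app's setup, while the helper itself is illustrative:&lt;/p&gt;

```javascript
// Pick the provider named in the environment, defaulting to demo mode so
// the app always starts without a key. The real app would dispatch to
// Google AI Studio or OpenRouter from here.
function chooseProvider(env = process.env) {
  const name = env.GEMMA_PROVIDER || 'demo';
  const known = ['demo', 'google', 'openrouter'];
  if (!known.includes(name)) {
    throw new Error(`unknown GEMMA_PROVIDER: ${name}`);
  }
  return name;
}
```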

&lt;h2&gt;
  
  
  What Gemma 4 Unlocks
&lt;/h2&gt;

&lt;p&gt;The useful part is not that the app can summarize notes. The useful part is that it makes the decision process visible.&lt;/p&gt;

&lt;p&gt;For hackathon builders, this matters. It is easy to say "I used an AI model." It is harder, and more valuable, to explain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why this model was chosen&lt;/li&gt;
&lt;li&gt;what constraints shaped that choice&lt;/li&gt;
&lt;li&gt;what risks remain&lt;/li&gt;
&lt;li&gt;what the user should do next&lt;/li&gt;
&lt;li&gt;how the model output becomes product UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BriefBench turns that reasoning into the core product experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Add Next
&lt;/h2&gt;

&lt;p&gt;The next version would add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local Ollama or llama.cpp support for fully offline Gemma 4 runs&lt;/li&gt;
&lt;li&gt;file upload for Markdown, CSV, and PDF notes&lt;/li&gt;
&lt;li&gt;side-by-side model comparison between 31B, 26B MoE, and smaller edge models&lt;/li&gt;
&lt;li&gt;saved briefs for teams evaluating multiple AI project ideas&lt;/li&gt;
&lt;li&gt;export templates for DEV posts, README files, and product specs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:4174
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The app runs in demo mode by default. To use Gemma 4 through an API, set &lt;code&gt;GEMMA_PROVIDER&lt;/code&gt; and the matching API key in &lt;code&gt;.env&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;The best local AI tools should make users feel more capable, not more dependent. BriefBench uses Gemma 4 as a thinking partner for the unglamorous but important middle step: turning rough context into a decision someone can actually act on.&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>gemmachallenge</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>I Got Burned by Prompt Injection in Production. Here Are 2 Tiny npm Libs That Stopped It.</title>
      <dc:creator>Mukunda Rao Katta</dc:creator>
      <pubDate>Wed, 06 May 2026 19:18:22 +0000</pubDate>
      <link>https://forem.com/mukundakatta/i-got-burned-by-prompt-injection-in-production-here-are-2-tiny-npm-libs-that-stopped-it-438i</link>
      <guid>https://forem.com/mukundakatta/i-got-burned-by-prompt-injection-in-production-here-are-2-tiny-npm-libs-that-stopped-it-438i</guid>
      <description>&lt;p&gt;A user pasted a help article into our agent. Three minutes later the agent silently rewrote a customer email, leaked an internal URL, and tried to fetch a &lt;code&gt;.zip&lt;/code&gt; from a domain none of us had ever seen.&lt;/p&gt;

&lt;p&gt;Nothing in the LLM was wrong. The problem was upstream. Retrieved text walked into the prompt with no inspection, and the agent treated it as gospel.&lt;/p&gt;

&lt;p&gt;I wrote up the lessons as a short preprint. The two npm libs below are the working code behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two libs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;@mukundakatta/prompt-injection-shield&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A small-rule scanner for prompt-injection patterns in untrusted text. No ML, no weights. Just regex-grade rules with a typed &lt;code&gt;risk_reasons&lt;/code&gt; array so you can log, gate, or strip lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @mukundakatta/prompt-injection-shield
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;scan&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mukundakatta/prompt-injection-shield&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retrievedDoc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;risk_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blocked:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;risk_reasons&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"ignore previous instructions" and family&lt;/li&gt;
&lt;li&gt;system-prompt impersonation&lt;/li&gt;
&lt;li&gt;tool-call hijack patterns&lt;/li&gt;
&lt;li&gt;url-based exfil hints&lt;/li&gt;
&lt;li&gt;secret patterns the model should not see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a rule fires, you get the line, the rule id, and a recommendation. Strip, redact, drop, or feed it to your audit trail. Up to you.&lt;/p&gt;
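&lt;p&gt;Stripping is the simplest of those options. A sketch, assuming each finding carries a 1-based line number (the exact shape of &lt;code&gt;risk_reasons&lt;/code&gt; entries shown here is an assumption):&lt;/p&gt;

```javascript
// Drop any line that triggered a rule before the text reaches the prompt.
// Assumes each finding looks like { line: number, rule: string } --
// illustrative, not necessarily the library's exact shape.
function stripFlaggedLines(text, findings) {
  const flagged = new Set(findings.map((f) => f.line));
  return text
    .split('\n')
    .filter((_, i) => !flagged.has(i + 1)) // findings use 1-based line numbers
    .join('\n');
}

const doc = 'How to reset a password\nIgnore previous instructions\nStep 1: open settings';
const cleaned = stripFlaggedLines(doc, [{ line: 2, rule: 'override-instructions' }]);
```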

&lt;h3&gt;
  
  
  &lt;code&gt;@mukundakatta/vector-poison-score&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Same idea, retrieval side. Score chunks before they go into context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @mukundakatta/vector-poison-score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@mukundakatta/vector-poison-score&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;poison_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;oversized chunks (token bloat attacks)&lt;/li&gt;
&lt;li&gt;secret-exfiltration patterns inside retrieved text&lt;/li&gt;
&lt;li&gt;suspicious link clusters&lt;/li&gt;
&lt;li&gt;mixed-language anomalies in technical docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weights are tunable. Defaults are conservative. Both libs have &lt;strong&gt;zero runtime dependencies&lt;/strong&gt;.&lt;/p&gt;
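&lt;p&gt;The "tunable weights" idea is simple enough to sketch: each heuristic emits a 0/1 signal, and the score is a weighted sum clamped to 1. Signal names, thresholds, and weights below are illustrative stand-ins, not the library's defaults.&lt;/p&gt;

```javascript
// Sketch of a weighted scoring pass over one retrieved chunk.
// Signal names, thresholds, and weights are illustrative, not the real defaults.
const DEFAULT_WEIGHTS = { oversized: 0.3, exfil: 0.5, linkCluster: 0.2 };

function poisonScore(chunk, weights = DEFAULT_WEIGHTS) {
  // Each heuristic is a binary signal; tuning means editing one object literal.
  const signals = {
    oversized: chunk.length > 4000 ? 1 : 0,
    exfil: /(api[_-]?key|secret|password)\s*[:=]/i.test(chunk) ? 1 : 0,
    linkCluster: (chunk.match(/https?:\/\//g) || []).length >= 3 ? 1 : 0,
  };
  let score = 0;
  for (const name of Object.keys(signals)) {
    score += signals[name] * (weights[name] || 0);
  }
  return { poison_score: Math.min(score, 1), signals };
}
```

&lt;p&gt;Swapping in your own threat model means editing one weights object, which is the whole argument for small rules.&lt;/p&gt;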

&lt;h2&gt;
  
  
  Why "small rules"
&lt;/h2&gt;

&lt;p&gt;Big ML defenses are expensive, opaque, and hard to audit when something slips. Small rules are the opposite. You can read them. You can grep them. You can fork the file when your threat model is different from mine.&lt;/p&gt;

&lt;p&gt;Same logic as a linter. Not perfect. Not sexy. Catches a huge chunk of the dumb stuff before the model has to think about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where they sit in the pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retrieval -&amp;gt; [vector-poison-score] -&amp;gt; reranker
                                      |
                                      v
              tool output -&amp;gt; [prompt-injection-shield] -&amp;gt; prompt
                                                          |
                                                          v
                                                         LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two checkpoints. Cheap. Easy to disable per request. No effect on latency above the noise floor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The preprint
&lt;/h2&gt;

&lt;p&gt;Full writeup with threat model, rule design, and limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zenodo DOI: &lt;a href="https://doi.org/10.5281/zenodo.20057056" rel="noopener noreferrer"&gt;10.5281/zenodo.20057056&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Figshare DOI: &lt;a href="https://doi.org/10.6084/m9.figshare.32193543" rel="noopener noreferrer"&gt;10.6084/m9.figshare.32193543&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub bundle: &lt;a href="https://github.com/MukundaKatta/rag-guardrails-paper" rel="noopener noreferrer"&gt;MukundaKatta/rag-guardrails-paper&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License is CC BY 4.0 on the paper, MIT on the code. Both libs are tiny. Both are forkable in five minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;Not a replacement for a full security review. Not a benchmark claim. Not a model. The whole thesis is that an inspectable, boring baseline between retrieval and prompt construction is worth more than nothing, and most teams ship with nothing.&lt;/p&gt;

&lt;p&gt;If you build agentic RAG, drop these in front of your prompt. Then run a real audit later.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>rag</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I Built 5 Tiny Libraries to Stop My AI Agents from Misbehaving in Production</title>
      <dc:creator>Mukunda Rao Katta</dc:creator>
      <pubDate>Wed, 06 May 2026 08:28:15 +0000</pubDate>
      <link>https://forem.com/mukundakatta/i-built-5-tiny-libraries-to-stop-my-ai-agents-from-misbehaving-in-production-3oni</link>
      <guid>https://forem.com/mukundakatta/i-built-5-tiny-libraries-to-stop-my-ai-agents-from-misbehaving-in-production-3oni</guid>
      <description>&lt;p&gt;I've been building production AI agents for a while now. RAG pipelines, agentic workflows, multi-model routers. I kept hitting the same five problems over and over. Not architectural problems. Plumbing problems.&lt;/p&gt;

&lt;p&gt;The kind that don't show up in tutorials but wreck you in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent stuffs too much into the context window and the API errors out&lt;/li&gt;
&lt;li&gt;A tool fires an HTTP request to some random domain it hallucinated&lt;/li&gt;
&lt;li&gt;You change a prompt and have no idea if the agent's behavior changed&lt;/li&gt;
&lt;li&gt;The LLM passes malformed args to a tool and it blows up downstream&lt;/li&gt;
&lt;li&gt;The model returns &lt;code&gt;{ "result": "here is the JSON: {...}" }&lt;/code&gt; instead of clean JSON and your parser chokes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I fixed each of these once, copy-pasted the fix into three other projects, and then got tired of that. So I extracted them into five small libraries. Zero dependencies each. TypeScript-first. Under 300 lines per package.&lt;/p&gt;

&lt;p&gt;Here's the stack, in the order you'd use it at runtime:&lt;/p&gt;




&lt;h2&gt;
  
  
  1. agentfit - fit messages to the context window
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your conversation history grows until the API throws a &lt;code&gt;context_length_exceeded&lt;/code&gt; error at the worst possible moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Token-aware truncation before the API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fitMessages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mukundakatta/agentfit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fitMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;drop-middle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// keeps system + recent, drops stale middle&lt;/span&gt;
  &lt;span class="na"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cl100k&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strategies: &lt;code&gt;drop-oldest&lt;/code&gt;, &lt;code&gt;drop-middle&lt;/code&gt;, &lt;code&gt;summarize-oldest&lt;/code&gt;. Pluggable tokenizer if you're not on OpenAI's encoding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mukundakatta/agentfit" rel="noopener noreferrer"&gt;npm&lt;/a&gt; · &lt;a href="https://github.com/MukundaKatta/agentfit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
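&lt;p&gt;If you're curious what a &lt;code&gt;drop-middle&lt;/code&gt; pass looks like under the hood, here's a rough sketch. It uses a crude characters-divided-by-four token estimate in place of a real tokenizer, so treat it as an illustration of the strategy, not the package's implementation.&lt;/p&gt;

```javascript
// Sketch of a drop-middle truncation pass. The chars/4 token estimate is a
// crude stand-in for a real tokenizer (the package's tokenizer is pluggable).
const estimateTokens = (msg) => Math.ceil(msg.content.length / 4);

function dropMiddle(history, maxTokens) {
  const kept = history.slice();
  let total = kept.reduce((sum, m) => sum + estimateTokens(m), 0);
  // Drop from the middle outward: the system message (front) and the most
  // recent turns (back) survive, stale mid-conversation turns go first.
  while (total > maxTokens) {
    if (kept.length > 2) {
      const middle = Math.floor(kept.length / 2);
      total -= estimateTokens(kept[middle]);
      kept.splice(middle, 1);
    } else {
      break; // nothing droppable left
    }
  }
  return kept;
}
```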




&lt;h2&gt;
  
  
  2. agentguard - network egress firewall
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your agent calls a tool that makes an HTTP request. The LLM hallucinates a URL. Your server pings something it shouldn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Declare an allowlist. Throw on anything outside it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createGuard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mukundakatta/agentguard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;api.openai.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;your-internal-api.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;s3.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// wrap your fetch&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;safeFetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// now any attempt to hit an unlisted domain throws AgentGuardError&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;safeFetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://hallucinated-domain.xyz/exfiltrate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// throws&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful in CI too - run your agent tests with a strict allowlist and catch unexpected egress before it hits prod.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mukundakatta/agentguard" rel="noopener noreferrer"&gt;npm&lt;/a&gt; · &lt;a href="https://github.com/MukundaKatta/agentguard" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
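&lt;p&gt;The wrapper pattern itself is small enough to sketch. This version does exact hostname matching against a &lt;code&gt;Set&lt;/code&gt; and throws on everything else; it illustrates the idea, not the package's code (the error class name here is made up, and a real guard would also handle subdomains).&lt;/p&gt;

```javascript
// Sketch of an egress allowlist wrapper around any fetch-like function.
// EgressError is a hypothetical name; exact-hostname matching only.
class EgressError extends Error {}

function wrapFetch(fetchImpl, allow) {
  const allowed = new Set(allow);
  return (url, opts) => {
    const host = new URL(url).hostname;
    if (!allowed.has(host)) {
      // Refuse before any network activity happens.
      throw new EgressError(`blocked egress to ${host}`);
    }
    return fetchImpl(url, opts);
  };
}
```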




&lt;h2&gt;
  
  
  3. agentsnap - snapshot tests for tool-call traces
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; You tweak a prompt and have no idea if it changed the agent's tool-call behavior. No diff, no signal, just vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Snapshot the tool-call trace, not just the final output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;snap&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mukundakatta/agentsnap&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;research agent calls search before summarize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runMyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarize recent AI papers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;snap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;research-agent-trace&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// First run: writes the snapshot.&lt;/span&gt;
  &lt;span class="c1"&gt;// Subsequent runs: diffs against it. Fails if tool calls changed.&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snapshots store the ordered sequence of tool names, arg shapes, and return types, not the raw values (which are flaky). Works with any agent runner: LangGraph, LlamaIndex, raw OpenAI function calls, whatever.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mukundakatta/agentsnap" rel="noopener noreferrer"&gt;npm&lt;/a&gt; · &lt;a href="https://github.com/MukundaKatta/agentsnap" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
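&lt;p&gt;The key design decision, snapshotting shapes instead of values, can be sketched like this. It illustrates only the normalization step, not the package's actual snapshot format.&lt;/p&gt;

```javascript
// Sketch of trace normalization: keep tool names and argument *shapes*
// (field name to type), discard the flaky raw values before snapshotting.
function normalizeTrace(trace) {
  return trace.map((call) => ({
    tool: call.tool,
    argShape: Object.fromEntries(
      Object.entries(call.args).map(([key, value]) => [key, typeof value])
    ),
  }));
}
```

&lt;p&gt;Two runs with different concrete values normalize identically, so the snapshot diff only fails when the call sequence or the argument shapes actually change.&lt;/p&gt;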




&lt;h2&gt;
  
  
  4. agentvet - validate tool args before execution
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; The LLM calls &lt;code&gt;send_email&lt;/code&gt; but forgets the &lt;code&gt;subject&lt;/code&gt; field. Your handler throws a cryptic error. The agent retries blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Validate before execution. Return an LLM-friendly error message so it can self-correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;vetTool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mukundakatta/agentvet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendEmail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vetTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send_email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;mailer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// If the LLM omits `subject`, the call returns:&lt;/span&gt;
&lt;span class="c1"&gt;// "send_email rejected your args: missing required field: subject.&lt;/span&gt;
&lt;span class="c1"&gt;//  Please call again with the corrected arguments."&lt;/span&gt;
&lt;span class="c1"&gt;// - ready to feed straight back into the next turn.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mukundakatta/agentvet" rel="noopener noreferrer"&gt;npm&lt;/a&gt; · &lt;a href="https://github.com/MukundaKatta/agentvet" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
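&lt;p&gt;The validation step is essentially a schema walk that collects human-readable problems instead of throwing on the first one. Here's an illustrative sketch; the message wording is an assumption, not the package's exact output.&lt;/p&gt;

```javascript
// Sketch of pre-execution arg validation that produces an LLM-readable
// message (instead of a cryptic downstream throw) so the model can retry.
function vetArgs(toolName, schema, args) {
  const problems = [];
  for (const [field, spec] of Object.entries(schema)) {
    if (spec.required) {
      if (args[field] === undefined) {
        problems.push(`missing required field: ${field}`);
        continue;
      }
    }
    if (args[field] !== undefined) {
      if (typeof args[field] !== spec.type) {
        problems.push(`field ${field} should be ${spec.type}`);
      }
    }
  }
  if (problems.length === 0) return { ok: true };
  return {
    ok: false,
    message: `${toolName} rejected your args: ${problems.join("; ")}. ` +
      "Please call again with the corrected arguments.",
  };
}
```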




&lt;h2&gt;
  
  
  5. agentcast - structured output enforcer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; You ask for JSON. You get &lt;code&gt;Sure! Here's the JSON you asked for: { ... }&lt;/code&gt;. Or worse, truncated JSON. Your parser throws. The agent moves on as if nothing happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Validate-and-retry loop with schema enforcement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cast&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mukundakatta/agentcast&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Extract the company name and founding year from this text: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;founded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// result is guaranteed to match the schema or throw after maxRetries&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;founded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a bad response it builds a corrective prompt automatically and retries. BYO LLM client and BYO validator. The library is the loop, not the dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.npmjs.com/package/@mukundakatta/agentcast" rel="noopener noreferrer"&gt;npm&lt;/a&gt; · &lt;a href="https://github.com/MukundaKatta/agentcast" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;
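&lt;p&gt;The validate-and-retry loop at the core fits in a few lines. This sketch only checks top-level field types and doesn't try to salvage JSON embedded in prose, so it's a simplified illustration of the loop, not the package's parser.&lt;/p&gt;

```javascript
// Sketch of the validate-and-retry loop: parse, check against the schema,
// and on failure feed a corrective message back through the caller's llm fn.
async function castLoop(prompt, schema, llm, maxRetries = 3) {
  const messages = [{ role: "user", content: prompt }];
  for (let attempt = 0; attempt !== maxRetries; attempt++) {
    const raw = await llm(messages);
    try {
      const parsed = JSON.parse(raw);
      const bad = Object.keys(schema).filter(
        (key) => typeof parsed[key] !== schema[key].type
      );
      if (bad.length === 0) return parsed; // schema satisfied
      throw new Error(`wrong types for: ${bad.join(", ")}`);
    } catch (err) {
      // Build the corrective turn and go around again.
      messages.push({ role: "assistant", content: raw });
      messages.push({
        role: "user",
        content: `That was not valid JSON matching the schema (${err.message}). ` +
          "Reply with ONLY the corrected JSON object.",
      });
    }
  }
  throw new Error("castLoop: no valid response after retries");
}
```

&lt;p&gt;Because &lt;code&gt;llm&lt;/code&gt; is just an async function you pass in, the loop is trivially testable with a canned-reply fake.&lt;/p&gt;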




&lt;h2&gt;
  
  
  How they fit together
&lt;/h2&gt;

&lt;p&gt;At runtime, the order makes sense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fitMessages   →  trim history before the API call
    ↓
agentguard    →  wrap fetch so tools can't call arbitrary URLs
    ↓
agentvet      →  validate tool args before the handler runs
    ↓
agentcast     →  enforce structured output after the LLM responds
    ↓
agentsnap     →  in tests, snapshot the trace to catch regressions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You don't have to use all five. Each is a drop-in. I use all five in my production pipelines and two or three in smaller scripts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why not one big framework?
&lt;/h2&gt;

&lt;p&gt;Because one big framework becomes the thing you fight. Each of these solves exactly one problem, has no opinions about the rest of your stack, and disappears when you don't need it.&lt;/p&gt;

&lt;p&gt;All five are on npm under &lt;code&gt;@mukundakatta/&lt;/code&gt;. MIT licensed. PRs open.&lt;/p&gt;

&lt;p&gt;If you're building agents in production and hitting these same walls, or different ones entirely, I'd like to hear about it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
