<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhishek</title>
    <description>The latest articles on Forem by Abhishek (@mrcssdev).</description>
    <link>https://forem.com/mrcssdev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3289582%2F39b6e94f-19a3-4eaa-957d-24372e25917c.jpg</url>
      <title>Forem: Abhishek</title>
      <link>https://forem.com/mrcssdev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mrcssdev"/>
    <language>en</language>
    <item>
      <title>Why AI Agents Cost More Than LLMs (And How to Stop Bleeding Tokens)</title>
      <dc:creator>Abhishek</dc:creator>
      <pubDate>Mon, 11 May 2026 07:23:58 +0000</pubDate>
      <link>https://forem.com/mrcssdev/why-ai-agents-cost-more-than-llms-and-how-to-stop-bleeding-tokens-4e4g</link>
      <guid>https://forem.com/mrcssdev/why-ai-agents-cost-more-than-llms-and-how-to-stop-bleeding-tokens-4e4g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4trtoxgrfk7lh19hm5h7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4trtoxgrfk7lh19hm5h7.png" alt="AI Agents vs LLM pricing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was building a small bookmark app last weekend. You send it a URL, Gemini&lt;br&gt;
summarizes and tags the page, the result goes into Postgres. A few hundred lines&lt;br&gt;
of TypeScript.&lt;/p&gt;

&lt;p&gt;The first version cost almost nothing. One LLM call per URL, that's it. Then I&lt;br&gt;
added "tools" so the model could fetch pages, look up similar bookmarks, or&lt;br&gt;
check things against Google Search.&lt;/p&gt;

&lt;p&gt;My token bill quadrupled.&lt;/p&gt;

&lt;p&gt;That's where most people building agents land for the first time. Going from a&lt;br&gt;
plain chat call to an agent loop is way more expensive than the docs make it sound,&lt;br&gt;
and the reason isn't obvious until you watch the round trips happen one by one.&lt;br&gt;
Let's do that.&lt;/p&gt;
&lt;h2&gt;
  
  
  What a plain LLM call costs
&lt;/h2&gt;

&lt;p&gt;Here's the simplest LLM call in TypeScript with &lt;code&gt;@google/genai&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;GoogleGenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google/genai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Summarize this article: ...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One request out, one response back. You pay for two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens&lt;/strong&gt; for your prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt; for the model's reply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Two numbers on your bill. If your prompt is 500 tokens and the answer&lt;br&gt;
is 200, you pay for 700 tokens. Done.&lt;/p&gt;
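
&lt;p&gt;If you want to watch those two numbers per request instead of waiting for the bill, the response reports them in its usage metadata. A minimal sketch using the &lt;code&gt;res&lt;/code&gt; object from the snippet above; the field names assume the &lt;code&gt;usageMetadata&lt;/code&gt; shape that &lt;code&gt;@google/genai&lt;/code&gt; returns, so double-check them against your SDK version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Token counts reported on the same `res` object from the call above.
// Field names are assumed from the usageMetadata shape in @google/genai.
const usage = res.usageMetadata;
console.log('input tokens: ', usage?.promptTokenCount);
console.log('output tokens:', usage?.candidatesTokenCount);
console.log('total tokens: ', usage?.totalTokenCount);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
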
&lt;h2&gt;
  
  
  Now add a single tool
&lt;/h2&gt;

&lt;p&gt;Tools are how the model talks to the outside world. Calling an API, querying a&lt;br&gt;
database, fetching a URL, anything. You describe each tool with a small JSON&lt;br&gt;
schema, and the model can ask to "call" one mid-conversation. You actually run&lt;br&gt;
the function, send the result back, and the model writes its final answer using&lt;br&gt;
that result.&lt;/p&gt;

&lt;p&gt;The basic version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;GoogleGenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google/genai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;functionDeclarations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getWeather&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Get the weather of any city&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;What is the weather in Tokyo?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → undefined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;undefined&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;The model didn't answer. It returned a structured request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;functionCalls&lt;/span&gt;
&lt;span class="c1"&gt;// → [{ name: 'getWeather', args: { location: 'Tokyo' } }]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that surprises people. The model got asked a question, and&lt;br&gt;
instead of answering, it asked &lt;strong&gt;you&lt;/strong&gt; to run a function. So you do that and&lt;br&gt;
ship the result back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Tokyo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// { temperature: 23, condition: 'sunny' }&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;What is the weather in Tokyo?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;functionCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getWeather&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Tokyo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;functionResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getWeather&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;second&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → "It's 23°C and sunny in Tokyo."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two LLM calls. One question. That's the agent tax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we can't just do it in one call
&lt;/h2&gt;

&lt;p&gt;The first reaction (mine too): why can't the model just answer in one shot?&lt;/p&gt;

&lt;p&gt;The reason is simple. The model can't predict what the tool will return. The&lt;br&gt;
temperature in Tokyo isn't in its training data, the API hasn't been hit yet,&lt;br&gt;
the result doesn't exist. You can't write "It's 23°C in Tokyo" before you know&lt;br&gt;
it's 23°C.&lt;/p&gt;

&lt;p&gt;So turn 1 is "decide what to do." Turn 2 is "use what you learned." They can't&lt;br&gt;
be merged. The model has no memory between calls.&lt;/p&gt;

&lt;p&gt;One exception is worth knowing about: server-side tools. Things like&lt;br&gt;
&lt;code&gt;googleSearch&lt;/code&gt; or &lt;code&gt;urlContext&lt;/code&gt; in Gemini run inside Google's own servers, and&lt;br&gt;
the API returns one merged response. From your side it looks like a single call.&lt;br&gt;
You lose some control (you can't see exactly what got searched), but you save&lt;br&gt;
a round trip.&lt;/p&gt;
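
&lt;p&gt;For completeness, here's what opting into one of those server-side tools looks like. A minimal sketch assuming Gemini's built-in &lt;code&gt;googleSearch&lt;/code&gt; tool; the grounding work happens on Google's side, so from your end it's still one billed request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Server-side grounding: Google runs the search, you get one merged response.
// Sketch only; check the current docs for which models support googleSearch.
const grounded = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: 'What is the weather in Tokyo right now?',
  config: {
    tools: [{ googleSearch: {} }],
  },
});

console.log(grounded.text);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
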
&lt;h2&gt;
  
  
  Counting the actual tokens
&lt;/h2&gt;

&lt;p&gt;Here's where the cost lives. Look at what turn 2 has to send compared to turn 1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Turn 1 in&lt;/th&gt;
&lt;th&gt;Turn 1 out&lt;/th&gt;
&lt;th&gt;Turn 2 in&lt;/th&gt;
&lt;th&gt;Turn 2 out&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes, billed again&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool schemas&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes, billed again&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User question&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes, billed again&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model's tool call&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes, as input&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your tool result&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final answer&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your system prompt and tool definitions get sent to the API &lt;strong&gt;twice&lt;/strong&gt;. Turn 1&lt;br&gt;
doesn't free you from re-sending everything in turn 2, because the model is&lt;br&gt;
stateless. It forgets the whole conversation between calls.&lt;/p&gt;

&lt;p&gt;Real numbers from my bookmark agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompt: ~200 tokens&lt;/li&gt;
&lt;li&gt;4 tool declarations: ~400 tokens&lt;/li&gt;
&lt;li&gt;User question: ~50 tokens&lt;/li&gt;
&lt;li&gt;Tool result (a few rows from Postgres): ~300 tokens
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Plain LLM call:    ~650 in  +  ~200 out  =  ~850 tokens
One-tool agent:   ~1300 in  +  ~230 out  = ~1530 tokens (about 1.8x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And that's the best case. Exactly one tool call, no follow-ups. Real agents are&lt;br&gt;
worse. A lot worse.&lt;/p&gt;
&lt;h2&gt;
  
  
  Real agents grow quadratically
&lt;/h2&gt;

&lt;p&gt;The bookmark agent does three things on a new URL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the page (&lt;code&gt;fetchUrl&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;Look for similar existing bookmarks in the DB (&lt;code&gt;searchSimilar&lt;/code&gt; tool)&lt;/li&gt;
&lt;li&gt;Pick a category from the user's existing taxonomy (&lt;code&gt;getTaxonomy&lt;/code&gt; tool)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's 4 LLM turns total. Ask, get tool calls, send back results, ask again,&lt;br&gt;
get more calls, send results, finally write the summary.&lt;/p&gt;
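
&lt;p&gt;In code, those turns are just a loop: call the model, run whatever tools it asks for, append the results, and call again until it stops asking. A minimal sketch of that loop, assuming a &lt;code&gt;dispatchers&lt;/code&gt; object that maps each tool name to its implementation (the same idea as the dispatch snippet further down) and a hard cap on turns so a confused model can't burn tokens forever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal agent loop. `dispatchers` (assumed) maps tool names to functions.
// The history grows every turn, and every turn re-sends all of it.
async function runAgent(userText: string, maxTurns = 6) {
  const history: any[] = [{ role: 'user', parts: [{ text: userText }] }];

  for (let turn = 0; turn &amp;lt; maxTurns; turn++) {
    const res = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents: history,
      config: { tools },
    });

    const calls = res.functionCalls ?? [];
    if (calls.length === 0) return res.text;  // no tool requests left: final answer

    // Replay the model's tool-call turn, then attach one result per call,
    // using the same manual shape as the weather example above.
    const modelParts: any[] = [];
    const resultParts: any[] = [];
    for (const call of calls) {
      modelParts.push({ functionCall: { name: call.name, args: call.args } });
      resultParts.push({
        functionResponse: { name: call.name, response: await dispatchers[call.name!](call.args) },
      });
    }
    history.push({ role: 'model', parts: modelParts });
    history.push({ role: 'user', parts: resultParts });
  }
  throw new Error('Agent did not finish within maxTurns');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
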

&lt;p&gt;What the cumulative input size looks like each turn:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;What gets sent&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;system + schemas + URL&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+ previous calls + &lt;code&gt;fetchUrl&lt;/code&gt; result (~1500 tokens of page text)&lt;/td&gt;
&lt;td&gt;2200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;+ &lt;code&gt;searchSimilar&lt;/code&gt; result&lt;/td&gt;
&lt;td&gt;2400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;+ &lt;code&gt;getTaxonomy&lt;/code&gt; result&lt;/td&gt;
&lt;td&gt;2600&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total input across all turns: about &lt;strong&gt;7900 tokens&lt;/strong&gt; to summarize one webpage.&lt;/p&gt;

&lt;p&gt;For comparison, a plain &lt;code&gt;generateContent({ contents: "summarize this:\n" + pageText })&lt;/code&gt;&lt;br&gt;
costs ~1500 input + 200 output. About &lt;strong&gt;1700 tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Same task. Almost 5x the bill.&lt;/p&gt;

&lt;p&gt;It gets worse. Cost grows &lt;strong&gt;quadratically&lt;/strong&gt; with the number of turns, because&lt;br&gt;
each turn replays everything that came before. A 10-turn agent isn't 10x the&lt;br&gt;
cost. It's closer to 30x.&lt;/p&gt;
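
&lt;p&gt;To put rough numbers on that claim: if every turn re-sends the whole history and each tool result adds about the same amount of new context, the total input is an arithmetic series. A back-of-the-envelope sketch with assumed round numbers, not measurements from the bookmark agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Rough cost model, not a measurement. Assumes a fixed base prompt and a
// fixed chunk of new context per turn, all of it replayed on every turn.
const basePrompt = 700;    // system prompt + tool schemas + question (tokens)
const addedPerTurn = 300;  // average tool result size (tokens)

function totalInputTokens(turns: number): number {
  let total = 0;
  for (let t = 0; t &amp;lt; turns; t++) {
    total += basePrompt + addedPerTurn * t;  // turn t replays t earlier results
  }
  return total;
}

console.log(totalInputTokens(4));   // 4600
console.log(totalInputTokens(10));  // 20500, roughly 30x a single ~700-token call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
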
&lt;h2&gt;
  
  
  Three ways to stop the bleeding
&lt;/h2&gt;

&lt;p&gt;You're not stuck. Here's what actually works.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Prompt caching
&lt;/h3&gt;

&lt;p&gt;The biggest lever by far. Every major provider supports it now: OpenAI,&lt;br&gt;
Anthropic, Google. The system prompt and tool schemas don't change between&lt;br&gt;
turns, so cache them once and pay a fraction of the normal input price on&lt;br&gt;
every reuse (roughly a quarter on Gemini; the exact discount varies by provider).&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;@google/genai&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;systemInstruction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You are a bookmark organizer...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// these never change across turns&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;cachedContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For my 4-turn flow this cuts input costs by roughly half. Anthropic and OpenAI&lt;br&gt;
do the same thing with different syntax.&lt;/p&gt;

&lt;p&gt;Gemini also has &lt;em&gt;implicit&lt;/em&gt; caching. It auto-caches recent prefixes for you with&lt;br&gt;
zero code changes; you just see cheaper requests whenever a prompt prefix repeats.&lt;br&gt;
Check whether your provider has it on before reinventing the wheel.&lt;/p&gt;
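
&lt;p&gt;Either way, you can verify the cache is actually being hit: cached prompt tokens show up separately in the usage metadata on each response. A small sketch; again, the field names are an assumption about what your SDK version exposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Cached prompt tokens are reported separately from the full prompt count.
// Field names assumed from @google/genai's usageMetadata; verify for your version.
console.log('prompt tokens:   ', res.usageMetadata?.promptTokenCount);
console.log('of which cached: ', res.usageMetadata?.cachedContentTokenCount);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
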
&lt;h3&gt;
  
  
  2. Different model per turn
&lt;/h3&gt;

&lt;p&gt;The "decide which tool to call" turn is dumb work. It barely needs reasoning.&lt;br&gt;
It's pattern matching on a question. The final synthesis turn is where you&lt;br&gt;
actually want a smart model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cheap, fast: decides what to do&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-flash-lite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Smarter: writes the actual answer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;finalAnswer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a 4-turn flow, three of the turns can run on the cheap model. Only the last&lt;br&gt;
one, the user-facing answer, needs the expensive one. For high-volume agents&lt;br&gt;
this saves more than caching does.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Parallel tool calls
&lt;/h3&gt;

&lt;p&gt;The model can ask for multiple tools in a single response. Code I see in&lt;br&gt;
tutorials usually does &lt;code&gt;functionCalls[0]&lt;/code&gt; and silently drops the rest, turning&lt;br&gt;
what could be one round trip into many.&lt;/p&gt;

&lt;p&gt;The fix is one line of &lt;code&gt;Promise.all&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;functionCalls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dispatchers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For "summarize all my React bookmarks from last month," the model might call&lt;br&gt;
&lt;code&gt;searchBookmarks&lt;/code&gt; and &lt;code&gt;getDateRange&lt;/code&gt; in parallel. Handle both, and you save a&lt;br&gt;
whole round trip.&lt;/p&gt;
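
&lt;p&gt;From there, everything goes back in a single follow-up call, one &lt;code&gt;functionResponse&lt;/code&gt; part per tool, the same shape as the weather example earlier. A sketch assuming you've kept the prior turns in a &lt;code&gt;history&lt;/code&gt; array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Ship every result back at once, then let the model write the final answer.
const followUp = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: [
    ...history,  // prior turns, including the model's tool-call turn
    {
      role: 'user',
      parts: results.map(function (r) {
        return { functionResponse: { name: r.name, response: r.response } };
      }),
    },
  ],
  config: { tools },
});

console.log(followUp.text);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
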

&lt;h2&gt;
  
  
  What you can't optimize away
&lt;/h2&gt;

&lt;p&gt;Tools have a real cost, and they buy you real value. The reason you reach for&lt;br&gt;
them is the same reason they're expensive. You're forcing the model to use&lt;br&gt;
facts that exist outside its head instead of making them up.&lt;/p&gt;

&lt;p&gt;A plain LLM call will happily tell you the weather in Tokyo. It'll just be&lt;br&gt;
wrong.&lt;/p&gt;

&lt;p&gt;Quick way to think about it when picking an architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain LLM&lt;/strong&gt; is a guess from training data. Cheap, fast, hallucinates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools / agent&lt;/strong&gt; is real data. Expensive, slower, honest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most apps shouldn't be agents. If your task is "summarize this text I'm pasting&lt;br&gt;
in" or "rewrite this email," you don't need tools. You need one call. A lot of&lt;br&gt;
agent frameworks make it really easy to add tools by default, which makes it&lt;br&gt;
really easy to spend 5x what you should.&lt;/p&gt;

&lt;p&gt;Tools earn their cost when you have side effects (writing to a DB, sending a&lt;br&gt;
message), grounded data (today's weather, this user's bookmarks, current docs),&lt;br&gt;
or chained reasoning where intermediate steps actually need verification.&lt;/p&gt;

&lt;p&gt;They don't earn it on anything you could solve with one good prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The receipt
&lt;/h2&gt;

&lt;p&gt;Last week I added one tool to a Gemini call and watched the cost go from 850&lt;br&gt;
tokens to 1530 for the same question. Once I started parallelizing calls and&lt;br&gt;
caching the system prompt, I got the bookmark agent down to about 4500 tokens&lt;br&gt;
across all four turns. Still 2.5x a plain call, but way better than the 7900&lt;br&gt;
the naive version was burning.&lt;/p&gt;

&lt;p&gt;Your agent isn't a smarter LLM. It's the same LLM with a longer receipt. Once&lt;br&gt;
you can read the receipt, every optimization becomes obvious.&lt;/p&gt;

&lt;p&gt;If you like my content, support it with a like and a share 💟. Also, don't forget to follow me on &lt;a href="https://x.com/0bhishek" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/0bhishek" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. If you want to connect, check out &lt;a href="https://0bhishek.tech/" rel="noopener noreferrer"&gt;my site&lt;/a&gt;. See you in the next one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Built an NPM Package that Lets You Scaffold React Apps</title>
      <dc:creator>Abhishek</dc:creator>
      <pubDate>Wed, 03 Sep 2025 17:31:48 +0000</pubDate>
      <link>https://forem.com/mrcssdev/how-i-build-an-npm-package-that-lets-you-scaffold-react-apps-5a5h</link>
      <guid>https://forem.com/mrcssdev/how-i-build-an-npm-package-that-lets-you-scaffold-react-apps-5a5h</guid>
      <description>&lt;p&gt;Okay, so I finally made an NPM package so that you don't need to hunt down each and every dependency. In this blog, I am going to walk through how you can try this package yourself. &lt;br&gt;
I will also share how I came to make one more npm thingy. Here we go. &lt;/p&gt;

&lt;p&gt;Let's first start with a question: &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;What is An NPM Package?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;=&amp;gt; NPM is a package manager for Node.js packages, or modules if you like. &lt;a href="http://www.npmjs.com" rel="noopener noreferrer"&gt;www.npmjs.com&lt;/a&gt; hosts thousands of free packages to download and use. The NPM program is installed on your computer when you install Node.js. &lt;br&gt;
That's a pretty boring definition, so here is the simpler version:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An npm package is basically a &lt;u&gt;bundle of reusable code&lt;/u&gt; that you (or anyone) can install and use in a Node.js project. Think of it as a little Lego piece—easy to plug in, whether it’s for adding a button component, handling dates, or powering a whole framework. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;My First NPM package&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I found the process of setting up each package one by one tedious, so why not make a scaffold tool so that anyone can just run one command, hit 'enter... enter' a few times, and get their React project as is?&lt;br&gt;
I also added some CI/CD files and a vercel.json to make the process even more helpful. &lt;/p&gt;

&lt;p&gt;To use the package, enter this command in your terminal: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;npx react-starter-plus&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The prompts you need to follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Project name&lt;/strong&gt; → &lt;code&gt;my-react-app&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt; → JavaScript / TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git setup&lt;/strong&gt; → Initialize repo &amp;amp; push to GitHub (provide remote URL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extras&lt;/strong&gt; →
&lt;ul&gt;
&lt;li&gt;CI/CD with GitHub Actions?&lt;/li&gt;
&lt;li&gt;Zustand for state?&lt;/li&gt;
&lt;li&gt;React Testing Library?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt; → Deploy with Vercel (make sure you’re logged in with &lt;code&gt;vercel login&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy now or later&lt;/strong&gt; → Your call.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Get the Summary &amp;amp; Setup
&lt;/h3&gt;

&lt;p&gt;The CLI shows a summary before proceeding. If everything looks good, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installs dependencies&lt;/li&gt;
&lt;li&gt;Sets up Tailwind + routing&lt;/li&gt;
&lt;li&gt;Initializes Git &amp;amp; pushes to remote&lt;/li&gt;
&lt;li&gt;Configures CI/CD&lt;/li&gt;
&lt;li&gt;Deploys to Vercel&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  And done
&lt;/h3&gt;

&lt;p&gt;You’ll end up with something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✔ Deployment successful:
https://johndoes-project.vercel.app

→ Run it locally with `npm run dev`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can choose how your React project is going to be set up, &lt;br&gt;
and finally you get this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc82kx8f1tclk1kwlw7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc82kx8f1tclk1kwlw7o.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all for this project. If you found it helpful at any point, &lt;a href="https://github.com/iCoderabhishek/npm-package/tree/main/react-starter-plus" rel="noopener noreferrer"&gt;leave a star 🌟&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>react</category>
      <category>cli</category>
      <category>npm</category>
    </item>
  </channel>
</rss>
