<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edwar Diaz</title>
    <description>The latest articles on Forem by Edwar Diaz (@botoom).</description>
    <link>https://forem.com/botoom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F739122%2F7a4fce73-f218-44cf-938f-eb0b03ac9b90.png</url>
      <title>Forem: Edwar Diaz</title>
      <link>https://forem.com/botoom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/botoom"/>
    <language>en</language>
    <item>
      <title>Why Ignoring Token Costs Can Kill Your AI Product (and How to Fix It)</title>
      <dc:creator>Edwar Diaz</dc:creator>
      <pubDate>Wed, 25 Mar 2026 23:21:31 +0000</pubDate>
      <link>https://forem.com/botoom/why-ignoring-token-costs-can-kill-your-ai-product-and-how-to-fix-it-2c64</link>
      <guid>https://forem.com/botoom/why-ignoring-token-costs-can-kill-your-ai-product-and-how-to-fix-it-2c64</guid>
      <description>&lt;p&gt;When building applications powered by LLMs from providers like OpenAI, Google, or Mistral AI, there’s a detail that often gets overlooked:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;token cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At small scale, it’s barely noticeable. But once your application starts getting real usage, token consumption grows quickly—and if you’re not measuring it, you can easily end up with a feature that costs more than the value it delivers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real problem with token usage
&lt;/h2&gt;

&lt;p&gt;Every interaction with an LLM typically involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens (your prompt)&lt;/li&gt;
&lt;li&gt;output tokens (the model’s response)&lt;/li&gt;
&lt;li&gt;sometimes cache tokens, depending on the provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, these costs are small. But combined with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;longer prompts&lt;/li&gt;
&lt;li&gt;verbose outputs&lt;/li&gt;
&lt;li&gt;high request volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;they scale faster than most people expect.&lt;/p&gt;

&lt;p&gt;And there’s an important nuance here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not all models cost the same, and not all tasks require the same type of model.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Model selection is a cost decision
&lt;/h2&gt;

&lt;p&gt;It’s common to default to the most capable model available, but that’s rarely the most efficient choice.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you don’t need a reasoning-heavy model for simple transformations&lt;/li&gt;
&lt;li&gt;you don’t need multimodal capabilities if you're only processing text&lt;/li&gt;
&lt;li&gt;many providers offer smaller or optimized variants (mini, nano, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing the right model affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where cost awareness becomes part of system design, not just an afterthought.&lt;/p&gt;
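&lt;p&gt;As a sketch of that design decision, task-to-model routing can start as a simple lookup table. The tier names and task categories below are placeholders, not any provider's real catalog:&lt;/p&gt;

```python
# Hypothetical task-to-model routing table. The tier names are
# placeholders -- substitute your provider's actual model IDs.
MODEL_TIERS = {
    "simple_transform": "small-model",       # cheap, fast: formatting, extraction
    "text_generation": "standard-model",     # general drafting and summarization
    "complex_reasoning": "reasoning-model",  # most capable, most expensive
}

def pick_model(task_type):
    """Route each task to the cheapest tier that can handle it,
    defaulting to the most capable tier for unknown tasks."""
    return MODEL_TIERS.get(task_type, "reasoning-model")
```

&lt;p&gt;Even a table this small makes the cost decision explicit instead of silently defaulting every call to the most expensive model.&lt;/p&gt;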




&lt;h2&gt;
  
  
  Why you should estimate costs early
&lt;/h2&gt;

&lt;p&gt;If you’re building anything beyond a prototype, you should be able to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much does each request cost?&lt;/li&gt;
&lt;li&gt;what is the expected daily usage?&lt;/li&gt;
&lt;li&gt;what does that translate to monthly or yearly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frameworks and platforms like LangChain, Azure AI Foundry, or Amazon Bedrock usually expose token usage metrics (input/output/cache). That&amp;#39;s helpful, but incomplete.&lt;/p&gt;

&lt;p&gt;In many cases, you still need to map those numbers to actual pricing yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Calculating token costs
&lt;/h2&gt;

&lt;p&gt;If you already have token counts, the calculation is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cost = (input_tokens / 1000 * input_price) + (output_tokens / 1000 * output_price)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
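&lt;p&gt;The same formula in Python, extended with a simple monthly projection. The prices below are illustrative placeholders, not real provider rates:&lt;/p&gt;

```python
# Illustrative per-1K-token prices in USD -- replace with the
# actual rates from your provider's pricing page.
INPUT_PRICE_PER_1K = 0.00015
OUTPUT_PRICE_PER_1K = 0.0006

def request_cost(input_tokens, output_tokens,
                 input_price=INPUT_PRICE_PER_1K,
                 output_price=OUTPUT_PRICE_PER_1K):
    """Cost of one request: tokens scaled to thousands, times the per-1K price."""
    return (input_tokens / 1000 * input_price) + (output_tokens / 1000 * output_price)

def monthly_cost(cost_per_request, requests_per_day, days=30):
    """Project a per-request cost to an expected monthly spend."""
    return cost_per_request * requests_per_day * days

per_request = request_cost(input_tokens=1200, output_tokens=400)
print(f"per request: ${per_request:.6f}")                                   # $0.000420
print(f"monthly at 5,000 req/day: ${monthly_cost(per_request, 5000):.2f}")  # $63.00
```

&lt;p&gt;Running numbers like these early answers the three questions above (per request, per day, per month) before the bill does.&lt;/p&gt;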



&lt;p&gt;The challenge is when you don’t have those token counts directly.&lt;/p&gt;

&lt;p&gt;In that case, you can approximate them by tokenizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the input text you send&lt;/li&gt;
&lt;li&gt;the expected output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you a reasonable baseline for estimation.&lt;/p&gt;
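&lt;p&gt;If you can&amp;#39;t run the provider&amp;#39;s tokenizer, a common rule of thumb is that English text averages roughly four characters per token. A rough sketch of that baseline:&lt;/p&gt;

```python
def approx_tokens(text):
    """Very rough token estimate using the ~4 characters-per-token
    rule of thumb for English text. For exact counts, use the
    provider's own tokenizer (e.g. tiktoken for OpenAI models)."""
    return max(1, len(text) // 4)

prompt = "Summarize the following support ticket in two sentences."
print(approx_tokens(prompt))  # prints 14
```

&lt;p&gt;The heuristic is only for ballpark budgeting; switch to the real tokenizer once model choice is fixed.&lt;/p&gt;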




&lt;h2&gt;
  
  
  Tools that make this easier
&lt;/h2&gt;

&lt;p&gt;There are a couple of tools that simplify this process.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Prices
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.llm-prices.com/" rel="noopener noreferrer"&gt;https://www.llm-prices.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tool lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input token counts&lt;/li&gt;
&lt;li&gt;select specific models&lt;/li&gt;
&lt;li&gt;estimate cost per request&lt;/li&gt;
&lt;li&gt;define custom pricing if needed&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Token Budget Calculator
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://tokenbudget.edwardiaz.dev/" rel="noopener noreferrer"&gt;https://tokenbudget.edwardiaz.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more complete approach is to use a tool that combines token estimation with cost projection.&lt;/p&gt;

&lt;p&gt;With this kind of platform, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;paste input and output text&lt;/li&gt;
&lt;li&gt;automatically estimate token usage&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost per request&lt;/li&gt;
&lt;li&gt;daily cost&lt;/li&gt;
&lt;li&gt;monthly cost&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;define request frequency (per day / per month)&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It also allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compare across a large set of models (100+)&lt;/li&gt;
&lt;li&gt;filter by provider or capabilities&lt;/li&gt;
&lt;li&gt;sort by cost efficiency&lt;/li&gt;
&lt;li&gt;get a recommendation for the most cost-effective model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also provides API support, making it possible to integrate cost estimation directly into your own systems. This is especially useful if you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;track cost per request internally&lt;/li&gt;
&lt;li&gt;build usage dashboards&lt;/li&gt;
&lt;li&gt;enforce budgets or limits at the application level&lt;/li&gt;
&lt;/ul&gt;
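&lt;p&gt;The budget-enforcement idea can be sketched at the application level with a simple guard. This is an illustrative pattern only, not part of any tool&amp;#39;s real API:&lt;/p&gt;

```python
class TokenBudget:
    """Application-level spend guard: rejects requests once the
    accumulated cost would exceed a monthly limit. Illustrative only."""

    def __init__(self, monthly_limit_usd):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        """Record a request's cost; return False if it would blow the budget."""
        if self.spent + cost_usd > self.limit:
            return False
        self.spent += cost_usd
        return True

budget = TokenBudget(monthly_limit_usd=50.0)
print(budget.charge(0.0004))  # True: well within budget
```

&lt;p&gt;In a real system the running total would live in shared storage (a database or cache) rather than in-process state.&lt;/p&gt;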




&lt;h2&gt;
  
  
  Planning for scale
&lt;/h2&gt;

&lt;p&gt;Once you start tracking token usage and costs, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forecast infrastructure expenses&lt;/li&gt;
&lt;li&gt;define budgets&lt;/li&gt;
&lt;li&gt;prevent unexpected spikes&lt;/li&gt;
&lt;li&gt;choose models more intentionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what turns an experimental feature into something sustainable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tokens also impact rate limits
&lt;/h2&gt;

&lt;p&gt;Cost is only one side of the problem.&lt;/p&gt;

&lt;p&gt;Many providers enforce limits based on tokens, such as tokens per minute. If your prompts or outputs are too large, you may run into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;throttling&lt;/li&gt;
&lt;li&gt;increased latency&lt;/li&gt;
&lt;li&gt;failed requests under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reducing token usage helps both with cost and system stability.&lt;/p&gt;
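&lt;p&gt;On the rate-limit side, a client-side tokens-per-minute guard can be sketched with a sliding window. Again, this is an illustrative pattern, not a provider SDK:&lt;/p&gt;

```python
import time
from collections import deque

class TokenRateLimiter:
    """Client-side tokens-per-minute (TPM) guard: tracks token usage
    in a 60-second sliding window and reports whether a new request
    would fit under the quota. Illustrative sketch."""

    def __init__(self, tokens_per_minute):
        self.tpm = tokens_per_minute
        self.events = deque()  # (timestamp, tokens) pairs

    def allow(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        # Discard usage that has aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        if used + tokens > self.tpm:
            return False  # caller should wait or shrink the request
        self.events.append((now, tokens))
        return True
```

&lt;p&gt;Checking the window before each call turns provider throttling errors into a deliberate wait-or-shrink decision in your own code.&lt;/p&gt;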




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Understanding cost is the first step. The next one is optimization.&lt;/p&gt;

&lt;p&gt;In a follow-up post, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompt optimization techniques&lt;/li&gt;
&lt;li&gt;reducing token usage without losing quality&lt;/li&gt;
&lt;li&gt;practical ways to make LLM integrations more efficient&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you’re not measuring token usage, you’re making decisions without visibility.&lt;/p&gt;

&lt;p&gt;Tracking tokens, estimating costs, and choosing the right model are not optional if you care about building scalable AI systems.&lt;/p&gt;

&lt;p&gt;It’s a small investment early on that can save you a lot later.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tokenization</category>
      <category>aicost</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Built a Context-Aware AI Browser Mentor Powered by GitHub Copilot CLI</title>
      <dc:creator>Edwar Diaz</dc:creator>
      <pubDate>Fri, 13 Feb 2026 20:58:37 +0000</pubDate>
      <link>https://forem.com/botoom/i-built-a-context-aware-ai-browser-mentor-powered-by-github-copilot-cli-4p5j</link>
      <guid>https://forem.com/botoom/i-built-a-context-aware-ai-browser-mentor-powered-by-github-copilot-cli-4p5j</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-01-21"&gt;GitHub Copilot CLI Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  I Built a Context-Aware AI Browser Mentor Powered by GitHub Copilot CLI
&lt;/h1&gt;

&lt;p&gt;What if GitHub Copilot CLI could see what you see, understand your context, and help you without breaking your workflow?&lt;/p&gt;

&lt;p&gt;That question led me to build &lt;strong&gt;DevMentorAI&lt;/strong&gt; — a browser extension that transforms Copilot CLI into a real-time AI mentor inside your browser.&lt;/p&gt;

&lt;p&gt;Built entirely with GitHub Copilot CLI — from extension to backend to landing page to release workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DevMentorAI&lt;/strong&gt; is a context-aware AI assistant that lives inside your browser and understands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What page you're on&lt;/li&gt;
&lt;li&gt;What text you've selected&lt;/li&gt;
&lt;li&gt;What you're trying to write&lt;/li&gt;
&lt;li&gt;What you're troubleshooting&lt;/li&gt;
&lt;li&gt;What you want to improve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of you copying context into prompts by hand, DevMentorAI sends it to Copilot CLI automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✨ Core capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📄 Context capture from the current page&lt;/li&gt;
&lt;li&gt;📸 Screenshot understanding&lt;/li&gt;
&lt;li&gt;✍️ Grammar correction &amp;amp; rewriting&lt;/li&gt;
&lt;li&gt;🔄 Replace text directly inside inputs (emails, chats, forms)&lt;/li&gt;
&lt;li&gt;🛠 Works for development, DevOps, writing, learning, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No framework lock-in. No domain restriction. Just AI assistance anywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎥 Demo
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🌐 Project Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Landing page&lt;br&gt;
&lt;a href="https://devmentorai.edwardiaz.dev/" rel="noopener noreferrer"&gt;https://devmentorai.edwardiaz.dev/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Installation Guide&lt;br&gt;
&lt;a href="https://devmentorai.edwardiaz.dev/installation" rel="noopener noreferrer"&gt;https://devmentorai.edwardiaz.dev/installation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Repository&lt;br&gt;
&lt;a href="https://github.com/BOTOOM/devmentorai" rel="noopener noreferrer"&gt;https://github.com/BOTOOM/devmentorai&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backend NPX Package&lt;br&gt;
&lt;a href="https://www.npmjs.com/package/devmentorai-server" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/devmentorai-server&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extension Downloads&lt;br&gt;
&lt;a href="https://github.com/BOTOOM/devmentorai/releases" rel="noopener noreferrer"&gt;https://github.com/BOTOOM/devmentorai/releases&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ▶️ Full Walkthrough
&lt;/h3&gt;

&lt;p&gt;📺 Video:&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/Z_MnW1hJubM"&gt;
  &lt;/iframe&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚡ Feature Highlights
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Context-aware assistance
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv89lcr6gfzms39dmrqmb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv89lcr6gfzms39dmrqmb.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Grammar correction replacing text directly in inputs
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxh70tdybvvkfudmdhqa.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxh70tdybvvkfudmdhqa.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Installation from zero using NPX backend
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rt9xt32m4ovtmgkynje.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rt9xt32m4ovtmgkynje.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Extension captures page context + optional screenshot&lt;/li&gt;
&lt;li&gt;Sends to local backend&lt;/li&gt;
&lt;li&gt;Backend communicates with Copilot CLI&lt;/li&gt;
&lt;li&gt;AI response returned&lt;/li&gt;
&lt;li&gt;Optional direct replacement into page inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates a seamless AI workflow without leaving the browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔐 Privacy &amp;amp; Security
&lt;/h2&gt;

&lt;p&gt;DevMentorAI runs locally and respects user control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No credentials required&lt;/li&gt;
&lt;li&gt;Uses your Copilot CLI session&lt;/li&gt;
&lt;li&gt;Backend runs locally via NPX&lt;/li&gt;
&lt;li&gt;Users control what context is shared&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🤖 My Experience with GitHub Copilot CLI
&lt;/h2&gt;

&lt;p&gt;Using GitHub Copilot CLI was both enriching and fun. I discovered capabilities far beyond what I previously experienced using Copilot inside editors.&lt;/p&gt;

&lt;p&gt;I started with little experience using Copilot CLI, but by following the official documentation and experimenting with slash commands, I learned how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create custom agents&lt;/li&gt;
&lt;li&gt;Implement skills (including WXT extension knowledge)&lt;/li&gt;
&lt;li&gt;Use advanced TypeScript skills&lt;/li&gt;
&lt;li&gt;Plan and execute complex builds through the CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most impressive features was agent-based planning mode. Copilot could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Plan the entire feature&lt;/li&gt;
&lt;li&gt;Execute it&lt;/li&gt;
&lt;li&gt;Iterate quickly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All from the terminal — lightweight and extremely fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 What surprised me most
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Copilot CLI enabled building an entire full-stack project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser extension&lt;/li&gt;
&lt;li&gt;Backend server&lt;/li&gt;
&lt;li&gt;Landing page&lt;/li&gt;
&lt;li&gt;Release workflows&lt;/li&gt;
&lt;li&gt;NPX package&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The planning file memory system was incredibly powerful.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧪 Workflow I Discovered
&lt;/h3&gt;

&lt;p&gt;As sessions grew large, context sometimes became less effective. I learned to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start a new session per major feature&lt;/li&gt;
&lt;li&gt;Refine functionality within that session&lt;/li&gt;
&lt;li&gt;Commit once complete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dramatically improved results.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠 Problem Solving with Copilot
&lt;/h3&gt;

&lt;p&gt;Occasionally, Copilot would get stuck in a loop trying the same solution. When that happened, guiding it to consider alternative perspectives helped it resolve issues successfully.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Why This Matters
&lt;/h2&gt;

&lt;p&gt;DevMentorAI demonstrates a new paradigm:&lt;/p&gt;

&lt;p&gt;AI assistance that adapts to your context instead of forcing you to adapt to it.&lt;/p&gt;

&lt;p&gt;GitHub Copilot CLI made this possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  👤 Author
&lt;/h2&gt;

&lt;p&gt;Edwar Diaz&lt;br&gt;
DEV: &lt;a class="mentioned-user" href="https://dev.to/botoom"&gt;@botoom&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/BOTOOM" rel="noopener noreferrer"&gt;https://github.com/BOTOOM&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>cli</category>
      <category>githubcopilot</category>
    </item>
  </channel>
</rss>
