<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Akromdev</title>
    <description>The latest articles on Forem by Akromdev (@akromdev).</description>
    <link>https://forem.com/akromdev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3761554%2F151b5aee-279b-47c2-80ad-48b8d4fd29ee.png</url>
      <title>Forem: Akromdev</title>
      <link>https://forem.com/akromdev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/akromdev"/>
    <language>en</language>
    <item>
      <title>I Built an AI Chatbot That Knows Everything About Me</title>
      <dc:creator>Akromdev</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:21:59 +0000</pubDate>
      <link>https://forem.com/akromdev/i-built-an-ai-chatbot-that-knows-everything-about-me-4jeb</link>
      <guid>https://forem.com/akromdev/i-built-an-ai-chatbot-that-knows-everything-about-me-4jeb</guid>
      <description>&lt;p&gt;My portfolio site has project pages, work experience entries, and blog posts, all written as MDX files. When someone visits, they usually have a specific question: "Has this person worked with React?" or "What's their most recent project?" The answer is somewhere on the site, but finding it means clicking through pages and scanning project cards.&lt;/p&gt;

&lt;p&gt;I wanted visitors to be able to just ask. Not a FAQ page with canned answers, but something that reads the actual content on the site and answers questions from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Feed It Everything?
&lt;/h2&gt;

&lt;p&gt;Your first thought might be: take all the content, send it to a language model like GPT-4o or Claude, and let it answer questions. This works for short content. But language models hallucinate. Ask about a technology you never mentioned, and the model might confidently say "yes, they have 3 years of experience with that" because it sounds plausible.&lt;/p&gt;

&lt;p&gt;There's also a scale problem. My site has around 30 content files. Sending all of them as context every time someone asks a question is wasteful, and the more content you include, the more room there is for the model to drift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Search First, Then Answer
&lt;/h2&gt;

&lt;p&gt;Instead of sending everything, what if I first searched my own content to find the pieces relevant to the question, and only sent those to the model? That's the core idea behind RAG (Retrieval-Augmented Generation). The model writes its answer from a small, focused context instead of your entire site. Because it only sees what's relevant, it stays grounded in what's actually there.&lt;/p&gt;

&lt;p&gt;To make this work, I needed three things: a way to split my content into searchable pieces, a way to search by meaning (not just keywords), and a language model to write the final answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Splitting Content Into Chunks
&lt;/h2&gt;

&lt;p&gt;My content lives in MDX files: one per project, one per job, one per blog post. Some of these are long. A single project page might describe the tech stack, what I built, and how it works, all in one file. Sending an entire file as context when the user only asked about the tech stack wastes tokens and adds noise.&lt;/p&gt;

&lt;p&gt;So I split each file into smaller chunks at paragraph boundaries, capped at 500 characters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chunkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxLen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;para&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;maxLen&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
      &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;para&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;para&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I learned through testing: raw chunks with no context confused the model. A chunk that says "Built with TypeScript and PostgreSQL" is meaningless without knowing whether it's describing a personal project or a company I worked at. The fix was adding type prefixes. Every chunk starts with &lt;code&gt;[PROJECT]&lt;/code&gt;, &lt;code&gt;[WORK EXPERIENCE]&lt;/code&gt;, &lt;code&gt;[BLOG POST]&lt;/code&gt;, or &lt;code&gt;[PROFILE]&lt;/code&gt;, so the AI immediately knows what kind of content it's looking at. I also added catalog chunks (complete lists of all projects or all work history) so questions like "list all my projects" don't return partial results.&lt;/p&gt;
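A sketch of that prefixing step (the names and shapes here are illustrative, not the exact code):

```typescript
type ContentType = "project" | "work" | "blog" | "profile";

// Label text for each content type, matching the prefixes described above.
const PREFIXES: Record<ContentType, string> = {
  project: "[PROJECT]",
  work: "[WORK EXPERIENCE]",
  blog: "[BLOG POST]",
  profile: "[PROFILE]",
};

// Prepend the type and source title so a chunk like "Built with TypeScript
// and PostgreSQL" carries enough context to be interpreted on its own.
function prefixChunk(type: ContentType, title: string, chunk: string): string {
  return `${PREFIXES[type]} ${title}\n${chunk}`;
}

// A catalog chunk lists every entry of one type, so questions like
// "list all my projects" never retrieve a partial answer.
function catalogChunk(type: ContentType, titles: string[]): string {
  return `${PREFIXES[type]} Complete list:\n- ${titles.join("\n- ")}`;
}
```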




&lt;h2&gt;
  
  
  Searching by Meaning
&lt;/h2&gt;

&lt;p&gt;Now I have chunks, but how do I find which ones are relevant to a question? Keyword search is the obvious choice, but it's brittle. If someone asks about "React experience" and my project description says "built with NextJS", there's no keyword match, even though NextJS is a React framework.&lt;/p&gt;

&lt;p&gt;This is where embeddings come in. An embedding model takes a piece of text and converts it into a list of numbers that represent its meaning. "React" and "NextJS" produce similar numbers because they're related concepts. "PostgreSQL" and "Redis" end up close together because they're both databases. When someone asks about "React experience", the question gets converted to numbers too, and it naturally lands close to anything frontend-related in my content.&lt;/p&gt;

&lt;p&gt;To convert text into these numbers, you need an embedding model. My first attempt used the HuggingFace Inference API, which worked, but had a problem: 0.5 seconds when the model was warm, 9.4 seconds when it was cold. HuggingFace spins down free-tier models after inactivity, so the chatbot would randomly hang for nearly 10 seconds. I switched to running the same model locally. &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is a popular open-source option, only 22MB, and it produces 384 numbers per piece of text in about 12ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@huggingface/transformers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;feature-extraction&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Xenova/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;embedText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;pooling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mean&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// 384 numbers&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At build time, I run this on every chunk and save the results to a JSON file. At runtime, I embed the user's question and find the closest chunks by comparing their numbers using cosine similarity (how much two sets of numbers point in the same direction):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;searchChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;topK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embedText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
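The `cosineSimilarity` helper the search calls isn't shown above; a standard implementation looks like this (and since the embeddings were created with `normalize: true`, the division by magnitudes is effectively a no-op):

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their magnitudes. Returns 1 for identical directions, 0 for orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```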



&lt;p&gt;If you're working with thousands of chunks, you'd want a vector database like Pinecone or Weaviate to handle the search. For a personal site with around 160 chunks, looping through all of them in memory works fine.&lt;/p&gt;
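The build-time half, precomputing every chunk's embedding into that JSON file, can be sketched like this (the chunk shape, the output path, and passing the embed function in as a parameter are illustrative, not the actual script):

```typescript
import { writeFile } from "node:fs/promises";

type Chunk = { id: string; text: string };

// Embed every chunk once at build time and persist the vectors, so the
// runtime only ever has to embed the user's question.
async function buildEmbeddings(
  chunks: Chunk[],
  embed: (text: string) => Promise<number[]>,
  outPath = "embeddings.json",
) {
  const embedded: (Chunk & { embedding: number[] })[] = [];
  for (const chunk of chunks) {
    embedded.push({ ...chunk, embedding: await embed(chunk.text) });
  }
  await writeFile(outPath, JSON.stringify(embedded));
  return embedded;
}
```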




&lt;h2&gt;
  
  
  Generating the Answer
&lt;/h2&gt;

&lt;p&gt;At this point I have the top 8 chunks most relevant to the user's question. The last step is sending them to a language model to write a readable answer.&lt;/p&gt;

&lt;p&gt;I went with Groq's free tier running Llama 3.1 8B. The model doesn't know anything about me by default. It only sees whatever chunks I send it. The system prompt tells it how to interpret the content and what the type prefixes mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful assistant on a personal website.
Answer questions using only the provided context.

Pay attention to type labels:
- [PROJECT]: Portfolio projects
- [WORK EXPERIENCE]: Employment history
- [BLOG POST]: Articles written
- [PROFILE]: Personal info

Keep answers concise and friendly. Do not make up information.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.groq.com/openai/v1/chat/completions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;conversationHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Context:\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;relevantChunks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nQuestion: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature controls how creative the model gets. At 0.3 (on a scale that runs from 0 to 2 for OpenAI-compatible APIs), it stays close to the most likely answer, which is what you want when accuracy matters. Conversation history (the last 10 messages) goes in with each request so follow-up questions like "tell me more about that project" work without losing context.&lt;/p&gt;
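Trimming to the last 10 messages can be a one-line slice before each request (a sketch; the message shape matches the OpenAI-style API above):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep only the most recent messages so the request stays small while
// follow-ups like "tell me more about that project" still have a referent.
function trimHistory(history: Message[], maxMessages = 10): Message[] {
  return history.slice(-maxMessages);
}
```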




&lt;h2&gt;
  
  
  Deploying to Vercel
&lt;/h2&gt;

&lt;p&gt;At this point everything worked locally and I was ready to deploy and move on. The chatbot ran as a serverless function through Astro's Vercel adapter, the model was only 22MB, and the embeddings were a static JSON file. Should have been the easy part.&lt;/p&gt;

&lt;p&gt;I deployed and immediately hit Vercel's 250MB size limit on serverless functions. The model is only 22MB, so that wasn't the issue. &lt;code&gt;@huggingface/transformers&lt;/code&gt; depends on &lt;code&gt;onnxruntime-node&lt;/code&gt;, which ships native binaries for every platform. They all get bundled into your function, and that alone pushes you way past 250MB.&lt;/p&gt;

&lt;p&gt;There's a lighter alternative called &lt;code&gt;onnxruntime-web&lt;/code&gt; that uses WebAssembly instead of native binaries, around 11MB. But it's built for browsers. Run it in Node.js and it tries to fetch WASM files from a CDN over HTTPS, which Node.js refuses to do.&lt;/p&gt;

&lt;p&gt;The workaround: swap &lt;code&gt;onnxruntime-node&lt;/code&gt; for &lt;code&gt;onnxruntime-web&lt;/code&gt; with a pnpm override, copy the WASM files to a local directory during the build, and tell the runtime to load them from the filesystem instead of the CDN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;wasmDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.wasm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;onnxEnv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wasm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wasmPaths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;wasm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`file://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;wasmDir&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/ort-wasm-simd-threaded.wasm`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;mjs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`file://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;wasmDir&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/ort-wasm-simd-threaded.mjs`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="nx"&gt;onnxEnv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wasm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;numThreads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Vercel's &lt;code&gt;includeFiles&lt;/code&gt; bundling the model and WASM into the function, the same local inference that works on my laptop works in production. No embedding API, no cold starts, no cost.&lt;/p&gt;
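The override itself lives in package.json and looks roughly like this (the version pin is illustrative; check the current `onnxruntime-web` release against your `@huggingface/transformers` version):

```json
{
  "pnpm": {
    "overrides": {
      "onnxruntime-node": "npm:onnxruntime-web@1.21.0"
    }
  }
}
```

The model and WASM files then go in the adapter's `includeFiles` list so they ship inside the function bundle.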




&lt;h2&gt;
  
  
  What It Costs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Embedding a query: ~50ms&lt;/li&gt;
&lt;li&gt;Searching 164 chunks: under 1ms&lt;/li&gt;
&lt;li&gt;LLM response: ~400ms&lt;/li&gt;
&lt;li&gt;Total: under 500ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monthly cost: $0. Groq's free tier covers the LLM, embeddings run inside the serverless function, and chunk data is a static JSON file built at deploy time.&lt;/p&gt;

&lt;p&gt;The whole thing is around 250 lines of TypeScript. There's a chat button on &lt;a href="https://akrom.dev" rel="noopener noreferrer"&gt;my site&lt;/a&gt; if you want to try it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://akrom.dev/blog/building-rag-chatbot" rel="noopener noreferrer"&gt;akrom.dev&lt;/a&gt;. For quick dev tips, join &lt;a href="https://t.me/akromdotdev" rel="noopener noreferrer"&gt;@akromdotdev on Telegram&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>typescript</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Cursor vs Claude Code: Why I Switched After a $500 Bill</title>
      <dc:creator>Akromdev</dc:creator>
      <pubDate>Tue, 10 Feb 2026 22:26:26 +0000</pubDate>
      <link>https://forem.com/akromdev/cursor-vs-claude-code-why-i-switched-after-a-500-bill-33jp</link>
      <guid>https://forem.com/akromdev/cursor-vs-claude-code-why-i-switched-after-a-500-bill-33jp</guid>
      <description>&lt;h2&gt;
  
  
  How We Got Here
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This section is a brief history of AI coding tools. If you just want the Cursor vs Claude Code comparison, skip to The $500 Problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I still remember the first time I copied code from ChatGPT into my editor. It was late 2022, and the whole thing felt like magic wrapped in duct tape. Open browser, type question, get code, copy, paste, fix imports, repeat. Clunky? Absolutely. But it worked, and that was enough to make it feel revolutionary.&lt;/p&gt;

&lt;p&gt;GitHub Copilot had actually arrived earlier (technical preview in 2021, general availability in mid-2022), and it put AI right inside our editors. No more context switching. Just start typing and watch the ghost text appear. It felt like the future, until you actually used it for a while.&lt;/p&gt;

&lt;p&gt;Early Copilot had real problems. It ran only on OpenAI's Codex model, with no model selection until late 2024. And it saw only your current file: no project-wide awareness, no understanding of your architecture, no context beyond the code you were staring at.&lt;/p&gt;

&lt;p&gt;The quality wasn't great either. GitHub's own benchmarks showed 43% first-try accuracy, meaning it was wrong more often than it was right. An &lt;a href="https://cyber.nyu.edu/2021/10/15/ccs-researchers-find-github-copilot-generates-vulnerable-code-40-of-the-time/" rel="noopener noreferrer"&gt;NYU study&lt;/a&gt; found that 40% of its generated code had security vulnerabilities. Not reassuring when you're shipping to production.&lt;/p&gt;

&lt;p&gt;The autocomplete would pop up at the weirdest moments, fighting with IntelliSense, inserting code in the wrong place, or leaving you with mismatched brackets to clean up. It broke your flow more often than it helped.&lt;/p&gt;

&lt;p&gt;But we kept using it because, well, what else was there?&lt;/p&gt;

&lt;p&gt;Cursor changed things in 2023-2024. Tab completion that actually worked. Model selection so you could pick what you wanted. Then they acquired Supermaven and suddenly autocomplete was sub-500ms, fast enough that it felt like reading your mind instead of lagging behind your thoughts.&lt;/p&gt;

&lt;p&gt;And then the model arms race kicked into high gear. Sonnet 3.5 was the breakout star, fast, capable, and good enough for most tasks. Sonnet 4 raised the bar even higher. Thinking models arrived and gave us a glimpse of what was possible when you let the AI actually reason through problems. And then Opus dropped, and everything changed.&lt;/p&gt;

&lt;p&gt;Once you've used Opus for a complex refactor or a gnarly architectural decision, going back to Sonnet feels like trading in a sports car for a bicycle. It's not that Sonnet is bad, it's great for most things. But Opus just &lt;em&gt;gets it&lt;/em&gt; in a way that other models don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The $500 Problem
&lt;/h2&gt;

&lt;p&gt;The difference between Opus and Sonnet isn't subtle. When you're doing complex refactors, making architectural decisions, or debugging something that requires understanding six layers of abstraction, Opus is in a different league.&lt;/p&gt;

&lt;p&gt;It's the difference between "here's a solution that might work" and "here's the solution, and here's why the three alternatives you're probably thinking about won't work in your specific case." But that quality comes at a price. A steep one.&lt;/p&gt;

&lt;p&gt;Opus 4 costs $15 per million input tokens and $75 per million output tokens. Sonnet 4 costs $3 input and $15 output. That's a 5x price difference, and when you're using these tools all day, every day, it adds up fast.&lt;/p&gt;
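To see how quickly that compounds, take a hypothetical month of 20 million input and 5 million output tokens (made-up volumes, purely for illustration):

```typescript
// Rates are per million tokens; volumes are hypothetical.
const inputMillions = 20;
const outputMillions = 5;

const opusCost = inputMillions * 15 + outputMillions * 75;  // 300 + 375 = $675
const sonnetCost = inputMillions * 3 + outputMillions * 15; // 60 + 75 = $135
```

Same workload, five times the bill.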

&lt;p&gt;My typical Sonnet months on Cursor's usage-based billing ran $100-150. Annoying but manageable. My first month using Opus heavily? $477 in 28 days. Luckily my client covered the bills, but I still felt guilty seeing that number.&lt;/p&gt;

&lt;p&gt;I started thinking about tokens instead of code. Should I start a new chat to save context? Is this refactor worth the cost? Maybe there are tricks to reduce token consumption, but I didn't want to become an expert in prompt optimization. I wanted to code and not care about billing.&lt;/p&gt;

&lt;p&gt;So I started looking for alternatives, and found Claude Code Max.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claude Code Max: Pay $100 and Stop Worrying
&lt;/h2&gt;

&lt;p&gt;Claude Code Max is Anthropic's answer to the usage-based billing problem. Pay $100 a month for Max 5x, or $200 for Max 20x, and you get full Opus access with no per-token charges. Just a flat rate and a generous usage cap.&lt;/p&gt;

&lt;p&gt;The limits work on a rolling window system, which sounds complicated but is actually pretty simple in practice. You get a 5-hour rolling window for burst usage. It doesn't reset on a fixed schedule. Instead, it's based on when &lt;em&gt;you&lt;/em&gt; started your session.&lt;/p&gt;

&lt;p&gt;There's also a weekly active hours cap, where "active" means when Claude is processing tokens, not when you're reading output or thinking.&lt;/p&gt;

&lt;p&gt;Max 5x gives you all-day Sonnet access and a generous Opus allocation. In practice, that's more than enough for full-time development. Personally, I haven't hit the hourly limit yet. If you do, it resets in a few hours. Not the worst excuse to step away and make sure you still remember how to code without AI.&lt;/p&gt;

&lt;p&gt;The mental shift was immediate. I stopped thinking about money. No more optimizing prompts to save tokens, no more second-guessing whether a question was "worth it." I just worked.&lt;/p&gt;

&lt;p&gt;And the math? ~$500 a month down to $100 is over $4,000 saved per year for essentially the same Opus access.&lt;/p&gt;

&lt;p&gt;But switching tools isn't just about money. If Claude Code were worse than Cursor in every other way, the savings wouldn't matter. So let's break down what each tool actually does better.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Cursor Still Does Better
&lt;/h2&gt;

&lt;p&gt;I'm not here to tell you Cursor is bad. It's not. There are real things it does better than Claude Code, and if you're considering the switch, you should know what you're trading away.&lt;/p&gt;

&lt;p&gt;Tab completion is the big one. Cursor's Fusion model combined with Supermaven is genuinely impressive. Sub-500ms response times with 13K token context windows. It doesn't just predict what you're about to type; it predicts where you're about to navigate next.&lt;/p&gt;

&lt;p&gt;When you're making lots of small edits across multiple files, this alone might be worth staying on Cursor for. Claude Code doesn't even try to compete here.&lt;/p&gt;

&lt;p&gt;Then there's the GUI advantage. Selecting files, folders, code snippets, terminal output — just click and add to chat. In Claude Code, you type &lt;code&gt;@filename&lt;/code&gt; or specify line numbers manually. It's not hard, but it's not as smooth. When you're trying to quickly pull together context from six different files, point-and-click is faster than typing.&lt;/p&gt;

&lt;p&gt;Diff review is another place where Cursor shines. You get inline red-green diffs with accept-reject buttons for each change. It's intuitive, and it makes reviewing AI-generated changes feel natural. Claude Code works in the terminal, which means you're either reviewing diffs in your editor or using &lt;code&gt;git diff&lt;/code&gt;. You can get visual diffs if you use the Zed editor integration, but Zed has its own quirks (more on that later).&lt;/p&gt;

&lt;p&gt;Multi-model fallback is a nice safety net. When Anthropic has an outage (rare, but it happens), Cursor lets you switch to GPT or Gemini and keep working. Claude Code is Claude only. If Anthropic is down, you're done.&lt;/p&gt;

&lt;p&gt;Cursor has background agents that run in the cloud. You can kick off a complex task, switch to something else, and come back when it's done. Claude Code doesn't have anything like this yet.&lt;/p&gt;

&lt;p&gt;I've also noticed Claude Code occasionally slowing down with large conversations. There are known issues where context units above 1,000 can cause typing delays (see &lt;a href="https://github.com/anthropics/claude-code/issues/12222" rel="noopener noreferrer"&gt;issue #12222&lt;/a&gt;). Cursor seemed to handle big feature work more gracefully, though I haven't done enough side-by-side testing to say that definitively.&lt;/p&gt;

&lt;p&gt;And finally, IDE integration depth. Cursor has been building on top of VS Code for years. Chrome DevTools integration, speech-to-text, polished GUI features. It's a mature product. Claude Code is newer and rougher around the edges.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Code Does Better
&lt;/h2&gt;

&lt;p&gt;This might not matter to everyone, but I love that it's CLI-native. I'm in the terminal all day anyway. Cursor's chat panel takes up space inside your editor, competing with your code for screen real estate. Claude Code is just another terminal pane, same place you're already running builds, git, and tests. Reference files with &lt;code&gt;@&lt;/code&gt;, drag and drop images straight in. It fits into the workflow you already have instead of adding a new one on top.&lt;/p&gt;

&lt;p&gt;The CLAUDE.md system is where Claude Code really pulls ahead.&lt;/p&gt;

&lt;p&gt;CLAUDE.md is persistent project memory. It loads automatically at the start of every session: conventions, architecture, common commands, gotchas. Claude never forgets. You write it once, and every conversation starts with that context already loaded. No more explaining the same things over and over.&lt;/p&gt;
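&lt;p&gt;To make that concrete, here's a minimal sketch of what a CLAUDE.md might look like. It's free-form markdown; all the project details below are hypothetical, just to show the shape:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# CLAUDE.md

## Project
Next.js app router, TypeScript strict mode, pnpm.

## Commands
- pnpm dev   # local dev server
- pnpm test  # vitest
- pnpm lint  # eslint + prettier

## Conventions
- Server components by default; add "use client" only when needed.
- All data access goes through src/dal/, never fetch in components.

## Gotchas
- Tests must run with TZ=UTC or the date snapshots fail.
&lt;/code&gt;&lt;/pre&gt;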

&lt;p&gt;Custom slash commands let you define reusable actions for your project. Things like &lt;code&gt;/fix&lt;/code&gt; for targeted debugging, &lt;code&gt;/test-service&lt;/code&gt; to generate tests following your patterns, or &lt;code&gt;/pr-review&lt;/code&gt; to check code against your team's standards.&lt;/p&gt;
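&lt;p&gt;A custom command is just a markdown file in &lt;code&gt;.claude/commands/&lt;/code&gt;. As a rough sketch, a hypothetical &lt;code&gt;/fix&lt;/code&gt; command could live in &lt;code&gt;.claude/commands/fix.md&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Fix the bug described below. Steps:

1. Reproduce it with a failing test first.
2. Make the minimal change that makes the test pass.
3. Run lint and the full test suite before finishing.

Bug report: $ARGUMENTS
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Typing &lt;code&gt;/fix login form crashes on empty email&lt;/code&gt; substitutes everything after the command name into &lt;code&gt;$ARGUMENTS&lt;/code&gt;.&lt;/p&gt;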

&lt;p&gt;Skills go a step further. They're knowledge packages that teach Claude how a specific part of your codebase works: your DAL patterns, your testing conventions, your component architecture. Commands tell Claude what to do. Skills teach Claude how to think about your code.&lt;/p&gt;
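&lt;p&gt;A skill is a folder containing a &lt;code&gt;SKILL.md&lt;/code&gt; with a short frontmatter block. Here's a sketch of a hypothetical DAL skill (check the skills docs for the exact frontmatter fields):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;---
name: dal-patterns
description: How data access works in this repo. Use when touching src/dal/.
---

All queries go through repository classes in src/dal/.
Each repository exposes findById, findMany, create, update.
Never import the database client directly in a component.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The description is what Claude uses to decide when the skill is relevant, so it pulls in the full instructions only when the task actually touches that area.&lt;/p&gt;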

&lt;p&gt;Then there are hooks. They let you automate actions based on events. For example, you can run your linter automatically after every edit, so issues get caught before you even look at the code. No manual step, no forgetting.&lt;/p&gt;
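&lt;p&gt;That lint-after-edit hook looks roughly like this in &lt;code&gt;.claude/settings.json&lt;/code&gt;. This is a sketch from memory, so verify the schema against the hooks documentation; the eslint command is just a placeholder for whatever linter your project uses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx eslint --fix ." }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;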

&lt;p&gt;There's also an official GitHub Actions integration. Mention &lt;code&gt;@claude&lt;/code&gt; in a PR comment and it reviews your code, suggests fixes, or creates commits following your project's CLAUDE.md standards. Tag it in an issue and it can create a branch with the implementation. Everything you've set up locally carries over to your CI/CD pipeline.&lt;/p&gt;
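&lt;p&gt;The setup is a small workflow file, something along these lines. This is a hedged sketch based on the official &lt;code&gt;anthropics/claude-code-action&lt;/code&gt;; double-check the trigger events, version tag, and input names against its README before using it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: Claude
on:
  issue_comment:
    types: [created]

jobs:
  claude:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The action itself watches for the &lt;code&gt;@claude&lt;/code&gt; mention in the comment body, so you don't need to filter for it in the workflow.&lt;/p&gt;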

&lt;p&gt;All of this is version controlled and shared with the team. Cursor has recently added its own rules, commands, and skills system. It's catching up, but Claude Code's ecosystem is more mature, with deeper hierarchies, hooks, and a larger community building on top of it. If you want to see what's possible, check out the &lt;a href="https://docs.claude.com/en/docs/claude-code/skills" rel="noopener noreferrer"&gt;official Claude Code documentation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenCode &amp;amp; Oh-My-OpenCode
&lt;/h2&gt;

&lt;p&gt;So far I've been comparing Cursor and Claude Code as they come out of the box. But there's an open-source ecosystem that supercharges the whole experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/opencode-ai/opencode" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; is an open-source AI coding agent with over 93,000 GitHub stars. It supports Claude, GPT, Gemini, and local models, so you're not locked into one provider. Even if you only use Claude today, Oh-My-OpenCode makes it worth the switch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/code-yeongyu/oh-my-opencode" rel="noopener noreferrer"&gt;Oh-My-OpenCode&lt;/a&gt; is an orchestration layer on top of OpenCode, basically Oh-My-Zsh for AI coding. Instead of one agent doing everything sequentially, it spins up specialized agents that work in parallel. One agent can research documentation in the background while another explores your codebase and the main agent implements the feature you asked for. All agents run in parallel and work independently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As of January 2026, Anthropic has been tightening restrictions on third-party tools using Claude subscriptions. OpenCode works for now, but you may need workarounds if things change.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Note on Editors
&lt;/h2&gt;

&lt;p&gt;Claude Code has a VS Code extension, but I found the raw terminal experience more reliable. Image uploading was buggy and the file picker was noticeably slower than the CLI. Not dealbreakers, but enough to make me stick with the terminal.&lt;/p&gt;

&lt;p&gt;I also tried Zed with Claude integration. It's promising but not there yet. No persistent chat history for external agents (&lt;a href="https://github.com/zed-industries/zed/issues/37074" rel="noopener noreferrer"&gt;zed issue #37074&lt;/a&gt;), so conversations vanish when you close them. You can't run Claude's slash commands like &lt;code&gt;/resume&lt;/code&gt; through Zed (&lt;a href="https://github.com/zed-industries/zed/issues/37719" rel="noopener noreferrer"&gt;zed issue #37719&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;My recommendation? Just use Claude Code (with OpenCode and Oh-My-OpenCode) in a terminal pane alongside your editor. Simple, reliable, no integration bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Best of Both Worlds
&lt;/h2&gt;

&lt;p&gt;Here's what I actually do: Cursor Pro ($20/month, usage-based billing turned OFF) plus Claude Code Max ($100/month). Total: $120/month.&lt;/p&gt;

&lt;p&gt;I use Cursor for tab completion and quick inline edits. The Fusion/Supermaven model is still the best in the business for autocomplete, and on the Pro plan with usage-based billing disabled, it's unlimited. For small changes and rapid iteration, it's perfect.&lt;/p&gt;

&lt;p&gt;I use Claude Code for everything else, through OpenCode with Oh-My-OpenCode. Multi-file refactors, architecture decisions, debugging, test generation. Oh-My-OpenCode spins up multiple agents working in parallel, so research, implementation, and verification happen at the same time instead of one after another.&lt;/p&gt;

&lt;p&gt;There's no conflict. Cursor is the IDE, Claude Code is the brain. One handles the keystrokes, the other handles the thinking. They complement each other perfectly.&lt;/p&gt;

&lt;p&gt;$120 a month for the best of both worlds versus ~$500 a month for Cursor with Opus alone. That's a 75% savings.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;There's more to Claude Code than I've covered here. The skills system goes deep, hooks and multi-agent workflows open up possibilities I'm still discovering, and CLAUDE.md keeps surprising me with how much it changes the daily workflow.&lt;/p&gt;

&lt;p&gt;I'm thinking about writing a follow-up on getting the most out of Claude Code. How to structure your CLAUDE.md, how custom skills can automate your team's workflow, and how OpenCode with Oh-My-OpenCode can supercharge your setup. Let me know if that's something you'd find useful.&lt;/p&gt;

&lt;p&gt;What's your setup? Still on Cursor? Tried Claude Code? Using something else entirely? I'd love to hear what's working for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://akrom.dev/blog/cursor-to-claude-code" rel="noopener noreferrer"&gt;akrom.dev&lt;/a&gt;. For quick dev tips, join &lt;a href="https://t.me/akromdotdev" rel="noopener noreferrer"&gt;@akromdotdev on Telegram&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cursor</category>
      <category>claude</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
