Forem: Dmitry (Dee) Kargaev

I was a half-builder

Dmitry (Dee) Kargaev — Thu, 07 May 2026 22:44:52 +0000

I have thirteen public repositories on GitHub.

Three of them are real products.

The rest are half-shipped: interesting starts, side-quests, idea-shaped objects with a README and a pushed_at date and not much past it. Universal-codemode: clean idea, two demos, no users I can name. Vasted: works on my machine, never advertised, never used by anyone who isn't me. Smart-spawn: model router, never wired into anything I run daily. Mcclaw: Mac LLM checker, fun side build, abandoned at v0.2. Moltedin: a marketplace I sketched and walked away from. Lobster-tools. Tldr-club. Clawbot-blog.

I built fast. I shipped half. I posted screenshots.

That's the dominant mode on AI-builder X right now and I want to write the post about it as someone caught inside it, not above it.

The Builder.ai version

The loud version of this is Builder.ai.

The pitch was an AI named Natasha that built apps from a single sentence. Microsoft believed it. SoftBank's DeepCore believed it. The Qatar Investment Authority believed it. About $450M of capital believed it.

Behind the AI: 700 human engineers in India and Eastern Europe.

By 2024 the investigations had landed. Bloomberg. WSJ. The Information. By May 2025 the company was filing for insolvency, Microsoft and the creditors were inside the building, and "Builder.ai" had become culture-wide shorthand for AI-washing. Strap "AI" to a labor product, raise nine figures, ride the cycle until the cycle catches up.

That's the loud version of the pattern.

The quiet version is on your X feed every day, and it's not committing fraud. It's people shipping the half they can ship and calling it the whole. That's what I've been doing.

What a half-builder actually is

Tighter than "doesn't ship":

A half-builder is an operator who can do exactly one half of design-to-deploy, then skips the other half by simply not showing it. They post the artifact for their good half. The bad half is implied to exist. It usually doesn't.

There are three failure modes and I've personally lived all three.

The designer who can't code. Posts the Figma. Posts the AI-generated mock. Posts the screenshot, the concept, the "what if I built this?" thread. Never posts the running URL. The "build" is a frame around an image. I did this for years before I learned to ship.

The coder who can't design. Posts the diff. Posts the gist. Posts the prompt. The thing technically runs but you wouldn't keep it open for more than a session. The interface is a textarea and a <details> tag in Helvetica. I've published a few of these too. I called them "tools."

The either who can't ship. The most common failure mode by an order of magnitude. They can do their half competently. They can't deploy it, can't keep it up, can't onboard a single user, can't reach week two. Six demos a month. Zero products. The artifact dies in a screenshot.

The third failure mode is the one I've spent the most time in. I'd build a thing in a weekend, push it to a public repo, post a screenshot, get a few likes, and move to the next thing on Monday. I called that "shipping." It wasn't. It was sketching in public.

In all three modes the AI is real. The thing posted is real. Something got built. What didn't happen was building the whole thing. The half that wasn't shown was fake, missing, on someone else's calendar, or a TODO that never got picked up again.

That's a half-builder.

Why half-building is the default

It's not a personal failure. It's the structure of the industry for twenty years.

Design and engineering have been culturally separated since the early-2000s web. You picked a side at 22. The side trained you. Designers learned visual systems, components, motion, brand. Engineers learned data structures, infra, deployment, latency budgets. The handoff was the deliverable. Each side optimized for being good at their half, because their half was the whole job.

AI is collapsing that gap.

Every tool that closes the design-to-code distance (Figma-to-code generators, coding assistants, no-code with escape hatches, full-stack agents) pays out to operators who hold both sides in one head. The premium isn't on either half anymore. It's on the seam.

Twenty years of single-side specialization don't unwind in a hype cycle.

So the dominant cohort on AI-builder X is exactly who you'd expect. People whose career was built around being competent at one half. Learning AI in real time. Posting the half they can already do. Hoping the AI bridges the rest.

Sometimes it does. Most of the time it doesn't. The shipped product never appears. The next thread does.

I've been on this side of the timeline for years. Designers who became "builders" the day GPT-4 dropped. Engineers who became "AI engineers" the day Cursor got good. I'm one of them. The honest answer is that AI made it embarrassingly easy to look like a whole-builder while staying a half-builder underneath.

Builder.ai was that, with a $450M check on top.

What I've actually shipped (and what I've half-shipped)

Here's the honest receipts list. Not the highlight reel.

Real products people use:

Dory. Shared memory layer for AI agents. Local-first, markdown source of truth, CLI / HTTP / MCP native. Open-source on GitHub, has actual users, gets actual issues filed. This is the only one I'd call run-grade.
deeflect.com. Personal site, in production, anchors my entity online.
blog.deeflect.com. Thirty-one published articles. Some of them are good. Not all of them are from this year, that was overstated in earlier drafts of this essay.
dee.agency. Solo studio site, productized AI work.
Don't Replace Me. Survival book on the AI apocalypse, paperback, hardcover, Kindle, on Amazon. Written end-to-end. People are reading it.
The SEO-to-GEO Gap. First research paper, accepted and posted on SSRN this month with a real DOI. First peer-review-adjacent credential I've ever earned.

Half-shipped:

ViBE. Twitter-based reception benchmark across 22 frontier AI model families, 2,965 judged mentions, $1.92 in judge cost. I love the writeup. I keep pitching the writeup. The benchmark itself is dogshit as a continuous product. It's a one-shot artifact, not a living thing, and treating it like a flagship was me confusing "interesting research" for "shipped product."
Universal-codemode. Two tools that replace hundreds. Clever. Not used.
Vasted. GPU-inference one-liner. Works. Unadopted.
Smart-spawn. Model router. Demo grade.
Castkit. CLI demo recorder in Rust. Cute. Sat down.
Mcclaw. Mac-LLM checker. Fun. Abandoned.
Moltedin / lobster-tools / tldr-club / clawbot-blog. Different shapes, same pattern. Started, posted, walked away.

The actual range underneath all of it:

Fifteen years of design. A cybersecurity bachelor. Firmware on ESP32 and marauder builds when the topic shifts. Designed for VALK across 70-plus financial institutions and 15 countries before walking out of that role earlier this year. Russian-born, lived across five-plus countries. ADHD wired enough to learn shit in a week and bored enough to walk away from it in a month.

The range is real. The shipping discipline isn't there yet.

In October 2025 I burned out and quit X for six months from a 200K-impressions-a-day peak. I'm reactivating from 640 followers as I write this. The list above is what got built around the crash year: three real products, a book, a paper, a personal entity I can point to, and a graveyard of clever half-things.

That's the honest picture. I'm a recovering half-builder.

The opposite cohort

The opposite of a half-builder is a whole-builder.

A whole-builder is one operator who covers design + code + AI + deploy + distribution end-to-end with no handoff. They pick fewer fights. They keep the artifacts alive past launch week. They have repos with users in the issue tracker, not just stars in the corner.

Pieter Levels is the canonical example. Design, code, deploy, distribute, monetize, all solo, all in public, receipts measured in MRR and screenshots. Marc Lou ships products with full visual identity attached. Theo runs an entire product line out of what he can hold in one head.

These aren't unicorns. They're the rarer category: operators who didn't pick a side and built their working pattern around not having a handoff. They're also the operators who said no to the next side-quest and kept the last one running.

I've copied the breadth half of that pattern. I haven't copied the discipline half. Whole-building isn't about doing more. It's about doing fewer things further. That part I'm still learning.

How to spot a half-builder (mirror included)

Most "AI builders" on the internet right now are half-builders, and most of us know which side we're on if we're honest about it.

The test is mechanical. It costs nothing. Run it on every "AI builder" account in your timeline this week, and on yourself.

Ask for the running URL. Not the prompt. Not the screenshot. Not the demo video. The URL someone else can open right now, on their phone, with no auth, no waitlist. If they can't produce one, you're talking to half a builder.

Ask for the repo. Public repo, last commit recent enough to matter, an issue tracker that isn't a ghost town. If "the code is private", fine. Ask for the deployed product. If neither exists, you have your answer.

Ask what they shipped this month. Not last year. Not "in their career." This month. Half-builders ship demos. Whole-builders ship products that someone else is using on a Tuesday morning.

If you ran that on me a month ago, you'd hear about ViBE and a clever Rust thing and a model router and a half-finished benchmark and a launch I almost did. You'd hear about everything except a product someone else opened on a Tuesday. The honest answer would have been Dory, and maybe the blog, and the rest is noise.

Show the repo or sit down, including the one I'm pointing back at when I write that.

Stopping

The exit from being a half-builder is mechanical, not mystical.

Pick the half you can't do and start doing it badly until you can do it. Designers shipping their first deploy. Coders learning visual hierarchy. Either learning distribution. The half you can't do isn't a personality. It's a backlog.

Pick fewer things. Keep them alive past the first week. Treat "shipped" as "someone else used it on a Tuesday," not "pushed to GitHub on a Sunday."

Whole-building is a slow accumulation of the second half by the first, until the seam disappears. None of that happens in a single weekend.

This essay is the first move. The next moves are: Dory gets the maintenance it deserves. ViBE either becomes a continuously-updating thing or gets retired honestly as a one-shot paper, not pretended into a flagship. The agency stops being a placeholder. The next side-quest waits its turn, or doesn't get started.

I'm writing this with the same uncertainty most of you feel scrolling past it. Am I the half-builder? Probably. What does the turn look like? Like this.

Build the whole thing.

Ship the running URL.

Show the repo.

Or sit down, including me.

That's the post.

Sources for the Builder.ai facts: Bloomberg's investigation into the company's engineering operations (2024), the Wall Street Journal's coverage of the May 2025 insolvency, and *The Information's reporting on the human-engineer back-end. Public, well-indexed; current URLs available via search.*

I've Touched Everything and Mastered Nothing

Dmitry (Dee) Kargaev — Mon, 06 Apr 2026 20:28:35 +0000

Seventeen years. That's how long ADHD has been making me touch every skill, hobby, and career path that crossed my path. I'm 30 now. I've lived in eight countries, built products in five programming languages, shipped code on four blockchains, released music on Spotify under an artist name I genuinely cannot remember, and learned enough Vietnamese to haggle at a market in Nha Trang.

I am not world-class at any of it.

That's the honest version of this story. Not the LinkedIn version where "my diverse background gives me a unique perspective." The real version, where I've spent a decade and a half chasing dopamine across every domain imaginable and I'm only now figuring out what that actually means for a career, an identity, and a life.

What the ADHD cycle actually looks like across every skill, hobby, and career path

The cycle runs on roughly a two-week clock. Something new appears - on Twitter, in a YouTube rabbit hole, in a Discord I shouldn't be in at 2am. The dopamine hits immediately, before I've done anything. I'm already planning the next phase while I'm still in the first hour.

Then comes the manic stretch. Twelve-hour sessions at the computer, deep in documentation and GitHub repos and forum threads from 2015 that nobody else has read. I still eat well, still sleep well, still train. Health is the one thing I never let slide, no matter how deep the rabbit hole goes. But everything else disappears.

Two weeks later, sometimes less, I wake up and it's gone. Not burnout. Burnout has texture - exhaustion, resentment, a desire to rest and come back. This is nothing. Flat affect toward something that consumed me completely three days ago. The interest didn't fade. It evaporated.

This has been my entire adult life. Before that, too - graffiti, parkour, long-distance running, acrobatics, all before I owned a computer. I fixed a relative's MS-DOS machine in English when I was around eight and barely spoke English. I rigged our home phone line to connect to a friend's LAN across the city. I built a radio to intercept our wireless home phone so I could eavesdrop on my mom's calls.

I was never bored. I was always building something I'd abandon.

What I didn't understand at eight, and only started to understand around 27, is that this isn't a moral failing. It's the shape of how my brain processes novelty. The dopamine system in ADHD brains responds to new stimuli harder and drops off faster than neurotypical brains. I wasn't undisciplined. I was running a biological process I had no name for.

Knowing that doesn't stop the cycle. But it changes how you relate to the wreckage.

The graveyard is real

Let me just list some of it.

A DJI drone I used four times. A Sony DSLR that mostly lives in a bag. A 3D printer I operated twice, let collect dust for six months, and sold at a loss. A soldering station I learned to use at university, felt was essential to own again years later, and which currently sits in my wardrobe untouched. A ukulele from Turkey. Guitar - I can play a few songs. Piano - I can play Kanye's Runaway and that's it. Harmonica. Stylophone. Otamatone. An AKAI MPK Mini. Ableton. AI music with Suno. I released lofi tracks on Spotify. I cannot remember the artist name.

Stocks in Russia via an app. Mostly broke even, maybe lost a bit. Cybersecurity hardware - a Flipper Zero, a Kali Linux laptop, ESP32 boards with Marauder firmware, an ESB dongle for intercepting TPMS sensors, and an M5Stack kit with a pile of controllers and sensors. Cardputer. LLM630. Tiny screens, weird modules, more little boards than I had any reason to own. I spent weeks flashing firmware and scanning radio frequencies. Did I build anything useful? No. Make money? No. But I understood how your wireless water meter broadcasts unencrypted data to anyone with the right hardware, and that felt worth it.

None of this is sustainable. The "ADHD is a superpower" content you see everywhere stops before this part. The graveyard. The money spent. The projects half-finished. The domains where I got to "good enough to be dangerous" and never further, because the dopamine was already somewhere else.

This pattern has been running long enough that it stopped feeling surprising a while ago.

The financial reality nobody mentions

I want to be concrete about this because most "ADHD and creativity" content is vague about costs.

The 3D printer was around $400. Sold for $180. The drone was $700. Used it maybe ten hours total. The cybersecurity hardware - Flipper, M5Stack, the various ESP32 boards, cables, accessories, random modules - probably $600 spread across three months. The Ableton license. The AKAI. The ukulele I bought in a market in Istanbul because I was in a hyperfocus phase about lo-fi music production.

I don't have an exact number. But it's real money. And this doesn't count the opportunity cost of the hours - hundreds of hours flashing firmware on microcontrollers that never produced anything, learning guitar in 30-minute sessions spread across five years, studying Vietnamese for six weeks before the next interest hit.

I'm not saying this to make it sound worse than it is. I'm saying it because the "embrace your ADHD divergent thinking" content leaves this part out and I think it's dishonest. The breadth has real costs. The search has a price.

When it does stick

Some things stuck. Design, 17 years. The gym, 15+ years. AI, three years and accelerating.

I started freelancing at 14, selling forum banners and GIFs on ICQ. By university I was building websites for roughly $80-120 each in local-currency terms at the time. When I graduated, companies were offering something like $150 a month. I was already making more than that per site. I said no thanks and kept going. Design is probably the closest thing I have to real expertise after 17 years - I spent five of those years as the solo senior product designer at VALK, a fintech platform that moved $4B+ in deals across 70+ financial institutions in 15 countries. Awards. Real scale. Products that live in actual banks.

The gym stuck because I started before I knew what consistency meant and just never stopped. Different approaches over the years - bodybuilding, strength, cuts, boxing for two years in Krasnodar - but the baseline never dropped. I track bloodwork every few months and stay on top of health like it's another system to tune. When something becomes infrastructure instead of a project, the ADHD can't kill it.

AI stuck because it's the first domain that feeds the obsession cycle faster than the cycle drains it. Every week there's something genuinely new. You can't get bored because the field won't hold still. I built one of the first agentic loops I'd seen anyone build in 2022 - a Telegram bot for a crypto community that could reason and take actions, before "agentic" was even a term. I didn't know what I was building. I just thought it was interesting.

Now I let AI maintain my second brain - knowledge, reminders, loose thoughts, follow-ups, personal assistant type shit. Less "look at my agent stack," more "I built something that remembers what I forget and keeps my life from scattering." I also host local models on a Mac Mini because of course that became another obsession too - if something can run on my own box, I want to try it. It's not a demo. It runs every day. You can read more about how I approach coding and multi-model workflows if you want the technical side.

Why some things survive the cycle

I've thought about this a lot. What makes design and the gym persist when everything else evaporates?

Two patterns show up every time.

First: external feedback loops that don't require internal motivation. The gym gives me bloodwork numbers, strength PRs, photos over time. Design gives me client feedback, shipped products, metrics. When my internal interest flags, there's still external data pulling me back. The drone had none of that. The guitar had none of that. They only worked when I was actively excited, and when the excitement went, nothing remained.

Second: the domain kept changing fast enough to feed new obsession cycles. Design went from print to web to mobile to design systems to AI-generated UI. Every three years there was a new layer to get obsessed about. The gym went from machines to compound lifts to programming to bloodwork optimization. The domain regenerated novelty before I burned through it.

AI has both properties at a ridiculous level. New model every month. New architectural pattern every six weeks. New tool category every quarter. It's basically a purpose-built trap for an ADHD brain and I walked straight into it.

How ADHD shapes every skill, hobby, and career path you choose

Here's what I've realized after 30 years of this: ADHD doesn't just affect how you learn things. It determines what career paths even become available to you.

I have an Information Security bachelor's that I haven't used professionally once. But the two years I spent in that program gave me networking fundamentals, an understanding of cryptography primitives, and enough systems-level thinking that when blockchain showed up in my life it wasn't foreign territory. That "useless" degree became the reason I could evaluate Solidity code for security issues at VALK without being a dedicated security engineer.

The career I've actually had - designer, then product lead, then AI engineer - looks like three different careers. But they're the same brain solving the same problem at different layers of abstraction. Design is about modeling user mental states. Product is about modeling system interactions. AI engineering is about modeling agent behavior. I didn't pivot three times. I drilled down.

That's the ADHD career path nobody maps: not a straight line, not even a zigzag, but a spiral. You come back around to the same core problems from different angles. The angle changes. The problem doesn't.

The generalist discount problem

No job posting says "wanted, someone who can do a little bit of everything."

Generalists get discounted by hiring processes designed for specialists. I have production code in TypeScript, Python, Rust, Solidity, and SQL. Deployed products on four blockchains. Sysadmin experience across every OS since childhood. Enough crypto experience to have launched 20+ tokens. On a resume this looks scattered. In a specific situation - say, a 48-hour sprint where a client needs a smart contract, a minting frontend, and custom illustrations - it's exactly what's needed and nobody else in the room has all three.

When VALK wanted a Christmas NFT campaign I told them I'd handle it. Smart contract, minting site, illustrations, the whole frontend - solo, in one sprint. That's not something a specialist does. That's an ADHD brain that collected 12 different surface-level skills over a decade, all converging on one afternoon's work.

The problem is you can't interview for that. "I know a little about a lot" doesn't clear an ATS. So the career path for someone like me had to go around traditional hiring - freelance, founding roles, solo building. Places where the breadth shows up in delivered work rather than credentials.

The breadth problem and the breadth premium

This year I wrote a 33,000-word book from scratch. I'd never written a book before. Don't Replace Me: A Survival Guide to the AI Apocalypse - 24 chapters, formatted, cover designed, published on KDP, audiobook via ElevenLabs, SEO landing page with schema markup, Amazon ads. Not because I wanted to be an author. Because the process of building the whole machine was interesting to me.

The ADHD didn't stop me from finishing a book. It kept me engaged long enough to finish one because there were 15 different new processes to learn inside the single project. The writing was interesting for the first 10,000 words. Then the Kindle formatting was interesting. Then the audiobook pipeline. Then the schema markup. By the time I finished, I had a complete book, a production process, and had learned four skills I didn't have before.

This is the pattern. I don't go deep on one thing. I go wide enough that when a project needs five different skills, I'm the one person in the room who can cover them all. The breadth is a premium in those moments - genuine leverage that a specialist can't replicate without a team.

The honest version though: those moments don't come every day. Most days, the breadth just means I know enough to be frustrated by problems in domains where I don't know enough to solve them.

Why the AI age changes this calculation for ADHD builders

AI is doing something weird to the generalist problem. On one side, everyone's a generalist now. Vibe coding means your non-technical friend can ship a landing page. The moat of "I know five programming languages" is basically gone.

On the other side - and this is the part I actually believe - breadth becomes more valuable when AI amplifies execution. You don't need to be an expert Rust developer. You need enough Rust to direct an AI agent doing Rust work. You need to know when the output is wrong. You need the taste to recognize what "good" looks like without being able to produce it from scratch at 100 words per minute.

I spent 17 years accumulating surface-level knowledge across maybe 30 domains. That knowledge doesn't help me compete with a specialist in any one of them. But it means I can look at an AI's output in almost any domain and tell you whether it's right. Design intuition applied to code review. Crypto chaos tolerance applied to agentic system failures. Sysadmin muscle memory applied to Docker containers. The ADHD brain that couldn't go deep now has a use case that rewards breadth.

This isn't hypothetical. I use dee.ink - a collection of 31 Rust CLI tools I've built for AI agent workflows - as a practical example. Building those tools required Rust, CLI design, documentation, packaging, and enough understanding of AI agent needs to spec useful primitives. Same with the home lab stuff - a ZimaBoard running home server experiments, local infra, and smart home services because once I touched self-hosting I obviously had to touch that too. A specialist Rust engineer would write better Rust. A specialist AI engineer would understand agent needs more deeply. A specialist infra person would build a cleaner home lab. But I could build the whole thing myself, end to end, without waiting on anyone.

That's the AI-era argument for the ADHD generalist: you're not competing on depth anymore. You're competing on range of judgment. And AI is making range of judgment the bottleneck, not depth of execution.

If you've felt this same tension - the generalist guilt, the half-finished projects, the identity question of what you even "are" professionally - I wrote more about navigating ADHD and AI as actual compensation tools, not the productivity-porn version.

The identity question ADHD every skill, hobby, and career path creates

Here's the thing nobody talks about with the ADHD generalist lifestyle: you don't know what you are.

Ask a specialist what they do and they tell you in one sentence. "I'm a backend engineer." "I'm a product designer." The identity is clean. The work is legible to other people.

Ask me and I have to make a choice about which version of myself I'm presenting. The designer with 17 years? The AI engineer building multi-agent systems? The guy who spent three months obsessively learning about RF signal interception for no professional reason? All of these are equally true. None of them is the whole answer.

For a long time this felt like a problem. I'd look at people with clear professional identities - engineers who'd been doing one thing for ten years, designers with a coherent portfolio narrative - and feel like I was faking it. Like my breadth was evidence of some underlying lack of commitment.

What I've landed on at 30 is different. The identity question isn't a problem to solve. It's a feature of a specific type of brain operating at full capacity. The discomfort of not fitting a category is the cost of not being constrained by one.

I'm not a designer who learned to code. I'm not an engineer with design skills. I'm something that doesn't have a clean job title yet, that probably couldn't exist before AI made it possible to execute across domains without a full team. That's not a failure of self-definition. It's just early.

What actually sticks at 30

The gym. Design. AI.

And maybe that's the real pattern. The things that stuck aren't the things I chose. They're the things that were still there after the dopamine moved on. The gym was still there because I'd been going long enough it became automatic. Design was still there because clients kept paying me. AI is still there because it keeps generating new problems faster than I run out of interest.

What I've stopped doing is chasing the feeling of the early phase - that first-week intensity when everything seems possible and you're learning at maximum speed. That feeling always ends. The question isn't how to keep the feeling. It's what you're building during it.

What I actually think about it at 30

The "ADHD superpower" narrative is mostly incomplete. It's real but it stops too early. Yes, the hyperfocus is incredible. Yes, the breadth compounds in ways specialists can't replicate. Yes, the manic phases produce more in two weeks than most people produce in two months.

But the depressive phases are the tax. The days where you look at a project you were obsessed with last week and feel nothing at all. Not tired. Not frustrated. Nothing. The graveyard of equipment bought and abandoned is real money. The projects with twelve active tabs and zero generating revenue are real.

I've been through every single phase of this across eight countries and more career pivots than I can accurately count. I still think I'll find the thing. Maybe design and AI are already it and I just haven't accepted that yet. Maybe something I haven't encountered will show up and replace everything I've built my identity around. Both feel equally plausible from the inside.

What I know is this: at 30, having touched more domains than most people touch in a lifetime, being functionally average at most of them and arguably expert at two - I'm not embarrassed by any of it. The search wasn't failure. The search was the work. Everything compounds in ways you can't predict when you're in the middle of a two-week obsession with intercepting radio signals from water meters.

If you want to see what the current obsession looks like in practice - the AI engineering side, the multi-agent systems, the actual tools - the about page has the full context. And if you're building something similarly scattered and solo, the tags page will probably surface something relevant.

If you're in the middle of your own version of this - the cycle, the graveyard, the guilt about the 3D printer sitting in your closet - you're probably not broken. You're probably just still searching.

That's an okay place to be.

So I asked AI to list my skills

Based on everything I've touched, built, learned, half-learned, abandoned, revived, bought, sold, shipped, and somehow got paid for, I asked AI to write out my skill set in one paragraph.

It was a mistake.

Graphic design, product design, UI design, UX design, interaction design, interface design, web design, mobile design, dashboard design, platform design, application design, systems design, design systems, visual systems, component systems, branding, digital branding, visual identity, typography, layout, hierarchy, composition, spacing, iconography, illustration, digital illustration, vector illustration, marketing design, motion graphics, banner design, avatar design, forum graphics, landing page design, cover design, presentation design, pitch deck design, onboarding design, checkout flow design, conversion design, user flows, journey mapping, wireframing, prototyping, information architecture, UX strategy, usability thinking, interface critique, product thinking, product strategy, feature prioritization, product packaging, product positioning, product communication, client communication, stakeholder communication, design leadership, solo product ownership, freelance design, agency design, startup design, enterprise product design, fintech product design, dashboard UX, enterprise workflows, white-label platform design, workflow design, complexity reduction, visual clarity, frontend design, frontend implementation, HTML, CSS, responsive design, JavaScript, TypeScript, React, component libraries, design-to-code translation, web app implementation, rapid prototyping, interface implementation, code editing, debugging, code reading, AI-assisted coding, product-minded engineering, scripting, Python, Rust, SQL, Solidity, command-line tooling, CLI product thinking, developer tooling, automation scripting, API integration, webhook logic, backend glue code, database basics, schema instincts, debugging AI output, debugging code, debugging workflow failures, prompt engineering, prompt iteration, prompt structure, context design, tool calling, agent orchestration, multi-agent workflows, AI workflow design, AI product design, AI UX, AI-assisted writing, AI-assisted coding, AI-assisted research, AI tool evaluation, model comparison, reasoning model usage, local model usage, cloud model usage, local LLM setup, memory systems, second-brain systems, reminder systems, retrieval systems, RAG, embeddings, semantic search, vector search, AI assistant design, assistant workflow design, personal AI systems, research pipelines, writing pipelines, content pipelines, synthesis pipelines, information capture, note systems, knowledge systems, crypto product design, Web3 product design, smart contracts, Solidity workflows, token launch mechanics, NFT launch mechanics, minting flows, DeFi UX, blockchain UX, wallet UX, crypto campaign execution, crypto community operations, crypto marketing, launch coordination, presale mechanics, token website building, contract deployment understanding, onchain product instincts, community bot building, Telegram bot building, automation design, workflow automation, n8n-style orchestration thinking, research automation, content automation, digital marketing, internet marketing, social media growth, Instagram page growth, landing page copy, copywriting, headline writing, article writing, blog writing, long-form writing, editing, rewriting, draft development, AI-draft cleanup, humanization, publishing workflows, book writing, book formatting, self-publishing, KDP publishing, metadata writing, SEO, GEO, search intent mapping, keyword targeting, internal linking instincts, schema markup thinking, authority building, distribution strategy, launch strategy, publishing systems, website management, domain setup, CMS-light publishing, self-hosting, home server experimentation, Docker, DNS, reverse proxy basics, infrastructure curiosity, service setup, local infra experimentation, smart home experimentation, device setup, system setup, macOS setup, Windows setup, Linux setup, terminal usage, shell comfort, firmware flashing, hardware experimentation, embedded-device tinkering, cybersecurity basics, information security fundamentals, cryptography fundamentals, systems thinking, network instincts, radio experimentation, wireless experimentation, sensor experimentation, hardware debugging, hardware setup, game server hosting, home PC hosting, monetization instincts, digital hustle instincts, app install arbitrage, e-commerce experimentation, dropshipping experimentation, pricing instincts, sales instincts, client acquisition instincts, offer shaping, agency operations, productized service instincts, open source contribution, release management, tool publishing, music experimentation, music production, AI music workflows, Ableton experimentation, audio arrangement instincts, basic piano, basic guitar, basic ukulele, basic harmonica, stylophone experimentation, otamatone experimentation, creative direction, aesthetic judgment, visual taste, naming instincts, concept development, trend detection, trend synthesis, pattern recognition, fast learning, context switching, parallel execution, obsessive research, rabbit-hole depth, ambiguity tolerance, pressure-driven shipping, solo building, independent execution, figuring things out with incomplete information, reverse engineering workflows, surviving bad documentation, adapting to broken tools, evaluating software fast, comparing tools quickly, stack assembly, stack migration, no-code experimentation, low-code experimentation, API-first thinking, browser tooling, workflow compression, research summarization, synthesis, memory capture, memory retrieval instincts, file organization attempts, chaos-tolerant organization, async collaboration, self-direction, self-teaching, self-reinvention, internet-native communication, pseudonymous building, online identity experimentation, public writing, authority building through shipping, cross-domain thinking, interdisciplinary synthesis, technical taste, product taste, marketing taste, creative taste, execution bias, strategic intuition, quality smell detection, visual QA, copy QA, product QA, issue isolation, error triage, launch QA, software evaluation, tooling adoption, rollout instincts, packaging instincts, distribution instincts, positioning instincts, and generally becoming competent enough to start, ship, fix, relaunch, and repurpose work across an unreasonable number of domains without ever sitting still long enough to make any of it feel normal.

Reading it felt less like a skills list and more like a forensic report.

I Published a Book. Here's Why.

Dmitry (Dee) Kargaev — Mon, 30 Mar 2026 01:20:43 +0000

I wrote a book. It's called Don't Replace Me: A Survival Guide to the AI Apocalypse, and it's available right now on Amazon in Kindle, paperback, and hardcover. The announcement for Don't Replace Me survival guide dropped March 28, 2026 - press release picked up through AP News and everything. 235 pages, 24 rules, 33,000 words.

Took me way longer than I expected and was nothing like I expected. Let me explain what it actually is and how it got made.

Why Don't Replace Me exists as a book and not just tweets

I've been getting a version of the same conversation for two years. Someone finds out what I do - AI engineering, building agents, 15 years in product design - and they get this look. Half curiosity, half dread. Then the question: "So... should I be worried? About my job?"

The honest answer is complicated. Not "no, you're fine" and not "yes, start learning Python immediately." The actual answer depends on what you do, how you do it, and whether you understand what AI is actually replacing versus what it's just augmenting.

The loudest voices on this topic tend to be at the extremes. Either AI builders who are close enough to the technology that the disruption looks like opportunity from where they're standing. Or people far from it who are doom-scrolling tech Twitter and convinced everything is gone. Both are wrong in ways that matter to normal people with real jobs.

I've sat on both sides of that divide. I spent five years designing products at VALK - financial infrastructure for 70+ banks across 15 countries, $4B+ in deals running through platforms I helped design. I was building for institutions. Then I pivoted to building the automation itself. Multi-agent systems, AI workflows, the actual pipelines that companies are now using to cut headcount or redeploy people.

I've talked to people on both ends. Executives excited about efficiency. Employees scared about what "efficiency" means for them personally. And I kept having the same thought: someone should write something honest about this. Not a hype piece, not a manifesto, not a beginner's guide to ChatGPT. Something from the middle.

So I did.

What Don't Replace Me is and what it isn't

Don't Replace Me is not a coding tutorial. It's not a prompt engineering guide. There are plenty of those and they're mostly aimed at people who already know what a terminal is.

This is for my mom. For my friend who does project management at a mid-size company and genuinely doesn't know if her job will exist in three years. For the copywriter, the customer service manager, the paralegal, the graphic designer who keeps seeing LinkedIn posts that make them feel like they missed the boat.

24 rules. Each one practical, specific, and grounded in things I've actually seen. Not "embrace AI" or "upskill your journey" or whatever LinkedIn thought leadership nonsense. Real career advice for people navigating a real transition.

The companion site is at dontreplace.me and it has a free AI threat assessment quiz. You answer questions about your actual job - what you do day to day, how structured the work is, how much it involves judgment vs.execution - and you get a risk score. Not fearmongering, not false reassurance. Just a calibrated read on where you actually stand.

Three formats available now: Kindle, paperback, hardcover. Audiobook is in production. If you've been waiting for audio, it's coming.

ASIN on Amazon is B0GTX4J124. Paperback ISBN is 9798253164594. Hardcover is 9798253165386. I'm mentioning these in case you're the kind of person who looks these things up. You can also check out the book's website at dontreplace.me for more info and the quiz.

How I made this and why I'm being upfront about it

I used AI to write this book. I'm not hiding that, and I'm not embarrassed about it. But it's not what people assume when they hear it.

I didn't type "write me a 235-page career guide" and hit enter. That's not how this worked.

What I did: I spent months collecting actual opinions. Notes from real conversations. Observations from years of watching companies get disrupted from the inside. Things that frustrated me. Things I genuinely believe. Specific examples from fintech, from AI engineering, from watching product design go through its own automation panic. I had a lot to say - just scattered across voice memos, Notion drafts, Twitter threads, and conversations I kept having with people who were scared.

All of that went into the process as fuel. The AI was the engine. The substance is mine.

The result reads like me because it is me. My takes, my experience, my perspective on what's changing and what isn't. I used every tool available to get it out of my head and into something usable. Which is exactly what the book tells readers to do - use AI as leverage, not as a replacement for your own thinking.

The irony is the point. I couldn't have written a better proof of concept.

And honestly, if you're going to argue that this makes the book less valid - that's a conversation worth having, and it's one of the conversations the book is designed to prompt. Because that reflex, the instinct to devalue work because AI was involved, is exactly the kind of thinking that will hurt people's careers over the next decade. The question isn't "did a human do every part of this." The question is: is the thinking real? Is the perspective genuine? Is it useful?

Yes, yes, and I think so.

What 15 years of watching tech cycles taught me about this one

I started freelancing at 14. I've watched design go through "designers will be replaced by templates." I watched developers go through "no-code will replace engineers." I watched writers go through "content mills will replace real writing." Some of that disruption was real. A lot of the fear was misdirected.

This cycle is bigger. I won't pretend otherwise. The things AI is genuinely good at - pattern recognition, synthesis, first-draft generation, handling repetitive decisions - overlap more directly with white-collar knowledge work than previous automation waves did. This isn't just factory floors and truck drivers. It's reaching into offices.

But the shape of the threat is different for different people. And most of the advice out there treats it like a monolith. "Learn to prompt" is not useful advice for a nurse. "AI can't replace human connection" is not useful advice for a data entry specialist.

What actually helps is specificity. Understanding which parts of your work are high-judgment versus low-judgment. Understanding where your industry is in the adoption curve. Understanding the difference between "AI will do this task" and "AI will make someone else more efficient at your job." The second one is actually the bigger risk and people underestimate it.

That's what the 24 rules get into. Specific, not vague. Grounded in real patterns I've watched play out, not hypotheticals.

If you want a preview of where my head is at on some of this, I've written about leaving fintech to build AI systems, about what ADHD and AI actually look like together, and about why most AI products have terrible UX - the gap between what builders understand and what real users need. The book is longer, more structured, and aimed at a wider audience, but those posts give you a sense of how I think.

Who Don't Replace Me is actually for

Specifically, concretely:

If you have a job and you're not sure whether to be worried, this is for you. Not to soothe you with "AI can't replace humans" and not to panic you into a six-month bootcamp. To help you actually assess your situation.

If you manage people and you're trying to figure out how to talk to your team about AI adoption without sounding like either a corporate automaton or someone who's out of touch, there's stuff in here for you.

If you're in a creative field - design, writing, marketing - and you've already felt the job market shift, there's a section of this that I think is genuinely useful. Not "here's how to compete with AI" but here's how to think about what you're actually selling.

If you already know about AI and want something to send to a family member who keeps calling you asking what to do - this is that thing.

The technical people are not the audience here. My coding stack post or the deep dive on debugging AI agents is more your speed. This book is for the people those systems are being built around.

Where to get it

Amazon has all three formats right now — grab it here or search "Don't Replace Me Dmitrii Kargaev."

The companion site is dontreplace.me - free quiz, takes about five minutes, gives you a real threat assessment based on your actual role. Even if you don't buy the book, the quiz is worth doing just to have a concrete picture instead of ambient dread.

Audiobook is in production. I'll announce it here and on my portfolio site when it drops.

If you read it and have thoughts - what landed, what you disagree with, who you think needs to read it - I want to hear that. Building in public means the feedback loop stays open. This isn't the end of the conversation, it's more like a long, structured version of it.

The people I had in mind when I was writing this are real. If you know someone who's been quietly scared about this stuff, send it to them. That's what it's there for.

SEO Is Dead? No. But the Game Changed.

Dmitry (Dee) Kargaev — Thu, 26 Mar 2026 00:00:00 +0000

I asked ChatGPT who I am. It had nothing. No idea I existed.

I've been deep in the AI space for a while now. I spent 5 years as lead product designer at a fintech platform serving 70+ financial institutions. I shipped 31 open-source Rust CLI tools. Published 13 blog posts. Built in public for months. And the models I use every day had zero record of me.

That bothered me more than it probably should have. But it also made sense once I started pulling the thread. Because "SEO is dead" isn't quite right - but something real is shifting, and I wasn't ready for it.

This is what I found.

The two games nobody told me I was playing

There's Google-findable. And there's AI-citable. I had one. I completely lacked the other.

Traditional SEO is about ranking. You write content, build links, optimize your pages, and Google surfaces you when someone searches. The pipeline is: search query → ranked results → human clicks through. That's the game most people know how to play.

Generative Engine Optimization - GEO - is about citation. AI generates an answer. Your content, your name, your entity gets referenced in that answer. The pipeline skips the click entirely. There's no blue link. The model just knows you exist, or it doesn't.

I had spent zero time thinking about the second one. Which meant I was completely invisible to the systems I use to do my actual work. The irony was genuinely annoying.

That gap is what sent me down a multi-week research rabbit hole that ended with an open-source platform list, a scoring system, and a website full of free tools. But I'm getting ahead of myself.

What's actually happening to search right now - and why "SEO is dead" keeps trending

The data isn't speculative anymore. BrightEdge tracked AI Overviews appearing in 11% of all queries in 2025. CTR is down 30%. People are asking AI assistants instead of searching, and when they do search, they're increasingly getting an AI-generated summary instead of ten blue links.

This isn't a prediction. It's already the baseline. The shift isn't coming - it happened while everyone was arguing about whether it would.

SparkToro's 2025 data puts this in concrete terms: top established brands appear in 55-77% of relevant AI responses. Unknown entities? 70x more volatile. You either have consistent presence or you have noise. There's not much middle ground.

The question stopped being "how do I rank?" and became "how do I get cited?"

What made this hit differently for me was trying to find myself across the major AI systems. Not just ChatGPT. I asked Claude, asked Perplexity, asked Google's AI Overview. The results were inconsistent in a way that told me something real: these systems aren't pulling from the same data, aren't weighting the same signals, and aren't resolving entities the same way. Being findable on one doesn't mean you're findable on all. That fragmentation is the part nobody's really mapped yet - including most of the GEO content I've seen so far.

SEO isn't dead - let's be honest about that

The title is provocative on purpose. Here's the actual nuance, because I think people deserve a straight answer rather than a hot take.

seoClarity's 2025 research found that 99.5% of AI Overview sources come from Google's top 10 organic results. Read that again. The AI is pulling from Google's rankings. Which means if you don't rank, you don't get cited. SEO is still the prerequisite. GEO builds on top of it.

So no, you shouldn't burn your SEO playbook. You should add a chapter.

What changed is the goal. Ranking #1 used to be the finish line. Now ranking #1 gets you in the candidate pool for AI citation. That's still worth doing. But it's not sufficient anymore. You can rank well and still be invisible to AI systems if you're missing the signals that models use to identify credible, citable entities.

Think about what that means practically. You could have a page ranking on the first page of Google for a competitive keyword - real traffic, real impressions - and still not get cited in an AI-generated answer for the same query. Because the ranking signals and the citation signals overlap but aren't identical. You need both. And most SEO workflows are only optimizing for one.

That's the shift. Not death. Evolution with a second layer that most people haven't started thinking about yet - including me, until very recently.

What GEO actually is and why the research convinced me

Generative Engine Optimization is the practice of optimizing your online presence to appear in AI-generated answers, not just search rankings.

The foundational paper here is Aggarwal et al., published at KDD '24 - you can read it on arXiv. They tested different optimization strategies across a range of queries and found that the right approach could drive up to a 40% increase in AI visibility. The top strategies weren't what I expected: citations and statistics outperformed most other approaches. Authoritative sourcing matters enormously to how models evaluate content.

Structured data, clear entity signals, and demonstrable expertise all feed into whether a model considers you citable. This isn't link juice. It's closer to reputation infrastructure - the stuff that makes a model "trust" that you're a real entity with real credentials.

What clicked for me is that this maps to how the models actually work. They don't crawl the web in real time. They learned from a corpus. And in that corpus, some entities are clearly defined, well-referenced, consistently mentioned. Others are noise. I was noise.

The content structure piece most people miss

The Aggarwal research gets into something most GEO content glosses over: it's not just what you publish, it's how you structure it.

Content that AI systems cite tends to have specific characteristics. It makes direct, falsifiable claims. It cites external sources. It includes statistics with attribution. It answers questions in a way that can be cleanly excerpted. This last point matters more than I initially thought - AI systems aren't summarizing your whole article, they're pulling specific passages. If your content isn't written in citable chunks, it's harder for a model to quote you cleanly even when it wants to.

This is actually a content design problem as much as an SEO problem. It's related to something I've written about before in the AI UX space - most people building for AI systems haven't thought about the machine as a reader with specific needs. The machine reads differently than a human does. It's looking for density, structure, and attributable claims.

Writing for AI citation means writing in a way that makes a model's job easier. Short, precise statements. Named sources. Numbers with context. The exact opposite of the fluffy "this is interesting to explore" writing style that pads word counts but says nothing.

The data point that actually rewired how I think about this

I went deep on the Ahrefs research and one number kept stopping me.

Brand mention correlation to AI citation: 0.664. Backlink correlation: 0.218.

That's not a small gap. Brand mentions are three times more predictive of AI visibility than backlinks. Three times. The entire SEO industry is built around backlinks as the gold standard signal - and for Google rankings, that's still mostly true. But for AI citation, what matters is whether your name, your brand, your entity is being talked about across the web.

The Semrush data added another layer: nofollow links perform nearly as well as dofollow links for AI visibility. In traditional SEO, a nofollow link is worth significantly less. For GEO purposes, the signal isn't the PageRank transfer - it's the mention. The presence. The fact that you're being referenced.

This is a real reorientation. The question isn't just "who links to me?" It's "who talks about me, mentions me, references me across contexts?" Those are different things. The second one is what I had been ignoring completely.

It connects to something I wrote about leaving fintech to build AI systems - the whole reason I went independent was to build things that matter. Building an AI presence that's actually citable is part of that.

What the platform data actually shows

When I started auditing where brand mentions were coming from across the 168 platforms I ended up cataloging, some patterns were obvious in hindsight.

Platforms that block AI crawlers still contribute to traditional SEO - backlinks, referral traffic, domain authority signals. But they contribute zero to AI citation. If a major AI crawler can't index the content where you're being mentioned, that mention is invisible to the model. Doesn't matter how high-authority the platform is. Doesn't matter how many people read it. The AI never saw it.

Several platforms that have solid SEO reputations block AI crawlers entirely. A few you'd never expect have wide-open access. The robots.txt data across 168 platforms genuinely changed my prioritization for where to spend time building presence.

High-DA platform with AI crawlers blocked: useful for search rankings, useless for GEO. Medium-DA platform with full AI crawler access: directly contributes to citation potential. Those aren't the same trade-off at all. Treating them the same - which most SEO frameworks do - is leaving real GEO value on the table.

Why "SEO is dead" gets the diagnosis wrong but identifies a real symptom

I keep seeing the "SEO is dead" framing in newsletters, on Twitter, in founder group chats. It's not accurate but it's pointing at something real: the workflows that worked in 2020 are producing worse results in 2025. That's true. The feeling that something fundamental has changed is correct. The conclusion that SEO itself is dead is wrong.

What's happening is that SEO was always a proxy for something else - demonstrating that your content is trustworthy and relevant. Google built ranking signals as a proxy for that. Now AI systems are building citation signals as a different proxy for the same underlying thing. The underlying thing didn't change. The measurement changed.

If you had real expertise and real content depth, most GEO strategies will work for you because you actually have the substance those signals are trying to measure. If you were gaming SEO with thin content and link schemes, GEO is going to be harder because the signals it uses are less gameable. Brand mentions across real communities are harder to manufacture than backlinks. Genuine citations in credible content are harder to fake than directory submissions.

That's probably a good thing. The prompt engineering is dead conversation is related - these systems keep evolving in ways that reward actual depth over tactical gaming. GEO continues that trend.

The mistake is treating this as a binary. SEO or GEO. Old playbook or new playbook. It's additive. Everything that made content good for search still applies. Now there's additional surface area to optimize - entity signals, structured data, AI crawler access, brand mention distribution - that wasn't relevant before.

What I built to solve this

Once I understood the problem I started doing the research manually - checking which platforms actually allow AI crawlers, which ones block them in robots.txt, which ones have high GEO value versus medium versus low. Doing it by hand was a pain in the ass. So I built a system.

awesome-geo is the output: a curated, verified list of 168 platforms with full crawler access data. 142 of them are AI-discoverable. I scored them: 74 high GEO value, 78 medium, 16 low. Every platform has been manually verified against robots.txt for the major AI crawlers - GPTBot, ClaudeBot, Anthropic's crawler, Google-Extended.

I also built geo. deeflect.com - free tools that came out of doing this manually and wishing they existed. AI Visibility Checker, JSON-LD Generator, llms.txt Generator, Meta Tags Generator, robots.txt Generator.

I built this because I needed it and figured others would too. It's open source. Use it.

The reason I verified robots.txt across all 168 platforms is that it matters more than most people realize. A platform could have high domain authority and great SEO value - but if it blocks AI crawlers, it contributes zero to your GEO presence. Several well-known platforms do exactly that. Knowing which ones are actually AI-accessible changes your prioritization completely.

The other thing that came out of building this: the verification process itself is time-consuming in a way that scales badly. You can't just check robots.txt once. Platforms update their policies. A platform that allowed GPTBot in 2023 might have added restrictions in 2024. The landscape is moving, which means any static list becomes stale. The tools at geo. deeflect.com are built to stay current rather than being a snapshot.

This is part of the same instinct that drove building multiple projects solo - when I hit friction repeatedly, I build the thing that removes it, then make it available. The GEO research tooling is that, applied to discoverability infrastructure.

What to do right now if you care about any of this

You don't need to overhaul everything. Start here:

Check where you're mentioned. Brand mentions are the highest-correlation signal. Are you being referenced across contexts beyond your own site?
Add JSON-LD structured data. This is how you communicate entity information to AI systems. If you haven't done it, geo. deeflect.com has a free generator.
Create an llms.txt file. Similar to robots.txt but for LLMs - it gives AI systems structured information about who you are and what you do. Again, free generator on the site.
Verify your platforms allow AI crawlers. Check the robots.txt on any platform you're counting on for AI visibility. You might be surprised what you find.
Think about entity consistency. Your name, credentials, and core claims should be stated consistently across platforms. Inconsistency makes entity resolution harder for models.
Use citations and statistics in your content. The KDD '24 research is clear: this is a top GEO signal. Reference real sources. Include real numbers. This post does that on purpose.
Audit your content structure. Are your key claims written in excerptable chunks? Can a model pull a clean sentence or paragraph that stands on its own? If not, restructure. This is different from readability optimization - it's citation optimization.

The prompt engineering is dead argument I've seen floating around is related to this - the game keeps shifting toward higher-level signals, away from tactical optimization. GEO is the same shift applied to discoverability.

Where this goes from here

I'm just getting started on this.

The research took me deep enough that I have a lot more to share - how different AI systems handle citations differently, what the actual citation mechanics look like across ChatGPT versus Claude versus Perplexity, how to structure content specifically for AI summarization, what the verification data across 168 platforms actually reveals about the crawling landscape.

The fragmentation across AI systems is the next thing I want to dig into properly. Right now, most GEO content treats "AI visibility" as a monolithic thing. It's not. Being cited by Perplexity requires different signals than being cited in a Google AI Overview. ChatGPT's training data cutoff means recent content won't affect your visibility there until the next model version. Claude uses different weighting. These aren't the same problem. Treating them the same is leaving real optimization opportunities untouched.

This article is the intro. I'm building out a full GEO research series here - the tools are live, the data is real, and I'm going to keep digging.

If you've been heads-down on traditional SEO and haven't thought about AI visibility yet, now's the time to start. Not because SEO is dead. Because the finish line moved - and most people haven't noticed yet.

SEO got a co-pilot. Learn to fly both.

Tools and research: geo. deeflect.com - awesome-geo on GitHub

The Distribution Problem Nobody Talks About

Dmitry (Dee) Kargaev — Wed, 11 Mar 2026 00:00:00 +0000

I have 61 post drafts queued up. 91 reply drafts. 18 finished blog posts, voice-matched and slop-filtered, ready to go. The distribution problem nobody talks about isn't content scarcity - it's the gap between a full queue and zero published output. That gap is where builder ambition goes to die quietly, surrounded by perfectly organized markdown files and automated scheduling daemons that never fire.

Let me explain how I got here.

The distribution problem, in numbers

Last week I built a full content automation pipeline. About 20 hours of work. I also built a road trip planning app with Maps integration (3 hours), set up a music production pipeline with AI voice conversion (2 hours), and my agent stack ran 240 autonomous sessions across three days while I was offline doing whatever offline people do.

Content published: zero.

That ratio is not a typo. 25+ hours of building, 240 agent sessions processing work in the background, and the public output was nothing. The drafts sit in a queue. The queue is full. The queue has been full for weeks.

This is the distribution problem nobody in the builder community talks about, because talking about it means admitting the pipeline is a cope.

What I actually built

Let me give you the full picture so you understand how deep this goes.

The system is called CCC - Content Command Center. Here's what it does:

A scanner watches my X timeline and pulls content into a viral library. I've got 19 saved tweets in there right now. A remix engine takes those, plus my own writing patterns, and generates drafts in three streams: reaction tweets, personal/building-in-public posts, and evergreen content. Each draft goes through voice matching. Slop filtering. Then into schedule slots at 8am, 12pm, and 5pm with jitter built in so it doesn't look bot-like.

The posting layer runs through a Playwright daemon, headless Chromium on port 3381, because OAuth kept throwing 403s on replies. I spent a full afternoon debugging that. It works now. It posts to X and I've got Bluesky and LinkedIn integration planned.

The system is genuinely good. Multi-platform support, three content streams, voice-matched output, automated scheduling with a real distribution daemon underneath it. If I were selling this as a SaaS tool, I'd be proud of the architecture.

But there's 61 post drafts and 91 reply drafts sitting in the queue.

Nothing posted.

When your own agent calls you out

Here's where it gets embarrassing.

My AI agent system does weekly knowledge graph rebuilds. It maps entities and relationships across everything I'm working on - projects, patterns, decisions, outputs. I didn't ask it to do anything special. It just runs.

Last week it created a node called revenue_avoidance_pattern and connected it to multiple projects.

Not work_in_progress. Not pre-launch_phase. Revenue avoidance pattern.

The agent found this pattern by analyzing my behavior across weeks of data and decided it was significant enough to be a named entity in my knowledge graph. My tools are diagnosing me now. I built a system smart enough to identify that I'm using building as a substitute for shipping, and now I have to sit with that.

The knowledge graph has this node connected to OpenClaw, CCC, dee.ink (my 31 Rust CLI tools project), the blog, the social queue - everything. It's not pointing at one project as the problem. It's pointing at a pattern across all of them.

That's a different kind of feedback than a friend telling you to "just post more."

The psychology of building as avoidance

I've been building since I was 14. Started freelancing in design, shipped products for 70+ banks across 15 countries at VALK, won industry awards, got written up in Forbes and CNN. I know how to execute. The capability isn't the problem.

The problem is that building feels safe in a way that distributing doesn't.

Code works or it doesn't. The compiler tells you immediately. You fix it or you don't. There's no ambiguity, no social judgment, no public record of failure. When something doesn't compile you're not a bad person - you just have a bug. You fix the bug and move on.

Distribution is different. Distribution means putting your name on something, making a claim about it, and then watching the internet decide if it agrees. For an introvert who'd rather spend 14 hours in a hyperfocus coding session than send one networking email, that asymmetry is not small. The emotional cost of one negative reply can outweigh the satisfaction of 50 good ones. The brain doesn't do expected value math - it pattern-matches to threat.

So you build another tool. Another automation layer. Another pipeline. "I'll post when the system is ready." Except the system is never quite ready, because readiness is a moving target you control, and the internet's judgment is not.

I bought 15+ domains last month for projects I haven't started. One evening I spent scanning 3,495 TLDs for available domain names instead of posting the drafts I already had. That's not a productivity problem. That's scope expansion as a coping mechanism. Classic avoidance dressed up as preparation.

You're not preparing to launch. You're preparing to prepare.

This pattern has a name in psychology. Researchers studying creative avoidance call it "productive procrastination" - the tendency to fill time with legitimate-seeming work that doesn't advance the actual goal. A 2019 study published in Psychological Science found that people systematically underweight the cost of inaction when the alternative activity feels productive. Building a better pipeline is productive. It's also not shipping. The brain accepts the substitution.

The ADHD variable

The burst-crash cycle makes this worse in a specific way.

My ADHD brain goes hard for 3-4 days - hyperfocus, 14-hour sessions, ship 3 projects, write 10 drafts, rebuild the whole agent stack. Then I go quiet for 4-5 days. That's not laziness. That's recovery. The neuroscience is what it is, and I stopped feeling bad about the crash phase a while ago.

But the burst-crash cycle interacts badly with distribution.

Distribution requires consistency. Not quantity - consistency. Showing up on Tuesday when you don't feel like it. Posting the medium-quality take because good enough beats nothing. Engaging with replies during the low-energy period when you'd rather be offline.

Building can happen in bursts. You can build an entire app in a 3-day hyperfocus sprint and it'll be fine. The code doesn't care that you disappeared for a week after.

An audience does. The algorithm does. The compounding effect of consistent distribution is entirely undermined by a 2-week silence after a 3-day posting burst. I wrote about disappearing from Twitter for two months and watching the metrics crater. I know this. I still do it.

So the system I built - the pipeline, the scheduler, the daemon on port 3381 - is actually a real solution to a real problem. Automated distribution to compensate for the burst-crash cycle. Batched creation during hyperfocus, drip-fed output during the crash.

The system works. It's just not running because I haven't pressed go.

Why the distribution problem nobody talks about is actually structural

Here's the thing that took me a while to see clearly. This isn't just a personal psychology problem. It's a structural problem with how builders work, and the current AI tooling makes it worse before it makes it better.

We've had an explosion of creation tools. Claude, GPT-4o, Cursor, v0 - the cost of generating content or code has collapsed to near zero. What hasn't scaled is the decision-making layer on top of it. The question "is this worth publishing?" still costs the same amount of executive function it always did. Maybe more, because now you have 91 reply drafts instead of 9.

Abundance doesn't solve distribution. It makes the selection problem harder.

I've talked to enough builders in the AI space to know this isn't niche. The indie hacker forums are full of people with polished MVPs that haven't launched because they're "adding one more feature." The build-in-public community celebrates shipping but rarely discusses the pre-ship paralysis that affects most of the people who never make it to the public part.

The tools that exist for this are surprisingly thin. Buffer and Hootsuite solve scheduling. They don't solve the threshold decision. Ghost and Substack make publishing easy. They don't help you figure out which of your 18 drafts goes first. There's a real gap here and it's not primarily a technology gap - it's a decision architecture gap.

The research on decision fatigue is relevant. The more choices you have to make, the worse your decision-making gets over the course of a day. When I've spent 8 hours making technical decisions - model selection, prompt structure, error handling - I have nothing left for "is this tweet worth posting?" So I don't. The default behavior when executive function is depleted is to do nothing, and doing nothing means the queue grows.

The fix isn't better content. It's removing the decision from the critical path.

The solution I actually built (and then sat on for two weeks)

I'm not going to end this with five productivity tips. You've read those. They didn't work. Here's what I'm actually trying:

The CCC system has a minimum viable publishing threshold I added last week. If a draft scores above a certain quality bar and has been in the queue for more than 48 hours, it goes live automatically. No manual review gate. If I want to stop a post, I have to actively intervene. Default is publish, not hold.

This is anti-intuitive for someone who wants everything to be perfect. That's the point.

The technical implementation is pretty simple. Each draft gets a composite score on creation: voice match confidence (0-1), estimated engagement based on patterns from the viral library, and a topic freshness score that decays after 72 hours. Anything above 0.7 composite after 48 hours in queue triggers a publish. I can override with a HOLD flag in the draft metadata. But I have to do that actively - the default is go.

Here's what the draft metadata looks like:

{
 "draft_id": "ccc_20250118_042",
 "content": "...",
 "created_at": "2025-01-18T04:22:00Z",
 "scores": {
 "voice_match": 0.84,
 "engagement_est": 0.71,
 "freshness": 0.93
 },
 "composite": 0.83,
 "status": "QUEUED",
 "publish_after": "2025-01-20T04:22:00Z"
}

If status is QUEUED and publish_after has passed and composite is above threshold, the daemon posts it. No human in the loop unless I add a HOLD flag. Default behavior is publish.

I'm treating distribution the same way I treated the coding stack - build the system so the right behavior happens by default, not by willpower. Willpower depletes. Systems don't.

The irony is that the most publishable thing I've made in weeks is this post. The thing where I admit that I have 18 finished blog drafts and haven't posted any of them. The thing where I confess my agent system diagnosed me with revenue avoidance. The thing that took me two hours to write instead of the 20 hours I spent building the pipeline that would have made posting automatic.

Vulnerability is more interesting than competence. People can't relate to "I built a perfect system" - they can relate to "I built a perfect system and then didn't use it for two weeks because I was scared."

What actually breaks the loop

The blog drafts are going to start coming out this week. Not because I rewired my psychology. Because the pipeline now defaults to shipping and I have to do work to stop it.

Three things that are actually helping, not as a listicle but as honest data points:

Changing the default. The biggest unlock was making publish the zero-effort option and hold the effortful one. This is just Thaler and Sunstein's nudge theory applied to a content queue. Change the default, change the behavior, without requiring willpower or changed preferences.

Separating creation from curation. I stopped trying to decide whether something is worth publishing in the same session I wrote it. The draft goes in the queue, scores get calculated asynchronously, and the decision happens later based on the composite score - not on how I feel at 2am after a 12-hour build session.

Making the cost of inaction visible. The knowledge graph node was brutal but useful. When revenue_avoidance_pattern is sitting there in your entity graph connected to six projects, it's harder to pretend you're just being careful. I added a dashboard widget that shows days-since-last-publish. Right now it says 14. That number being visible every morning is uncomfortable in a productive way.

If you're a builder reading this, you probably recognize the pattern. The seven apps I built solo before any of them got real traction. The MCP server wrapping 56 APIs that I built and documented and then sat on for three weeks before publishing anything about it. The infrastructure-first, distribution-never cycle that affects probably 60% of the people building seriously in this space.

We talk about it in private. In DMs. In "lol I have like 30 unpublished drafts" jokes. But the actual posts are all about shipping, about momentum, about velocity. Not about the 3am moment when you realize your knowledge graph has a node called revenue_avoidance_pattern and it's accurate.

The distribution problem is real. It's structural. It's also solvable with the same tools I'd use for any other problem - designed around the actual constraint, which isn't capability. It's the friction between building and letting go.

The queue is full. Time to empty it.

If you're stuck in the same loop, my about page has context on what I'm building and why. And if your own agent stack starts diagnosing your behavior patterns, maybe take notes. It's uncomfortable but it's probably right.

castkit: CLI Demo Videos From One Command

Dmitry (Dee) Kargaev — Thu, 05 Mar 2026 00:00:00 +0000

Every CLI tool I've ever shipped had the same problem: zero visual presence. Then I built castkit - a castkit CLI demo video generator that turns any binary into a polished MP4 or GIF with one command. No screen recording software. No manual scripting. No video editing. Open source, written in Rust, and the meta-demo sells it better than I can: castkit generates its own demo video.

Check the repo before reading further if you want to see the output first.

Why I actually built this

Screen recording CLI tools is a pain in the ass. You open your terminal, run the command, inevitably typo something on the third take, realize the font looks weird, forget to hide your API key in the environment, and end up with a 45-second raw recording you now have to edit in Final Cut.

The existing options all have real tradeoffs. asciinema is free and captures terminal output, but the playback looks like a terminal log - no visual polish, no branding, nothing you'd put on a landing page. Screen Studio costs $89 and produces beautiful results, but you're still recording manually. VHS from Charm is the closest thing to what I wanted - declarative, scriptable - but you have to write a .tape file by hand for every recording. No auto-discovery. No agent support. Still requires you to think about the demo structure yourself.

The trigger for actually building this was working on AI coding agents. I kept hitting the same pattern: an agent builds a CLI tool, the CLI tool works, and then the last step is... what? Push to GitHub with a README? That's a dead end. I wanted the agent to be able to say "generate a demo video" as the final build step. No human intervention. No manual configuration. Just: here's a binary, produce something I can ship.

That didn't exist. So I built it.

What castkit actually does

The pipeline is: binary/command → discover → plan → record → redact → render → encode → MP4 or GIF.

Each stage does real work.

Discover reads your tool's --help output and README to understand what commands exist, what flags do what, and what a reasonable demo flow looks like. It builds a structured picture of the tool without you explaining anything.

Plan takes that understanding and generates a demo script as editable JSON. This is the part I'm most proud of. You can run castkit plan on any CLI tool, get a JSON file showing every scene, every command, every pause duration - and edit it before recording. It's not a black box.

Record runs the actual PTY recording with human-like typing jitter. Real typing cadence, not sleep 0.1 between every character. The variation is calibrated to feel like a person typed it, not a script.

Redact runs automatically before anything gets rendered. Environment variables, API keys, tokens - anything that looks like a secret gets masked. Safe by default. You don't have to remember to do this.

Render is where it gets interesting. castkit uses cosmic-text for text shaping and tiny-skia for pixel rendering. Full software renderer - no GPU dependency, no display server needed, runs in CI. It draws macOS window chrome (traffic lights, title bar, drop shadow), handles auto-zoom with easing, crossfade transitions between scenes, cursor smoothing with blink animation.

Encode pipes frames to ffmpeg for H.264 MP4 or GIF output.

The whole thing is a single binary plus ffmpeg. No runtime dependencies. No Python environment. No Node. Ship it anywhere.

How the castkit CLI demo video generator works end to end

The one-command path is:

castkit demo./your-binary

That's it. It discovers, plans, records, and renders without you specifying anything. You get an MP4 with branding, themes, and proper window chrome.

If you want more control:

castkit plan./your-binary --output demo-plan.json
# edit demo-plan.json
castkit record --plan demo-plan.json --output demo.mp4

The plan JSON looks like this:

{
 "title": "castkit demo",
 "theme": "catppuccin",
 "style": "dark",
 "scenes": [
 {
 "type": "command",
 "command": "castkit --help",
 "duration_ms": 2000,
 "pause_after_ms": 1500
 },
 {
 "type": "command",
 "command": "castkit demo./my-tool",
 "duration_ms": 3000,
 "pause_after_ms": 1000
 }
 ]
}

You can add intro cards, outro branding, adjust timing, swap themes. The plan is the contract between what you want and what gets rendered.

Themes and visual styles

Built-in color themes: catppuccin, tokyo-night, dracula, one-dark. Visual styles: dark, light, minimal, ocean, hacker. These aren't just color swaps - each style affects font weight, padding, window chrome treatment, and background rendering.

The hacker style does exactly what you think it does.

Two recording modes

Terminal mode is the classic full-terminal recording. The full PTY, command output scrolling, cursor - everything you'd see if you were sitting at the machine.

Web mode is for CLI tools that spawn a browser or have web UI output. It captures that context instead.

The technical stack

Rust was the right call here for a few reasons. The rendering pipeline needs to be deterministic and fast - you're compositing frames, and any variability in timing shows up in the output video. Rust's ownership model also made the PTY recording safe to implement without the race conditions you'd fight in Go or Python.

Key crates:

vt100 - terminal state parsing and capture. This is the core of getting accurate terminal output into a data structure we can render.
cosmic-text - text layout with proper Unicode support, ligatures, font fallback. CLI output has a lot of edge cases here.
tiny-skia - pure-Rust 2D renderer. No Cairo, no Skia binding hell, no native dependency chain.
portable-pty - cross-platform PTY. The thing that lets you actually run a real shell session and capture it.
clap - CLI interface. The irony of using a CLI framework to build a CLI demo tool is not lost on me.

The rendering pipeline goes: PTY session → vt100 parser → terminal cell grid → tiny-skia frame → raw pixel buffer → ffmpeg stdin. Each frame is fully rendered in software. That's intentional - it means the tool runs in headless CI without needing a display.

The auto-zoom is probably the feature that makes demos look professional without any work. When a command produces long output, castkit calculates the bounding box of the relevant content and applies a smooth zoom-in with easing. It's the thing that makes it look like someone edited the recording, when nobody did.

Why not Go or Python

I get asked this. Go would've been fine for the CLI surface, but the rendering layer involves pixel-level compositing with tight frame timing. The garbage collector pauses in Go show up as dropped frames when you're pushing raw buffers to ffmpeg at 30fps. Python isn't even in the conversation for a tool you want to distribute as a single binary.

Rust also gave me compile-time guarantees on the PTY session lifecycle. A PTY that doesn't get cleaned up properly hangs your terminal. With Rust's Drop trait handling cleanup, that class of bug just doesn't happen.

Running the castkit CLI demo video generator in CI

This is where it gets useful beyond the local workflow. Because castkit runs headless - no display server, no GPU, just CPU and ffmpeg - it works in GitHub Actions without any special setup.

name: Generate Demo

on:
 push:
 branches: [main]

jobs:
 demo:
 runs-on: ubuntu-latest
 steps:
 - uses: actions/checkout@v4
 - name: Install ffmpeg
 run: sudo apt-get install -y ffmpeg
 - name: Download castkit
 run: |
 curl -L https://github.com/deeflect/castkit/releases/latest/download/castkit-linux-x86_64 -o castkit
 chmod +x castkit
 - name: Build your tool
 run: cargo build --release
 - name: Generate demo
 run:./castkit demo./target/release/your-tool --output demo.mp4
 - name: Upload demo artifact
 uses: actions/upload-artifact@v4
 with:
 name: demo-video
 path: demo.mp4

Every push regenerates the demo. If your tool's output changes, the demo reflects it automatically. No stale GIFs in your README. The agent-native workflow I actually want is this: Claude Code builds the tool, the CI pipeline generates the demo, the demo gets committed to the repo. Zero human steps in the video production chain.

That's not hypothetical. I've been running this on a few of my own tools for the past few weeks and it works exactly as described.

Why this matters for CLI tool marketing

Most developers shipping CLI tools think of documentation as the end of the marketing funnel. It's not. The funnel is:

Someone sees the GitHub repo linked somewhere
They skim the README in about 8 seconds
They either get it or they don't

A 15-second GIF that shows the tool running converts browsers into users at a completely different rate than text. This isn't an opinion - every product person who's run A/B tests on landing pages with and without video knows this. The video wins. Every time.

But developers don't make demo videos because making demo videos is annoying. It's a context switch out of the thing you're building and into video production, which is a different skill set, different tooling, different mental mode. castkit makes that context switch disappear. You're still in the terminal. You're still thinking like an engineer. You just run a command.

The agent-native angle is where I think this gets genuinely interesting going forward. Right now I use Claude Code for a lot of my CLI development. The last step of that workflow - "generate a demo for this" - can now be a single tool call. The coding agent builds the thing and generates the demo as part of the same build step. No human in the loop for that part.

That's the version of developer tooling I actually want to work in.

What's missing and what's next

The current version has two gaps I know about:

Windows support is partial. The PTY layer uses platform abstractions but I haven't battle-tested it on Windows. If you hit issues, open an issue - I want to fix this.

Dynamic content is tricky. If your CLI tool outputs things that change every run (timestamps, generated IDs, random data), the recording captures whatever happened during that specific run. There's no redaction for content variance the way there is for secrets. You can work around this by mocking your tool's output in the plan, but it's not automatic yet.

On the roadmap: a --dry-run mode that shows you the rendered plan without executing real commands, support for recording multiple tools in one session (useful for showing integrations), and a web viewer for the JSON plans so non-engineers can review before recording.

The code is MIT licensed and the repo is open. PRs are welcome. If you build something with it, I want to see the demo.

The meta-demo point

I said it up top but it's worth landing: castkit generates its own demo video. That's not a marketing stunt. It's the proof of concept. If a tool can document itself, the core premise works.

Every CLI tool I build from here ships with a demo generated by castkit. Not because I've committed to some content strategy, but because it takes one command and it looks good. That's the bar. If it takes more than one command or it looks bad, it doesn't get used.

Right now it's at one command and it looks good. Check the 31 Rust CLI tools for what I'm building next, or browse by my coding stack if you want more of this kind of thing.

Go run castkit demo./your-tool on whatever you're building. If the output looks wrong or the plan generation misses something obvious, open an issue with your --help output. That's the fastest way to make the discovery smarter.

31 Rust CLI Tools Built for AI Agents

Dmitry (Dee) Kargaev — Thu, 05 Mar 2026 00:00:00 +0000

I shipped 31 open-source Rust CLI tools in one project. Not 31 features in one tool - 31 separate crates, each doing exactly one thing, each installable on its own. That project is dee.ink, and building it changed how I think about the right interface for AI agents. If you're building agentic systems and haven't thought hard about open-source Rust CLI tools for AI agents as an alternative to MCP servers, you're leaving serious efficiency on the table.

The short version: CLI tools are dramatically more token-efficient than MCP servers for AI agent workflows. I measured 35x in my own benchmarks. Once you see that number, you can't unsee it.

Here's how it happened and why it matters.

Why I built dee.ink in the first place

I run a multi-agent system called OpenClaw. If you've read castkit, you know it handles my daily workflow - morning digests, research, crypto monitoring, content scheduling, health data. It's not a demo. It processes real work every day.

Agents need to reach into the world. Check Hacker News. Look up SSL cert expiry. Parse an RSS feed. Generate a QR code. Turn a receipt photo into structured data. These aren't complex tasks but they come up constantly, and every time an agent needs to do one, it needs a tool.

The popular answer right now is MCP - Model Context Protocol, Anthropic's standard for agent tool-calling. I tried it. The overhead is real: each tool call needs a running server, connection setup, and verbose JSON-RPC framing. For stateful tools or bidirectional streams, that overhead makes sense. For "search Hacker News and return the top 10 posts," it's wasteful by design.

So I built CLIs instead. One tool per job. JSON output. No interactive prompts. Works with pipes. That's it.

After I'd built a few for my own use, I realized I had the start of something worth packaging and open-sourcing. dee.ink is the result: 31 standalone Rust CLI tools built specifically to be called by AI agents.

Open-source Rust CLI tools for AI agents vs MCP: the real argument

Let me be concrete about why CLI beats MCP for most agent tool use.

An MCP server for a simple search tool looks roughly like this from the agent's perspective: spin up the server process (or connect to an already-running one), send a JSON-RPC request with the method name and parameters, wait for the response envelope, parse the result out of the envelope. The agent has to know the MCP protocol, or more accurately, the framework wrapping the agent has to know it.

A CLI tool looks like this: ink-hn top --limit 10 --json. That's it. The agent gets back a clean JSON array.

The token efficiency gap comes from a few places. First, CLI invocation syntax is compact. A shell command is 5-20 tokens. An MCP request envelope is 50-200 tokens before you even add the parameters. Second, every LLM ever trained has seen millions of shell commands in its training data. They're native CLI speakers. They're not native JSON-RPC speakers - you can see this in how confidently models generate shell invocations vs. how often they fumble JSON-RPC schema details. Third, no server to maintain means no connection overhead, no process management, no "is the server running?" failure mode.

The 35x efficiency number comes from comparing token usage for the same operations across both approaches in my own OpenClaw setup. It's not a controlled academic study but it's real usage data from a system that runs 14 cron jobs daily.

There are cases where MCP is genuinely better. Long-running sessions where you want persistent state. Bidirectional streams. Tools that need to push updates back to the agent rather than return a one-shot result. For those, use MCP. But "search HN" or "check WHOIS" or "generate an invoice"? CLI wins every time.

One more thing people miss: debugging. When an MCP tool call fails inside a framework like LangChain or CrewAI, you're often staring at a wrapped exception with zero useful context. When a CLI tool fails, you have a shell command, an exit code, and stderr output. You can reproduce it in 10 seconds. That matters a lot when you're maintaining a system that runs overnight.

What's inside the toolkit

The dee.ink toolkit is 31 crates across six categories. Let me walk through them.

Data and research : dee-hn, dee-arxiv, dee-reddit, dee-wiki, dee-feed, dee-ph. These are the tools I use most. An agent can check Hacker News trending stories, pull an arXiv abstract by ID, search Reddit, look up a Wikipedia article, parse an RSS feed, or get Product Hunt launches - all in one command, all as JSON.

Financial : dee-invoice, dee-receipt, dee-rates, dee-pricewatch, dee-ebay, dee-amazon. Generate invoices, parse receipts, check exchange rates, watch prices, search marketplaces.

Personal productivity : dee-contacts, dee-habit, dee-todo, dee-timer, dee-stash. The local storage tools here use SQLite via rusqlite. Your data stays on your machine. Agents can manage your contacts, log habits, check todos, start timers, or stash arbitrary data for later retrieval.

Developer tools : dee-openrouter, dee-ssl, dee-whois, dee-qr, dee-porkbun. Check SSL cert expiry, run WHOIS lookups, generate QR codes, manage Porkbun DNS records, query OpenRouter for available models and pricing.

Location : dee-food, dee-events, dee-parking, dee-gas, dee-transit. Local-aware tools for finding restaurants, events, parking, gas prices, and transit schedules.

Social and trends : dee-crosspost, dee-mentions, dee-trends. Cross-post content, monitor mentions, check trend data.

Each crate is fully standalone. Installing dee-ssl doesn't pull in any shared dee-core dependency. You get exactly what you need, nothing more.

The technical stack

I picked Rust for a few reasons that aren't just "Rust is fast."

Binary size matters when you're shipping 31 separate tools. A Go binary for a simple CLI is usually 10-15MB. A stripped Rust binary for the same tool comes in under 3MB. When someone does cargo install dee-hn, they're downloading and compiling one focused tool. Small is respectful.

The other reason: Rust's clap v4 with derive macros makes argument parsing almost free to write. The --help output is generated automatically from your struct definitions. Every tool in dee.ink has --help with actual usage examples because making that happen is nearly zero effort.

Here's what the argument struct looks like for a typical tool:

#[derive(Parser, Debug)]
#[command(name = "ink-hn", about = "Hacker News CLI for AI agents")]
struct Args {
 #[command(subcommand)]
 command: Command,

 /// Output as JSON
 #[arg(long, global = true)]
 json: bool,
}

#[derive(Subcommand, Debug)]
enum Command {
 /// Get top stories
 Top {
 /// Number of stories to fetch
 #[arg(long, default_value = "10")]
 limit: usize,
 },
 /// Search stories
 Search {
 /// Search query
 query: String,
 },
}

For tools with local storage (todo, habit, contacts, stash), SQLite handles persistence via rusqlite. No database server, no config files to manage. The database lives at ~/.local/share/dee-toolname/data.db and everything just works.

HTTP is either reqwest (for complex clients with retries) or ureq (for simpler one-shot requests). I pick ureq when I can - it compiles faster and the binary is smaller.

Exit codes are strict: 0 for success, 1 for tool error, 2 for usage error. Agents need reliable exit codes to know if a command succeeded. This sounds obvious but a surprising number of CLIs return 0 on failure because someone forgot to handle an error branch. In agent workflows that kills you - your orchestrator thinks the call succeeded and moves on with bad data.

Agent-first design decisions

This is the part that's different from building a CLI for humans.

No interactive prompts. Ever. If an argument is missing, the tool errors out with a clear message. It doesn't ask "did you mean this file? (y/n)". Agents can't answer interactive prompts. A tool that blocks waiting for keyboard input is broken for agent use.

Every tool has a --json flag that guarantees structured output. Without --json, tools print human-readable text. With --json, they print a JSON object or array, always to stdout, always parseable. No mixed text/JSON output. No progress bars to stdout (they go to stderr or get suppressed).

Pipe support is first-class. Tools accept stdin when appropriate. You can chain them:

ink-feed parse https://hnrss.org/frontpage | ink-stash save --key hn-today

Each tool ships an AGENT.md file in the repo. This is a short markdown document explaining to an AI agent how to use the tool effectively - what the flags do, what the output schema looks like, common patterns. When an agent needs to use a tool it hasn't seen before, it can read AGENT.md and understand the interface without trial and error.

There's also a FRAMEWORK.md at the repo root that defines the conventions every tool follows. Any agent that has read FRAMEWORK.md can make reasonable guesses about how any dee.ink tool works. That's intentional. Consistency is the whole point.

What "consistent" actually means in practice

Every tool uses the same flag names for the same concepts. Pagination is always --page and --limit, never --offset or --per-page or --count. JSON mode is always --json, never --format json or --output json. Verbose mode is always -v.

This matters because agents build up a mental model of your toolset. If the first five tools they use have consistent interfaces, they'll correctly predict how the sixth one works. If your flags are inconsistent, the agent has to treat each tool as a new unknown - which costs tokens and causes errors.

It's the same principle as a good design system. The value isn't any single component. It's the pattern that makes everything predictable.

How open-source Rust CLI tools fit into a real agent workflow

Concretely, here's how one of these tools actually gets used inside OpenClaw. My morning digest job runs at 7am. It pulls Hacker News top stories, recent arXiv papers in a few categories, and Reddit posts from a handful of subs. Then it summarizes and formats everything into a digest I read with coffee.

The shell side of that looks roughly like:

ink-hn top --limit 20 --json > /tmp/hn.json
ink-arxiv search "multi-agent systems" --days 7 --json > /tmp/arxiv.json
ink-reddit hot r/MachineLearning --limit 15 --json > /tmp/reddit.json

Three commands. Three JSON files. The orchestrating agent reads those files, does the summarization, formats the output. Total token cost for the data collection phase: maybe 150 tokens across all three commands. The MCP equivalent for the same three sources would be three server connections, three request envelopes, three response envelopes. Easily 10x the token overhead, plus you need three MCP servers running.

That's not a toy example. That's the actual flow, running every morning.

The installation experience

cargo install dee-hn
ink-hn top --limit 5 --json

That's the whole install flow. No Docker. No Python virtual env. No npm. Cargo installs the binary, it goes in ~/.cargo/bin, and it's available system-wide. For agent use in particular, this matters - you don't want to manage environments when your agent needs to call a tool.

For people who want everything at once, I'm working on a meta-crate that installs the full suite, but honestly most people only need a subset. The standalone install is the right default.

You can find all 31 crates on the dee.ink site and browse the source on GitHub. Every tool is MIT licensed.

Why open source, and what I'm getting out of it

I'm not monetizing dee.ink directly. No SaaS wrapper, no premium tier. The tools are free, the code is open.

The actual return is authority and credibility. Shipping 31 production-quality Rust CLI tools is a more compelling signal than any portfolio piece I could write. Developers can read the code, use the tools, see the design decisions. That's a much better "hire me / work with me" artifact than a case study PDF.

It also forces quality. When something is public, you think twice about the shortcuts. Every AGENT.md, every --help example, every error message is a little more considered because someone else might read it.

And honestly? The tooling gap was real. I needed these tools for OpenClaw. If they didn't exist, I'd have built them for private use anyway. Open-sourcing them was 20% more work for a much better outcome.

What comes next

31 tools is a good start but there are obvious gaps. I'm planning tools for GitHub (issues, PRs, repo stats), calendar integration, and a few more financial data sources. The architecture makes adding tools easy - each one is genuinely independent, so adding dee-github doesn't touch anything that already ships.

I'm also watching how people actually use these in their own agent setups. If you're building with Claude Code, Cursor, or any other agent framework and want CLI tools that just work, this is worth checking out. If you find a gap, open an issue or PR. The whole thing is built to be extended.

You can the universal MCP server and follow the build log here on the blog. If you're interested in the agent workflow side - how OpenClaw actually orchestrates all of this - I'll be writing that up next. Subscribe or check back.

The tools are at dee.ink. The code is on GitHub. Install one and see if it fits your stack. And if you want to browse more posts on agent architecture and tooling, check the tags page.

The 3-Month Gap: Building AI Agents That Actually Work

Dmitry (Dee) Kargaev — Wed, 18 Feb 2026 00:00:00 +0000

The 3-month gap between building an AI agent and it being useful nearly broke me.

That's the stretch nobody posts about. I've seen a hundred Twitter threads about building agents. Maybe three honest ones about what happens after you ship v1 and start using the thing for real. The 3-month gap building AI agent systems from "it demos" to "it actually works without me babysitting it" - that's the real project. And it's almost nothing like the first 30 minutes.

I run a multi-agent system called borb on OpenClaw. It handles my daily workflow - reminders, research, content scheduling, code review, memory management. Fifteen-plus specialized agents running different models depending on the task. It's stable now. It does real work every day without me touching it.

It took three months to get there. Here's what that actually looked like.

The 3-month gap between building an AI agent and it being useful

Week one, everything works great. The demo is clean. The happy path is flawless. You show it to someone and they're impressed. You're impressed. You start thinking about what else you can build.

Then you try to use it for something real.

The model hallucinates a tool call that doesn't exist in your registry. The agent enters a loop - calling the same function twelve times, each time getting an error, each time deciding to try again. A cron job fires at 3am on a Wednesday and the API it needs returns a response format you've never seen before. The agent handles it confidently and incorrectly. You find out four hours later.

This isn't bad luck. This is what agents do. A system that needs to operate autonomously across real APIs, real data, real time zones, real rate limits - it's going to hit edge cases constantly. The demo doesn't hit edge cases because demos are choreographed. Production use is chaos.

The first two weeks, I was convinced I had a fundamentally broken architecture. I didn't. I just had a system that hadn't been stress-tested against reality yet. There's a difference.

Week 1-4: You don't know what you don't know

The early failures are the visible ones. The agent tries to call a tool with the wrong parameter format. Easy fix. The system prompt is ambiguous about which memory file to write to. Easy fix. Auth token expires overnight and everything fails silently. Annoying fix, but simple.

You add error handling. You feel productive. You think you're almost there.

You're not almost there. You've patched the surface layer. The deep problems haven't shown up yet because the deep problems only appear when conditions stack in ways you didn't anticipate.

What I didn't understand in week one: error handling for an AI agent isn't like error handling for deterministic code. With regular code, you can enumerate the failure modes and write a case for each. With an agent, the failure mode is sometimes "the model made a creative decision." There's no exception class for that. The agent didn't crash. It just did something you didn't want, then moved on.

This maps to what Anthropic calls "reward hacking" in their model specification documentation - agents optimizing for what looks like task completion without actually completing the task. It shows up constantly in production.

I had an agent that was supposed to add a task to my task file when I asked it to remember something. For about ten days, it was adding the task correctly but also occasionally appending a brief philosophical reflection on the importance of the task. Not always. Maybe 15% of the time. The task file worked fine. It was just... weird. I only noticed when I went back and read through it.

That's the thing about AI agents failing quietly. The output is often plausible enough that you don't immediately catch it.

Week 5-8: The edge cases get stranger

By week five, the obvious stuff is handled. The agent runs reliably on the happy path. You start feeling good about it.

Then the 3am edge cases start appearing.

Specific API returns a field as null instead of an empty string. Rate limit hits mid-task and the agent, instead of waiting, decides to rephrase the request and retry with a different tool. A time zone calculation goes wrong because of daylight saving time and a scheduled task fires an hour early. The agent interprets "do this daily" as "do this every time you're invoked" and runs a daily summary five times in a row on a Tuesday.

None of these are predictable from first principles. You can't design your way out of them upfront. You discover them by running the system and watching what happens.

My logging setup saved me here. I log everything - every tool call, every model response, every function output, timestamps on all of it. At 2am when something breaks weird, the logs are the only reason I can figure out what happened. If you're building agents and you're not logging obsessively, you will spend hours debugging by vibes instead of evidence. The OpenTelemetry docs on structured logging are worth a read if you want a sane approach to this - I adapted their structured format for borb's log output and it made parsing way easier.

The most expensive edge case I hit: I didn't have retry limits on one of my agents. It hit an API error that was never going to resolve - the endpoint was just down. The agent kept retrying. Each retry cost tokens. I woke up to $40 in API charges and an agent that had been stuck in a loop for six hours.

After that: every operation in borb has a max of five retries, then it stops and reports the failure. Non-negotiable. The agent's job is not to solve unsolvable problems. The agent's job is to do the task or tell me it can't.

Week 9-12: The architecture is wrong and you have to accept it

This is the hardest part of the 3-month gap building AI agent systems into something reliable.

Around week nine, I started noticing a pattern. Individual fixes weren't sticking. I'd patch one thing and something adjacent would break. The sub-agents were failing silently in ways that only became obvious when I traced back through the memory files. The memory itself was getting stale - agents referencing context from three weeks ago that was no longer relevant, treating outdated information as current.

The problem wasn't any specific bug. The problem was architectural.

My original memory system was a single MEMORY.md file. Everything the agents wrote went into it. This worked fine for the first few weeks when the file was small. By week nine, it was 8,000 words of mixed context - tasks, decisions, completed work, notes, research summaries, all in one place. The agents were pulling from it for context but the retrieval wasn't smart enough to distinguish "recent and relevant" from "old and stale." The whole thing was polluted.

I rebuilt the memory layer. Separate files for different context types - active tasks, completed work, long-term facts, agent-specific state. Added timestamps. Added a lightweight cleanup job that runs daily and archives anything older than two weeks unless it's flagged as permanent.

That's the thing about the debugging phase nobody talks about: some of what you're debugging isn't bugs. It's the architecture not scaling the way you assumed it would. Patches don't fix that. You have to rebuild parts of the system.

I wrote more debugging and refactoring code in month three than I wrote feature code in month one. That ratio felt wrong when I was in it. Looking back, it's exactly right. The feature code gets you to "it demos." The debugging gets you to "it works."

What model selection actually looks like in production

One concrete thing that took me too long to figure out: you can't use one model for everything.

My original borb setup used Claude Sonnet for everything. Consistent, predictable, good at reasoning. Also overkill and expensive for tasks that don't need it.

Now the system is tiered by task complexity:

Orchestration layer (deciding what to do, assigning to sub-agents): Opus. It needs to make judgment calls, handle ambiguity, and route correctly. Skimping here is the most expensive kind of cheap.
Complex reasoning tasks (code review, research synthesis, writing drafts): Sonnet. Good balance of quality and cost.
Simple lookups and fast operations (memory writes, formatting, scheduling checks): Flash. Fast, cheap, and the task doesn't need a frontier model anyway.

The cost difference is significant. Running Flash for the simple stuff cut my daily token spend by about 30% without any change in output quality on those tasks. Because a task like "write this event to the calendar file" doesn't need a 200-billion-parameter model. It just doesn't.

If you're building an agent system and everything is running on your best model, you're leaving both money and performance on the table.

The kill switch is a feature, not an afterthought

I added a kill switch to borb after week two. It's the best thing in the entire system.

One command stops all agents, freezes all cron jobs, and logs the current state of every active task. No partial writes. No half-completed operations. Just a clean stop.

I've used it maybe six times. Twice when something was clearly going wrong and I needed to stop the bleeding. Once when I pushed a bad config change. Three times during architecture refactors when I needed to be sure nothing was running while I was editing core files.

The kill switch means I can experiment aggressively because I know I can stop everything immediately if I need to. Without it, I'd be more conservative about changes - and slower progress because of it.

Building agent systems without a kill switch is like deploying to production without rollback. Technically possible. Categorically reckless.

Why nobody posts about the 3-month gap building AI agent systems that actually ship

Twitter is full of "I built an AI agent in 30 minutes" content. I've read it. Some of it's even technically impressive. But a 30-minute build that you're demoing to a camera isn't the same thing as a system that runs your workflow for 90 days without you touching it.

The demo is marketing. The three months after the demo is engineering.

Building in public usually means sharing wins - launches, milestones, metrics going up. The debugging phase has almost no wins. It's just slightly fewer failures each week than the week before. That doesn't make for satisfying content. It doesn't fit the narrative arc. Nobody wants to read "week six: fixed four edge cases, discovered three more."

But that's the actual work. That's the difference between a project and a product. A project works when you're watching it. A product works when you're not.

I'm not saying this to gatekeep. I'm saying it because if you're three weeks into running your agent and you're losing your mind debugging things you didn't expect - that's normal. You're not in the failure state. You're in the middle of the process. Keep going.

What you actually need to get through the 3-month gap

Based on borb, based on the $40 loop-retry incident, based on the stale memory architecture rebuild - here's what I'd tell myself at week one:

Log everything. Every tool call, every model response, every write. You will need these logs at an inconvenient time and you will be grateful they exist.

Hard limits on retries. Five max, then stop and report. The agent's job is not to solve problems that can't be solved. It's to do the work or surface the failure.

Memory needs maintenance, not just construction. Building the memory system is the easy part. Keeping it clean and relevant over months is the hard part. Budget time for it.

Model selection is architecture. Using the same model for everything is a shortcut that becomes a tax. Right model for right task from the start.

Build the kill switch before you need it. You'll need it.

And accept that months two and three look like a lot of debugging with very little to show for it. That's not failure. That's what it looks like to build something that actually works.

If you're thinking about building an agent system or you're already in the weeds, check out what I write about over on the blog. More of this in the ADHD and AI workflows. And if you want to understand the broader context of who's building this stuff and why, the the MCP server has that.

The demo is the beginning. The 3 months after it are the product.

Universal MCP Server: Two Tools, 56 APIs

Dmitry (Dee) Kargaev — Thu, 05 Feb 2026 00:00:00 +0000

I had 56 APIs I needed my agent to talk to. The idea of maintaining 56 separate MCP servers made me want to close my laptop and never open it again.

So I built one server that handles all of them.

That's the premise behind building a universal MCP server - and specifically, the pattern I implemented in Universal CodeMode: wrap any OpenAPI spec into exactly two tools, search and execute, and let the model figure out the rest. If you're running agents that touch multiple external services, this is probably the architecture you actually want.

The problem with the current MCP ecosystem

The Model Context Protocol is genuinely useful. It's the right abstraction for giving agents access to external tools and services. But the ecosystem has a fragmentation problem that nobody's really talking about.

The current pattern is: one API = one MCP server. Want GitHub integration? Here's a GitHub MCP server with 30 tools. Want Notion? Another server, another 20 tools. Weather API? Linear? Stripe? Each one is its own server, its own deployment, its own auth config, its own maintenance burden.

I run an agent called borb on my OpenClaw system. It needs to hit GitHub for repo management, search for research, weather for daily digests, and a dozen other services for various tasks. Following the standard pattern, I'd need 56+ separate MCP servers deployed and configured. That's not a system, that's a zoo.

The token cost is the other thing. A traditional MCP server for GitHub might expose 30 endpoints as 30 separate tools, each with its full parameter schema described in the context. That's easily 50K tokens just to tell the model what's available - before it's even made a single API call. At scale, that's insane.

Think about what that means in practice. You have an agent doing a simple task - create a GitHub issue, post a Slack message, look up a weather forecast. Three API calls. But before any of those happen, you've burned 150K tokens just describing the available tools across three MCP servers. That's money out the window for zero productive work. My monthly API spend across the whole OpenClaw system sits around $40. That number would be unrecognizable if I was loading full tool schemas for every session.

What building a universal MCP server actually looks like

The insight I'm building on comes from Cloudflare's "Code Mode" pattern. Instead of describing every possible tool upfront, you give the model two generic tools that work with any API:

search - natural language query against the OpenAPI spec catalog, returns the relevant endpoint spec (~1000 tokens)
execute - takes a spec chunk and parameters, makes the actual HTTP call

That's it. Two tools. Any API.

The flow looks like this: agent wants to create a GitHub issue, calls search("create a github issue"), gets back the relevant spec chunk for POST /repos/{owner}/{repo}/issues, then calls execute with the right parameters. The whole thing uses roughly 1000 tokens instead of 50K. That's a 50x reduction in token usage for a single API call sequence.

The key insight is that the model doesn't need to know every possible endpoint upfront. It just needs to know it can search for endpoints. The same way you don't memorize every function in a library - you know how to search the docs.

This also means the catalog can grow without any impact on the model's working memory. Whether you have 10 APIs or 500, the context overhead is identical: two tool schemas, a few hundred tokens. The model only loads the relevant spec chunk at the moment it needs it.

The stack

Universal CodeMode runs on Cloudflare Workers with R2 for spec storage and KV for caching. The core is TypeScript, using Hono for routing and the MCP TypeScript SDK for the protocol layer.

Here's the high-level architecture:

// Two tools. That's the whole interface.
server. tool("search", SearchSchema, async ({ query, catalog }) => {
 // Natural language search against indexed OpenAPI specs
 // Returns relevant endpoint chunk, not the full spec
 const results = await searchCatalog(query, catalog);
 return { content: [{ type: "text", text: formatResults(results) }] };
});

server. tool("execute", ExecuteSchema, async ({ spec, params, auth }) => {
 // Takes the spec chunk from search, builds and fires the HTTP request
 const response = await executeRequest(spec, params, auth);
 return { content: [{ type: "text", text: JSON. stringify(response) }] };
});

R2 stores the raw OpenAPI specs. When a spec is ingested, it gets indexed so the search tool can find relevant endpoints by natural language. KV handles caching so repeated searches on the same endpoints don't keep hitting the index.

The catalog currently has 56 pre-loaded API specs. Adding a new API is just ingesting its OpenAPI spec via the admin endpoint - the search and execute tools work automatically because they're operating on the spec structure, not hardcoded tool definitions.

Why Cloudflare Workers specifically

I could've run this on a VPS or a Lambda function. I went with Workers for three reasons.

First, edge deployment means low latency from wherever the agent is running. An agent mid-task waiting on an API lookup is a bad experience - every millisecond of overhead compounds across a multi-step workflow.

Second, R2 and KV are native integrations. No external database config, no connection pooling, no cold start issues with a separate storage layer. The spec storage and caching are just Workers primitives.

Third, the GlobalOutbound security model fits perfectly for this use case. I can declare exactly which domains the Worker is allowed to call outbound - which is exactly the security property I want for a server that executes arbitrary API calls on behalf of agents.

Security model

Running arbitrary API calls through a single server sounds like a security nightmare. Here's how I handled it.

GlobalOutbound restrictions on the Cloudflare Worker mean the server can only make outbound requests to explicitly allowlisted domains. You can't use execute to hit evil. example.com.

Admin token authentication gates the catalog management endpoints. Ingesting or deleting specs requires the admin token. The two main tools (search and execute) are accessible to agents without admin rights.

Execution timeouts on every outbound request. An agent can't hang the server by calling a slow endpoint.

Auth handling in execute is explicit - credentials get passed as parameters, not stored server-side in the MVP. There's a planned hosted version where you'd configure auth per-catalog-entry, but for now, the agent passes credentials and they're used once then discarded.

Is this perfect? No. But it's a real security model, not vibes.

The domain allowlist is probably the most important piece. A compromised prompt injection attack that tries to exfiltrate data to an external server fails at the network level, not just the application level. Defense in depth.

Self-hosted mode

Not everyone wants to run on Cloudflare. The project supports a self-hosted mode via npx:

npx universal-codemode --port 3000

Point it at your OpenAPI specs, configure your MCP client, done. The Worker version is the "cloud native" path, the npx version is for local dev or running on your own infra.

Config looks like this:

{
 "mcpServers": {
 "universal-codemode": {
 "command": "npx",
 "args": ["universal-codemode"],
 "env": {
 "CATALOG_PATH": "./specs",
 "ADMIN_TOKEN": "your-token-here"
 }
 }
 }
}

That's your entire MCP configuration. One entry. Covers every API in your catalog.

Compare that to what the equivalent config looks like with the standard one-server-per-API approach. If you're running five integrations, you have five entries in your MCP config, five different sets of env vars, five different deployment concerns. Something breaks and you're debugging which of the five servers is the problem. With this setup, there's exactly one thing to look at.

Test coverage

13/13 E2E tests passing against real APIs - GitHub, JSONPlaceholder, and httpbin. The test suite covers the full search-then-execute flow, auth parameter handling, error responses, and the catalog management endpoints.

$ npm test

✓ search returns relevant endpoint for natural language query
✓ search handles queries with no matching endpoints
✓ execute calls GitHub API with correct parameters
✓ execute handles 404 responses gracefully
✓ execute respects timeout configuration
✓ catalog ingestion processes valid OpenAPI spec
✓ catalog ingestion rejects invalid spec
✓ admin endpoints reject requests without valid token... (13/13 passing)

Real tests against real endpoints, not mocks. If the GitHub API's behavior changes, the tests catch it.

Testing against real APIs instead of mocks is a deliberate choice. Mocks give you confidence that your code does what you think it does. Real endpoint tests give you confidence that your code actually works. For infrastructure that agents depend on at runtime, I want the second kind of confidence. The tradeoff is that tests can fail for reasons outside my control - rate limits, API downtime, auth token expiry. That's fine. Flaky tests that catch real issues are better than reliable tests that don't.

Where this fits in the universal MCP server landscape

There are a few other projects trying to solve the "too many MCP servers" problem. Most of them are building aggregators - one entry point that proxies to multiple underlying servers. That doesn't solve the token problem, it just reduces the config burden.

The Code Mode pattern is different because it changes the interface. Instead of the model needing to know about github_create_issue and github_list_repos and github_get_pull_request as separate tools, it just knows about search and execute. The catalog can grow to 200 APIs and the model's tool interface stays exactly the same size.

OpenAPI as the common format is doing a lot of work here. Virtually every serious API publishes an OpenAPI spec now. That means the ingestion pipeline is universal - you're not writing custom parsers for each API.

There's also a maintenance angle worth thinking about. When Stripe updates their API, a traditional MCP server for Stripe needs to be updated and redeployed. With this approach, you ingest the updated OpenAPI spec via the admin endpoint and you're done. The search and execute tools don't change. The model's interface doesn't change. One operation, propagated everywhere.

Current status and what's next

MVP is deployed on Cloudflare Workers. The landing page is embedded in the Worker itself (Hono handles the HTML route). Repo is at deeflect/universal-codemode - MIT licensed, free to use.

What's still in progress:

Custom domain (cm. dee. ad is planned)
Real-world testing with Claude Code and Cursor agents, not just the E2E test suite
Seeding more API specs into the catalog - 56 is a start but the long tail of useful APIs is way bigger
Per-catalog auth configuration for the hosted version so agents don't need to pass credentials explicitly

The honest status: this is a working MVP with real test coverage, not a demo. But it hasn't been battle-tested in production with real agent workloads at scale yet. That's the next phase.

The spec seeding problem is actually interesting. There are thousands of APIs with published OpenAPI specs - the APIs. guru directory catalogs over 2,000 of them. Bulk-ingesting from that kind of source is on the roadmap. The architecture already handles it - it's a data pipeline problem, not an architectural one.

Why two tools is the right number

There's a version of this project where I expose five tools - search, preview_spec, execute, list_catalogs, check_health. I went back and forth on it.

Two is correct. Here's why.

Every tool you add to an MCP server is cognitive overhead for the model. The model has to decide which tool to use before it uses it. With search and execute, the decision tree is: do I know exactly which endpoint to call? No, search first. Yes, execute directly. That's it.

More tools means more opportunities for the model to pick the wrong one, more tokens describing the tool schemas, more edge cases in your implementation. The constraint of two tools forced me to make each tool more capable rather than adding escape hatches.

This is the same reason Unix pipes work. A small number of composable primitives beats a large number of specialized commands every time.

I tested a three-tool version with an explicit preview_spec step between search and execute. In theory it lets the model inspect a full spec before committing to an execute call. In practice, the model just used search and execute 95% of the time anyway and the extra tool added noise to every session context. Cut it.

The lesson: when you're designing a tool interface for models, err toward fewer, more capable tools. Models are good at using tools creatively. They're less good at picking between tools when the distinctions are subtle. Make the distinctions obvious by minimizing the surface area.

Building this as part of a larger agent stack

Universal CodeMode is one piece of my broader setup. If you're curious about how the agent system it connects to actually works - borb, OpenClaw, the cron-based orchestration layer - check out the 31 Rust CLI tools for agents for background, or browse the why prompt engineering died for more on how I think about building agent systems.

The short version: I run multi-agent workflows where different models handle different job types. Having one MCP server that gives all of them access to 56+ APIs without per-model tool configuration changes is the kind of infrastructure win that compounds. Less config, lower token costs, one place to update when an API spec changes.

Every agent in the system gets the same two tools. Sonnet, Codex, Gemini Flash, Opus - they all connect to the same universal server and they all work identically. When I add a new API to the catalog, every agent can use it immediately. Zero redeployment, zero config changes, zero new tool descriptions to fit in context.

That compounding is the real argument for this pattern. Each individual win - fewer tokens, less config, one deployment - is incremental. All of them together, across every agent, across every session, across every API call, adds up to something that materially changes what I can run sustainably as a solo builder.

If you're building anything in this space - agents that need to talk to external services, MCP server implementations, OpenAPI tooling - the repo is open. Issues and PRs welcome. I'm particularly interested in feedback from people running this with Claude Code or Cursor in real projects.

The pattern works. The implementation is early. Both of those things can be true.

Prompt engineering is dead

Dmitry (Dee) Kargaev — Sat, 10 Jan 2026 00:00:00 +0000

Prompt engineering as a skill is dead. I know that's a spicy take. I also know it's correct.

I've been building AI agent systems since March 2023. Went deep on prompt engineering early - custom GPTs with specialized knowledge bases, engineered prompt systems for UX research, meal tracking, viral content creation, planning assistants. I did the whole thing. Followed the guides, read the papers, obsessed over system prompt structure.

And somewhere in the last 12 months I realized: almost none of that matters anymore. The skill I spent real time developing got commoditized. What replaced it is harder, more valuable, and almost nobody's talking about it correctly.

Why prompt engineering as a skill is dead

Prompt engineering had about 18 months as a legitimate technical differentiator. Call it mid-2022 to late-2023. During that window, knowing how to coax useful output from a model was genuinely non-obvious. Chain-of-thought prompting improved outputs meaningfully. Few-shot examples helped a lot. Persona framing, careful instruction ordering, output format control - all of it made a real difference because the base models needed the help.

Then the models got better. Fast.

GPT-4 Turbo, Claude 3, Gemini 1.5 - these models understand intent. You don't need to hand-hold them through a task with elaborate prompting rituals anymore. Chain-of-thought? Current models do it automatically without being told to "think step by step." Few-shot examples? Models infer from conversational context. Carefully structured personas? You can write "you're a helpful assistant that specializes in X" and it works fine.

The elaborate stuff people were selling as advanced prompt engineering? It was always just "learning to give clear instructions to something that needed clearer instructions." That's not a skill. That's communication.

Here's the tell: if prompt engineering were a real technical skill, you'd need to understand something about how the system works to get better at it. But most prompt engineering advice is just... good writing advice with extra steps. Be specific. Give examples. State your constraints. These aren't insights about AI - they're basic communication principles.

The YouTube courses, the "certified prompt engineer" bootcamps, the LinkedIn posts about prompt frameworks - that whole industry built up around a skill that had a two-year shelf life. I'm not dunking on the people who built that content. It was real value at the time. It's just not the bottleneck anymore.

What actually killed it

Three things landed simultaneously and the combination was lethal for prompt engineering as a career path.

Better base models that understand intent. I can give Claude Sonnet a roughly-worded, slightly ambiguous instruction and it'll figure out what I mean. I don't need to craft it like a legal contract. The model's interpretive ability improved faster than the complexity of what people wanted to do with it.

Tool use and function calling. This is the big one. A huge chunk of what prompt engineering was trying to do was get models to simulate capabilities they didn't actually have. "Imagine you have access to a search engine..." Now you just give it a search tool. The prompt gymnastics were a workaround for the absence of real tool integration. Now that tool calling is standard, you don't need the workaround.

The realization that it's managerial, not technical. The best prompt engineers were good writers and good managers - people who could specify what they wanted clearly. That's a useful skill. It's not a technical skill. An engineer who spent years learning systems programming has depth that transfers. A "prompt engineer" who spent years optimizing instruction phrasing has a skill that's been automated by model improvements.

What replaced it: agent architecture and tool design

I run a multi-agent system called OpenClaw. 15+ specialized agents, each running a different model, handling different parts of my daily workflow - morning digests, research, code tasks, memory management, content scheduling, crypto position monitoring. The system processes around 40K tokens a day. Monthly API cost is about $40 because I'm aggressive about model selection.

The prompts for each individual agent? Maybe 20 lines. Sometimes less. Nothing fancy.

The orchestration code is thousands of lines. The tool integrations are where most of the work lives. The hard thinking went into: which agent handles what, how agents hand off context to each other, what happens when an agent fails, how memory persists across sessions, which model is cheap enough for a given task while still being capable enough to not screw it up.

That's the new skill. And it's actually a skill - one that requires understanding systems, trade-offs, failure modes, and cost structures.

Here's what the real work looks like now:

Model selection by task. I don't use one model for everything. Opus handles orchestration decisions that require judgment calls. Sonnet handles fast tasks where I need speed over depth. Gemini Flash handles research synthesis because it's cheap and good at skimming large amounts of text. Codex handles code because it's purpose-built. Picking the wrong model for a task either burns money or degrades output quality. That selection logic matters far more than prompt phrasing.

Memory architecture. A single model call is stateless. Real agent systems aren't. How you persist context across sessions, what you include in each agent's working memory, when to summarize versus retain full conversation history - these decisions affect whether your system stays coherent over time or slowly degrades into confusion. My system uses markdown files (MEMORY.md, SOUL.md, AGENTS.md) that agents read from on each run. It's simple and it works. Getting there took iteration.

Error recovery. Model calls fail. APIs go down. An agent returns a response that's correctly formatted but semantically wrong. What does your system do? Does it retry with the same prompt? Escalate to a more capable model? Log the failure and skip? Notify you? Building robust error handling into agent pipelines is where I've spent more time than on any individual prompt.

Tool design. When you build tools for your agents to use, you're essentially designing an API for a semi-autonomous system. The tool needs to do one thing clearly. The input schema needs to be simple enough that a model will call it correctly. The output needs to be structured in a way the model can reason about. Bad tool design leads to agents that can't actually use their tools effectively - and you'll spend time debugging model behavior when the real problem is your tool interface.

If you want to go deeper on how I think about agent systems and the principles behind building them, check out the universal MCP server for more context on where I'm coming from.

The MCP shift and why tool use is the real frontier

Model Context Protocol (MCP) is the clearest signal of where this is all heading. Anthropic published the spec, it's being adopted fast, and it formalizes something I've believed for a while: the future of AI capability is real tool access, not better instructions.

The mental shift is this: instead of engineering a prompt that makes a model pretend to have access to data, you give it actual tools to access real data. Instead of "imagine you're an expert at X with access to Y," you connect it to Y's API and let it query directly.

I built a universal MCP server with 56 API integrations. Notion, GitHub, Slack, calendar, weather, crypto data, Spotify, health APIs, web search, memory storage. My agents don't simulate access to these systems. They actually query them. The outputs are real, current, and accurate in ways that no prompt engineering trick could achieve.

That server took weeks to build. The prompts I wrote for the agents that use it took hours. If you're calibrating where to invest your time, that ratio is your answer.

The MCP GitHub repo has a solid list of existing server implementations if you want to see what's available before building your own. Don't rebuild what's already there.

Function calling and tool use aren't features bolted onto models - they're the architecture shift that makes models actually useful in production. A model that can query a live database, run code, check current prices, read a file, and send a message is categorically different from a model that only outputs text. The difference isn't prompt quality. It's what the model has access to.

What this means if you're building

If you're still spending serious time on prompt templates and prompt libraries, I'd push back on that investment. Not because prompts don't matter - they do, a little - but because the return on that investment has dropped sharply while the return on tool-building and orchestration knowledge has gone up.

The skills that matter right now:

Systems thinking. How do components fail? What are the dependencies? Where does state live? These are the questions that determine whether a multi-agent system works reliably or collapses randomly.

API integration. The ability to connect to external systems, understand their data models, handle their rate limits and errors, and build clean interfaces over them. This is the "prompt engineering" of 2025 - not glamorous, high leverage.

Evaluation. How do you know your agent system is working? Not just "it didn't crash" but "it did the right thing." Building evals for AI systems is hard, undervalued, and increasingly important as these systems touch real workflows. Hamel Husain has written well on this.

Cost optimization. Running agents at scale costs money. Knowing how to profile token usage, pick the right model tier for each task, and cache aggressively is a real engineering skill. I got my 15-agent system to $40/month through iteration, not luck.

For context on the 3 months of debugging agents, there's more there - I write about this stuff as I build.

The honest version of where prompts still matter

I want to be fair here. Prompts aren't zero-value. There are still cases where prompt quality meaningfully affects output quality.

Highly specialized domains where the model needs tight constraints. Anything involving consistent output formatting that downstream code parses. System prompts for agents that need specific behavioral guardrails. Evaluation prompts where you're asking a model to grade other model outputs.

But notice: these are narrow, specific cases. And even here, the prompt is usually less than 10% of the total engineering work. The rest is the surrounding system.

The people who'll tell you prompt engineering is still a high-value career skill are usually selling prompt engineering courses. The people who are building real systems with AI have mostly moved on.

If you're newer to this and trying to calibrate what to learn: spend maybe two weeks understanding how prompts work and what affects model behavior. Then spend the next six months learning to build systems. That's the right ratio.

The model will follow good instructions just fine. The question is what system you build around it.

Start with one tool. Connect your agent to one real API. See how differently the model behaves when it has actual access to something instead of simulated knowledge. That single experience will do more to reframe your thinking than any prompt engineering course.

I Stopped Posting on Twitter for 2 Months

Dmitry (Dee) Kargaev — Mon, 15 Dec 2025 00:00:00 +0000

I stopped posting on Twitter for two months. Not a planned break, not a "digital detox," not a strategic rebranding pause. I just... forgot. This is what actually happened when I disappeared from X (Twitter) for October and November 2025, and what I learned about taking breaks you didn't mean to take.

How I stopped posting on Twitter for 2 months without meaning to

September 6 was my last post before the gap. A week later I tweeted "staying away from X for a few days, wonder if it ruins reach" and then proceeded to vanish for two full months instead of a few days.

I wasn't planning that. There was no decision point where I said "I'm taking a break." I was building. Seven apps in parallel, deep in agent architecture, ADHD hyperfocus locked in. Twitter stopped feeling like a place I existed in. Not because I was boycotting it or burned out on the discourse. I just got absorbed and the habit broke.

That's the honest version. Not a detox story. I didn't meditate more. I didn't reclaim my attention span through discipline. I got pulled into something more interesting and social media fell off naturally. That's how ADHD actually works - when something grabs you hard enough, everything else gets crowded out.

The gap ran October through November. I came back January 26, 2026 with one post: "I'm back because of Clawdbot meta." No apology, no "I've been doing some reflection," no thread about what I learned. Just back.

What actually happened to reach when I stopped posting on Twitter

Short answer: yes, disappearing kills your numbers. I came back to impressions that were noticeably down from where they were in September. The algorithm punishes inconsistency in ways that are both predictable and annoying.

Here's what surprised me though - my follower count barely moved. The people who followed me for real reasons didn't unfollow during a two-month gap. They just... waited. Or forgot I existed but kept the follow anyway, which is functionally the same thing.

What did drop off were the engagement-farmers. The follow-back accounts, the people following hoping for a mutual, the ones gaming numbers. When I stopped posting, I stopped being useful to them. They left. Good.

So the actual damage from a two-month absence was:

Impressions down significantly on return
Algorithmic reach basically reset
Genuine followers intact
Low-quality followers gone

That's not catastrophic. It's annoying if you're trying to grow on a consistent curve, but it's not irreversible. Coming back with something real to say matters more than the reach penalty.

The platform rewards consistency, but it doesn't erase you for breaks. It's not that vindictive. It just forgets you for a while and you have to re-earn distribution. Which, to be clear, still sucks. But it's survivable.

Research from the Reuters Institute Digital News Report backs this up - audiences don't actively track individual creator absences the way creators fear they do. People are mostly watching the feed, not waiting for you specifically.

Building in public culture and why it makes breaks feel worse than they are

There's a specific kind of pressure that comes with building in public. The implicit rule is you have to be visible. Posting daily "day 47 of building X" content. Sharing every milestone. Being present enough that people remember you exist.

Miss a week and you feel like you're falling behind. Miss a month and it feels like career death. I'm not immune to this feeling. I've been building in public for long enough to have internalized the assumption that visibility = momentum.

But the two months proved that assumption wrong in at least one direction.

I built more in those two silent months than in the three months of posting before them. Seven apps. Real infrastructure. Clawdbot, which ended up being the thing I came back to announce. When you're not performing the work, you're just doing it. Turns out those are different modes.

The building in public model optimizes for consistency of output, not quality of output. There's value in that - accountability, community, people following along with the journey. But it can also turn into a content treadmill where the posting becomes the thing instead of the building.

I'm not saying building in public is bad. I'll keep doing it. But I'm now aware that the ADHD hyperfocus mode where I forget Twitter exists and just build for two months is also a valid mode. Maybe more productive in certain phases.

The ADHD angle: forgetting to post is not a failure

Everyone talking about "intentional breaks" from social media has the same energy: "I realized I needed to step back and prioritize my wellbeing." Very deliberate. Very curated. Very LinkedIn.

That wasn't this.

I didn't decide to take a break. The habit just... dissolved. I was on a 2am coding session in early October, deep in something, and the thought of tweeting about it didn't occur to me. Same the next night. By week two the pattern was just gone.

This is textbook ADHD. The executive function overhead of maintaining a social media posting habit - opening the app, forming a thought worth sharing, hitting post, checking the response - that whole loop requires a certain baseline attention budget. When something captures all of it, the habits that depend on leftover attention just stop running.

The ADHD and executive function research out of Harvard Medical School explains this pretty well - executive function isn't a moral failing, it's a resource allocation problem. When the resource is fully committed elsewhere, discretionary habits are the first to drop.

I've made peace with this. It's not undisciplined, it's just how my brain allocates. The flip side of forgetting to post for two months is also why I can build seven apps in parallel while maintaining an agent system that runs 14 cron jobs. Same mechanism, different outputs.

If you have ADHD and you've done this - gone silent for weeks because you were deep in something - it's not a failure mode. It's just the cost of the hyperfocus that also lets you ship faster than most people.

The trick is not building your brand strategy on a foundation that requires daily consistency you won't reliably deliver. Build it on depth instead. One good post after two silent months is worth more than sixty filler posts.

What "coming back" actually looks like

January 26. "I'm back because of Clawdbot meta."

That was the whole post. No explanation, no recap thread, no "here's what I learned while I was away" (except this post, I guess). Just the most direct possible signal: I exist, I have a reason to be back, here's the thing.

This felt right. The alternative - the big return post with the reflective thread - felt like it was performing a story instead of just getting back to work. The people who care will engage with the work. The people who need a narrative about why you were gone aren't really your audience anyway.

I did get some "welcome back" responses. More than I expected, honestly. A few people had noticed the gap and were curious what happened. That was kind of nice - it meant the previous presence had registered as real enough that the absence was notable.

But the bigger signal was that nobody was mad. Nobody had been waiting with a timer. The internet doesn't work that way. People move on, the feed keeps moving, and when you come back with something worth seeing, you get traction again.

What I'd tell someone about to take (or accidentally start) a social media break

Not going to frame this as advice, because I didn't plan any of it. But if you're reading this because you've already been gone for a month and you're wondering if you've tanked your presence - you probably haven't.

A few things that are actually true from experience:

Genuine followers don't leave during a two-month absence. The people who followed you for real reasons are still there. The follower count number that matters is quality, not quantity, and quiet people who actually care about your work have more patience than the algorithm does.

Your reach will take a hit and that's fine. You'll rebuild it. Reach is lagging indicator of consistency, and consistency can be rebuilt faster than you think when you come back with something real. I came back with Clawdbot. That gave me actual things to say.

The building you do during the silence compounds. Those two months produced more than the three months of documented-daily-grind before them. There's something to that. Not every phase of building should be public. Some of it needs to be quiet.

The "building in public" pressure is real and mostly self-imposed. The audience you're building in public for is smaller and more patient than the anxiety makes it seem. If you're good at what you do and you come back with evidence of it, people remember.

And if you have ADHD and you just disappeared because something grabbed you - that's the mechanism working, not failing. The work you did in the silence is the asset. The posts are just distribution for it.

What stopped posting on Twitter for 2 months actually cost me (and what it didn't)

I have how I grew to 500 followers where you can see the full context of what I work on - design background, fintech, AI engineering, all of it. None of it died during two months off Twitter.

The projects I was building kept building. The systems kept running. The professional relationships that matter don't live on Twitter anyway - they live in Discord servers, in direct messages, in shipped product people can actually use.

Twitter is distribution. It's a real tool and a decent one for this type of work. But it's not the substrate. The work is the substrate.

The two months off X taught me that more than anything. When the posting stopped, nothing important stopped with it. The important stuff was already running somewhere else - in the codebase, in the agent system, in the products actually getting built.

Coming back felt like turning a tool back on. Not like returning from exile.

That's the right relationship to have with it. Check what I was building instead to see what came out of those two silent months. And if you want context on how Twitter's algorithm actually handles inactive accounts, X's own creator documentation doesn't spell it out cleanly - but the pattern from third-party analyses is consistent: reach drops fast after ~2 weeks of silence, then stabilizes, then rebuilds within a few weeks of returning.

The algorithm is back to punishing me for the gap. I'm fine with it. The seven apps I built during the silence are more valuable than consistent impressions metrics would've been. Trade you'd make again without thinking about it.

My coding stack is 4 models deep

Dmitry (Dee) Kargaev — Wed, 10 Sep 2025 00:00:00 +0000

Nobody uses one AI model for coding anymore. Nobody serious, anyway. My multi-model coding workflow has me running four-plus models in sequence on most days, and I ship more in a week than I used to in a month. That's not hype - that's what happens when you stop treating AI coding tools like a single hammer and start treating them like a crew.

Here's the actual stack, why it's built this way, and what most people get wrong about "vibe coding."

What "vibe coding" actually means

It's not "ask ChatGPT to build me an app." That's how you get a 400-line index.js with no error handling that half-works for 20 minutes before collapsing.

Real vibe coding is coding with leverage. You still need to understand what the code does. You still need to catch hallucinations. You still need to make architectural decisions - maybe more consciously than before, because the AI will happily scaffold the wrong architecture at 10x speed if you let it.

What changed: the AI handles the typing and the boilerplate. The parts that used to eat 60% of my time - setting up file structure, writing CRUD routes, building form components I've built a hundred times - that's mostly automated now. What I bring is the product sense. Knowing what to build is harder than knowing how to build it. The "how" is commoditized. The "what" and "why" still aren't.

My background is product design. Ten-plus years of it, including a long stint as sole designer at a fintech platform doing $4B+ in digital assets. That taught me to think about systems - how pieces connect, what breaks under edge cases, where the user actually gets stuck. That context is the thing AI can't replace. It's why a designer who codes can actually get further with these tools than a CS grad who just knows syntax.

The four-model lineup

These are the main players. Each one has a job.

Grok (grok-code-fast) - The fast pass. I use this first. It's cheap, it's quick, and it's genuinely good at scanning code for obvious issues - logic errors, missing edge cases, things that'll blow up. I don't use it for fixes. I use it for triage.

Claude Code - The surgeon. Best AI I've used for understanding context and making targeted edits. If I have a function that's almost right but subtly broken, Claude finds the problem without bulldozing the surrounding code. It reads the file, understands the intent, and makes the minimal correct change. This is rare. Most models over-edit.

Codex - The builder. Heavy refactors, multi-file changes, greenfield scaffolding. When I need to restructure a whole module or build something from scratch, Codex is the move. It's not as precise as Claude on targeted edits but it handles scale better. It'll touch 8 files in sequence and keep things coherent.

Gemini (Flash or Pro depending on context) - The reader. When I need to analyze a large codebase or understand a sprawling chunk of code I didn't write, Gemini's context window is unmatched. I'll drop 50K tokens of code in and ask it to explain the data flow. It handles that better than anything else I've used.

That's the core four. v0 gets a slot for frontend component generation. It's the fastest way to get a working UI component I can then iterate on with Claude Code.

How the actual pipeline runs

Real example of how a feature goes from idea to done:

I write a rough spec in plain text. What the feature does, what the inputs and outputs are, what edge cases matter. This step is underrated. The cleaner my spec, the better the AI output at every stage.
If I'm building frontend: I hit v0 first. Describe the component, get something that renders. It won't be perfect but it's 80% of the way there in two minutes.
Grok does a fast pass on whatever exists so far. I'm asking it: "what's obviously wrong here? what breaks?" Quick, cheap, catches the low-hanging problems.
Claude Code handles the refinement. Takes the rough output, applies targeted fixes, handles the business logic, integrates with the rest of the codebase. This is where most of my back-and-forth happens.
If Claude hits a wall - something too structurally complex, or needs changes across multiple files - I hand it to Codex. Codex finishes the heavy lifting.
Grok reviews the final output. Another fast pass. At this point I'm looking for regressions and anything the other models introduced.
I read the code. Final check. I don't skip this.

That last step isn't optional. You cannot ship AI-generated code you haven't read. The hallucinations are sneaky. They produce code that looks right and runs right in tests but breaks in production on a Tuesday when someone does something slightly unexpected. Your name is on the commit. Read the code.

My multi-model coding workflow for backend vs frontend

Different problem types call for different entry points.

Frontend

Start with v0. Describe the component - what it does, any constraints, the general look. Get the initial render. Then switch to Claude Code for every iteration after that. Claude is better at "change this specific behavior" than v0 is once you're past the initial generation.

For anything that needs complex state management or integration with existing backend types, I'll write that part myself or with Claude Code from scratch rather than trying to wrangle v0's output. v0 is a starting point, not a production artifact.

Backend

Codex for scaffolding. Give it the data model, tell it what the API needs to do, let it build the structure. Claude for the business logic layer - anything that involves conditionals, data transformation, edge cases. Debugging: paste the error into whatever's fastest. Usually Grok for the first look, Claude if it needs deeper investigation.

For architecture decisions - where to put things, how services should talk to each other, what the data model should look like - I make those myself. I'll ask Claude or Grok to rubber-duck a decision with me, but I'm not delegating architecture to a model. That's the one place where the AI's lack of business context will hurt you.

The MCP layer

This is the part most people aren't talking about yet.

MCP (Model Context Protocol) tools change how much context your AI coding setup can actually see. Without good context, you're pasting code snippets manually and hoping the model understands the surrounding system.

The one I use daily: sammcj's devtools MCP. At ~9K tokens it gives me basically everything meaningful about a codebase - file structure, dependencies, key functions. I drop this context at the start of any substantial session and the model quality jumps immediately. Less hallucination, more accurate edits, better architectural suggestions.

If you're using AI coding tools without MCP integration, you're leaving a lot on the table. The model can only help you as much as it understands the system. Give it the context.

What this costs

People act like the price of these tools is a dealbreaker. It's not. But you should know what you're actually paying.

I'm on OpenAI Pro ($200/month). That's basically required if you're using Codex for real work - you need the usage headroom. Claude Pro or Max depending on the month. Grok is included in X Premium, which I'd pay for anyway.

All in, I'm spending around $350-400/month on AI tooling. I get somewhere between $800-1,200 of nominal usage value out of it based on what the APIs would cost at direct rates. And I ship maybe 3-4x faster than I did 18 months ago.

The mental model I use: I'm hiring a part-time engineer who's excellent at specific tasks, needs supervision, and occasionally hallucinates. At $400/month, that's a steal. The thing that makes it work is knowing which part-time engineer to call for which job. That's the whole game.

What breaks this workflow

Being honest about failure modes:

Bad specs kill everything. If I'm vague about what I want, every model in the chain produces something subtly wrong in a different way, and now I have five things to reconcile instead of one thing to fix. Time spent on the spec is time saved everywhere downstream.

Chaining without reading. Running model output straight into the next model without reviewing it first compounds errors. Grok flags an issue, Claude "fixes" it, Codex rebuilds around Claude's fix, and now the original problem is buried under three layers of AI decisions you didn't validate. Read between steps.

Using the wrong model for the task. Claude on a large multi-file refactor gets expensive and sometimes loses coherence. Codex on a targeted three-line fix is overkill and occasionally makes it worse. Matching the model to the job isn't optional.

Architecture by committee (with AIs). I did this once - asked three different models how to structure a feature, got three different answers, tried to synthesize them. Waste of time. Architectural decisions are mine. I use models to validate and poke holes, not to make the call.

The honest take on vibe coding

Most of the backlash against it is people who watched someone generate slop and extrapolated. Most of the hype is people who generated something that worked for 10 minutes and declared victory.

The actual version is somewhere more boring and more useful: AI handles commodity work at speed, you handle judgment. The skill ceiling on building shifted. You don't need to memorize syntax. You do need to know what good architecture looks like, how to write a tight spec, when the AI is confidently wrong, and what the user actually needs.

I couldn't do this workflow without 10 years of product and design experience. Not because the tools are hard - they're not. But because the inputs I give them are shaped by that experience. The prompts are good because I know what I want. The reviews catch problems because I know what breaks. The architecture holds because I've seen what doesn't.

If you want to build a similar stack: start with one model and learn it well. Understand its failure modes. Then add another model where you keep hitting walls. Don't adopt the whole thing at once - you'll just have four sources of confusion instead of one.

Read everything you ship. That's the one rule that doesn't have exceptions.

You can see more about what I'm building and how I think about this stuff why prompt engineering is dead. And if you want to dig into related topics - multi-agent systems, prompt engineering, tool comparisons - check out the MCP server I built to find threads that go deeper.