<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mladen Stepanić</title>
    <description>The latest articles on Forem by Mladen Stepanić (@crawleyprint_71).</description>
    <link>https://forem.com/crawleyprint_71</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F226731%2Ff2bfeea3-c45e-4ea5-8059-0845c68de321.jpg</url>
      <title>Forem: Mladen Stepanić</title>
      <link>https://forem.com/crawleyprint_71</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/crawleyprint_71"/>
    <language>en</language>
    <item>
      <title>Remote Slop with Claude Code</title>
      <dc:creator>Mladen Stepanić</dc:creator>
      <pubDate>Fri, 20 Mar 2026 21:42:38 +0000</pubDate>
      <link>https://forem.com/crawleyprint_71/remote-slop-with-claude-code-329c</link>
      <guid>https://forem.com/crawleyprint_71/remote-slop-with-claude-code-329c</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/crawleyprint_71/workflow-engineering-prompt-engineering-3pep"&gt;last post on agentic workflows&lt;/a&gt;, I talked about workflow engineering — how the pipeline you design around AI matters more than the AI itself. I built a book inventory app that way. Skills, OpenSpecs, beads, parallel agents in tmux. It worked great.&lt;/p&gt;

&lt;p&gt;But there was an asterisk I didn't mention: I was sitting at my desk the whole time.&lt;/p&gt;

&lt;p&gt;What happens when you're not?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Claude Code recently got a Telegram integration. The idea is simple: you control Claude Code from your phone through a Telegram bot. Same skills, same workflow, same project — different interface. If you've used the plugin system at all, setup is straightforward. Documentation walks you through it, you're chatting with your agent in minutes.&lt;/p&gt;

&lt;p&gt;In fact, as of today, Anthropic officially shipped this as &lt;a href="https://code.claude.com/docs/en/channels" rel="noopener noreferrer"&gt;Claude Code Channels&lt;/a&gt; — a plugin-based feature that lets you push messages from &lt;a href="https://telegram.org" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt; or &lt;a href="https://discord.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; into a running Claude Code session on your machine. Your session processes the request with full filesystem, MCP, and git access, then replies through the same chat. It's built on MCP, which means it slots into the existing plugin ecosystem cleanly. I ran my experiment the day it launched, so what you're reading is a day-one field report.&lt;/p&gt;

&lt;p&gt;I wanted to test this properly, so I gave it a real task. Not a toy. A refactoring session on an existing codebase, the kind of thing I'd normally spend a focused afternoon on. Except today, I wasn't at my desk and I wasn't focused. I was doing something else, checking Telegram between other things.&lt;/p&gt;

&lt;p&gt;Buckle up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Permission Wall
&lt;/h2&gt;

&lt;p&gt;Immediately: a wall.&lt;/p&gt;

&lt;p&gt;Claude Code has a permission system. It asks before it does anything potentially destructive — file writes, shell commands, external calls. At your desk, this is fine. You see the prompt, you approve, you move on.&lt;/p&gt;

&lt;p&gt;From Telegram? The bot doesn't forward those permission prompts. Your agent hits a permission check, and it just... stops, silently. You're staring at Telegram wondering why nothing is happening, and the answer is that Claude is staring at your terminal wondering why you're not approving anything.&lt;/p&gt;

&lt;p&gt;This is the first thing you'll hit, and there's no elegant workaround. Either you solve the permission problem or you don't use this workflow.&lt;/p&gt;
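&lt;p&gt;One partial mitigation worth knowing about: Claude Code can pre-approve specific tools through its settings file, so routine reads, edits, and known-safe commands never prompt at all. A rough sketch of the idea (the patterns below are illustrative, so check the current docs before copying):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/settings.json:

{
  "permissions": {
    "allow": [
      "Read",
      "Edit",
      "Bash(git commit:*)",
      "Bash(bun test:*)"
    ],
    "deny": [
      "Bash(rm:*)"
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It helps, but an allow-list can't anticipate everything a full day of unsupervised work will want to run.&lt;/p&gt;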

&lt;h2&gt;
  
  
  The Context Ceiling
&lt;/h2&gt;

&lt;p&gt;Second problem: context. Opus ships with a 1M token context window, which sounds like a lot. And it is — for a focused session. But "entire day, away from your machine, no way to reset" is a different budget. You can't &lt;code&gt;/clear&lt;/code&gt; the session from Telegram. If the conversation gets heavy, you can't start fresh. You're stuck with whatever context you've accumulated.&lt;/p&gt;

&lt;p&gt;For a day of casual back-and-forth this turned out to be manageable. But it's something you have to plan around, not something you can ignore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Biting the Bullet
&lt;/h2&gt;

&lt;p&gt;So I did the thing you're not supposed to do: &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I know. The flag name exists for a reason. But here's my reasoning: I wasn't installing new external dependencies. My Claude Code workflow is already scoped — skills are loaded, the project is defined, the agent knows its boundaries.&lt;/p&gt;
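&lt;p&gt;For the record, the launch is a single flag on the CLI. Run it only inside a project you can afford to lose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude --dangerously-skip-permissions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;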

&lt;p&gt;And it worked. The permission wall disappeared. The agent could actually run.&lt;/p&gt;

&lt;p&gt;Not great, not terrible. I'm still not comfortable disabling guardrails as a general practice. For this specific experiment, with this specific setup, it was a calculated risk. Use at your own discretion. Or better — don't. Anyway, don't blame me if Claude decides your &lt;code&gt;main&lt;/code&gt; branch needs to have a different history and you don't have branch protection in place. You've been warned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Saving Context With Agent Teams
&lt;/h2&gt;

&lt;p&gt;The context problem needed a different solution. If I can't clear the session, I need to use less of it.&lt;/p&gt;

&lt;p&gt;This is where the workflow from my previous post paid off. If you haven't read it: I use a pipeline where work gets decomposed into &lt;em&gt;beads&lt;/em&gt; — small, focused tasks based on Steve Yegge's &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;beads concept&lt;/a&gt;. Each bead is scoped tightly enough that a sub-agent can pick it up and run with it without needing the full conversation history.&lt;/p&gt;

&lt;p&gt;So I instructed the main agent to delegate aggressively. The pipeline looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I describe what I want via Telegram&lt;/li&gt;
&lt;li&gt;Main agent runs a spec creator sub-agent that generates an OpenSpec (a structured definition of the change) and opens a draft PR on GitHub&lt;/li&gt;
&lt;li&gt;I review the spec from my phone and approve&lt;/li&gt;
&lt;li&gt;Spec gets decomposed into beads&lt;/li&gt;
&lt;li&gt;Sub-agents pick up beads, implement, commit, and update the PR &lt;/li&gt;
&lt;/ol&gt;
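&lt;p&gt;The standing instruction to the main agent was essentially a delegation policy. Paraphrased from memory, not the literal prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;you are a coordinator, not an implementer
- for every request, spawn a spec creator sub-agent first
- once I approve a spec, decompose it into beads
- hand each bead to a fresh sub-agent; never implement in the main session
- report back only with a spec link, a PR link, or a blocker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;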

&lt;p&gt;The main session barely touched any implementation detail. It just coordinated. By the end of the day, I'd used about 30% of the context window. Granted, I wasn't going crazy with requests — but the pattern held up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Claude Code on the Web?
&lt;/h2&gt;

&lt;p&gt;Fair question. Claude Code has a web interface now. I could've used that from my phone. No Telegram bot, no permission hacks.&lt;/p&gt;

&lt;p&gt;The answer is boring: my tools. My skills are loaded locally. My workflow is configured. I have Playwright set up for visual verification — I literally had the agent screenshot pages to confirm layout changes actually landed. That's not something you get from the web interface.&lt;/p&gt;

&lt;p&gt;When you've invested in a workshop, you want to use it. Even from a distance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Slow Loop
&lt;/h2&gt;

&lt;p&gt;Here's the real downside, and it's not about permissions or context.&lt;/p&gt;

&lt;p&gt;The feedback loop is slow.&lt;/p&gt;

&lt;p&gt;At my desk, I have hot reload. I change something, I see it instantly. From Telegram, the loop is: agent pushes to GitHub → Vercel builds a preview → I check the preview on my phone. That's minutes, not milliseconds. For layout work especially, it's painful. You're doing the development equivalent of texting someone in the next room instead of just talking to them.&lt;/p&gt;

&lt;p&gt;I could live with it because I was mostly multitasking — checking in on the agent between other things. But if this were my primary way of working? The latency would get to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;I had Claude Code analyze the session after the fact. Here's what a day of remote agent work actually looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tool calls&lt;/td&gt;
&lt;td&gt;507&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents spawned&lt;/td&gt;
&lt;td&gt;22 across 5 teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRs created / merged&lt;/td&gt;
&lt;td&gt;5 / 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegram messages&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright interactions&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenSpec tasks verified&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Agents
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;QA&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spec Writer&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviewers&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standalone&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Crop inset fix&lt;/td&gt;
&lt;td&gt;merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remove impressum crop&lt;/td&gt;
&lt;td&gt;merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scrollable mapping&lt;/td&gt;
&lt;td&gt;merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reading status&lt;/td&gt;
&lt;td&gt;in review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenSpec cleanup (3 lists)&lt;/td&gt;
&lt;td&gt;76 tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;507 tool calls, 22 agents, 34 Telegram messages, 5 PRs. From my phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;At the end of the day, I sat down and did what I always do: reviewed the code myself. Read every change, checked every decision. And it was fine. A few minor optimizations I would've caught in real-time at my desk, but nothing structural. Nothing that made me regret the experiment.&lt;/p&gt;

&lt;p&gt;So here's the honest scorecard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt; Your full workflow, from Telegram. Skills, beads, agent teams, even Playwright screenshots. If you've built a good pipeline, it travels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What doesn't:&lt;/strong&gt; Permission prompts don't reach you. Context can't be reset. The feedback loop trades seconds for minutes. And you'll probably end up running with &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;, which is exactly as comfortable as it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who is this for:&lt;/strong&gt; Someone who's already set up their Claude Code workflow and wants to keep things moving while away from their desk. Not as a primary dev environment — as an extension of one you already trust.&lt;/p&gt;

&lt;p&gt;Would I do it again? Yeah, probably. But I'd plan the work differently. Bigger, well-defined refactorings that don't need rapid visual feedback. The kind of work where you can fire and forget, then review later. Not pixel-pushing. Not exploratory coding. Structured changes through a structured pipeline.&lt;/p&gt;

&lt;p&gt;Remote slop? A little. But manageable slop, with a review step at the end. And sometimes that's enough.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Stop Repeating Yourself. Stop Repeating Yourself. No, Seriously — Put It in a Skill.</title>
      <dc:creator>Mladen Stepanić</dc:creator>
      <pubDate>Sat, 07 Mar 2026 14:24:11 +0000</pubDate>
      <link>https://forem.com/crawleyprint_71/stop-repeating-yourself-stop-repeating-yourself-no-seriously-put-it-in-a-skill-4gha</link>
      <guid>https://forem.com/crawleyprint_71/stop-repeating-yourself-stop-repeating-yourself-no-seriously-put-it-in-a-skill-4gha</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/crawleyprint_71/workflow-engineering-prompt-engineering-3pep"&gt;last post&lt;/a&gt; I talked about how the workflow is the work, how designing the pipeline matters more than any individual prompt. I still believe that. But I've now hit the next layer of the onion: what happens when the pipeline itself becomes repetitive?&lt;/p&gt;

&lt;p&gt;I've been using Claude Code across five projects. A React + Hono book inventory app, a legacy .NET modernization, a Swift macOS menu bar utility, a Rust audio DAW, and a Rust TUI for task visualization. Different languages, different domains, apparently very similar habits. And I didn't notice until I asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt That Started It
&lt;/h2&gt;

&lt;p&gt;Credit where it's due. &lt;a href="https://x.com/chintanturakhia/status/2030089465679728763" rel="noopener noreferrer"&gt;Chintan Turakhia posted a tweet&lt;/a&gt; saying "Run this prompt frequently. You're welcome." alongside a screenshot of a Claude Code prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape all of my claude sessions on this computer. give me a breakdown
of all the things i do, things that are worth making into skills vs
plugins vs agents vs claude.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I did. My version had typos, his didn't, but the idea was the same: ask Claude Code to introspect on itself. Read through every session I'd ever had and find the patterns I couldn't see because I was too close to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Found
&lt;/h2&gt;

&lt;p&gt;Claude spawned three subagents in parallel. One explored my global config and plugin structure. Another crawled across all my repos reading every &lt;code&gt;CLAUDE.md&lt;/code&gt; and &lt;code&gt;AGENTS.md&lt;/code&gt; file. The third dug into specific project architectures and specs. In about a minute, they'd collectively mapped my entire Claude Code ecosystem.&lt;/p&gt;

&lt;p&gt;The findings were equal parts validating and embarrassing.&lt;/p&gt;

&lt;p&gt;Validating because I clearly had strong workflows. I'd organically developed a &lt;a href="https://dev.to/crawleyprint_71/workflow-engineering-prompt-engineering-3pep"&gt;feature planning pipeline&lt;/a&gt;: openspec proposal, beads decomposition, worktree, implement, merge, clean up. I had a sophisticated agent team pattern with typed agents and context-aware respawning. I had a review pipeline with dedicated expert agents for security, architecture, and code quality. Real workflows. Stuff that worked.&lt;/p&gt;

&lt;p&gt;Embarrassing because I was re-stating the same ground rules in virtually every session. Never use Python, always use Bun, use Claude Code's native tools instead of sed and awk, agents must commit frequently but not step on each other, all features start with an openspec. These rules were scattered across &lt;code&gt;AGENTS.md&lt;/code&gt; files in every project, re-stated inline whenever Claude forgot, and occasionally contradicting each other between repos. I was spending real tokens and real time repeating myself to an LLM. The irony of a "workflow engineer" with a messy workshop is not lost on me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pivot
&lt;/h2&gt;

&lt;p&gt;I looked at the analysis and immediately decided to act on it. Same session, no break:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let's do the following:
- add recommended things to global claude.md and remove them from projects
- create a new project in ~/repos/ to host all of the recommended plugins
- implement openspec-and-beads skill
- implement agent-team skill
- write a comprehensive readme doc with implemented skills and a list of todo items

Do not connect any of the new skills, I'll make a github repo and publish them there
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the session pivoted from analysis to execution. And here's the thing: this pivot itself followed the pattern Claude had just identified. Analysis, planning, execution. The openspec-and-beads workflow, applied to itself. Recursive in the best possible way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Got Built
&lt;/h2&gt;

&lt;p&gt;Claude entered plan mode, loaded two meta-skills, &lt;code&gt;executing-plans&lt;/code&gt; and &lt;code&gt;writing-skills&lt;/code&gt;, and broke the work into three batches.&lt;/p&gt;

&lt;p&gt;First batch was the lowest-hanging fruit: creating a global &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with all my universal conventions, then surgically removing the duplicated rules from each project's config. Every repo got trimmed to only project-specific information. One file, in one place, read by every future Claude session. No more "never use Python" on repeat.&lt;/p&gt;
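&lt;p&gt;To give you a feel for it, the global file is just a short list of conventions. A condensed, illustrative excerpt (not my literal config):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.claude/CLAUDE.md (excerpt)

- Never use Python for scripting; use Bun.
- Prefer Claude Code's native file tools over sed and awk.
- Commit at every logical checkpoint.
- One agent per file; agents must not step on each other.
- Every feature starts with an openspec proposal.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;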

&lt;p&gt;Second batch was openspec integration. Instead of embedding massive documentation inline in every project, each repo got a one-liner pointing to the new skill.&lt;/p&gt;

&lt;p&gt;Third batch was the main event: the &lt;code&gt;claude-skills&lt;/code&gt; repository with three complete skills.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;openspec-and-beads&lt;/strong&gt; skill formalized my most-used workflow: gather project context, scaffold a change proposal with motivation and delta specs, decompose into prioritized beads linked to an epic, track during implementation, archive when done. Before this existed as a skill, I was re-explaining the concept every time I started a new feature. Now it's a single invocation.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;agent-team&lt;/strong&gt; skill captured something more subtle: the coordination patterns for running multiple agents in parallel. The key insight was the "cover agent" pattern: when an agent hits roughly 50% of its context window, you let it finish its current task, create a new bead for the remaining work, and respawn a fresh agent to pick it up. The skill also codifies rules I'd learned the hard way. One agent per file to avoid merge conflicts. Commit at every logical checkpoint. Shut down in reverse dependency order.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;use-bun&lt;/strong&gt; skill was just a quick reference card. A lookup table for Bun equivalents of common Node/npm patterns. &lt;code&gt;bunx tsc --noEmit&lt;/code&gt; instead of &lt;code&gt;npx tsc&lt;/code&gt;, &lt;code&gt;Bun.file()&lt;/code&gt; instead of &lt;code&gt;fs.readFile&lt;/code&gt;. Tiny, but it eliminated a whole class of "how do I do X with Bun again?" questions.&lt;/p&gt;
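&lt;p&gt;The whole skill is essentially a two-column list of substitutions, along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install              -&gt; bun install
npm run dev              -&gt; bun run dev
npx tsc --noEmit         -&gt; bunx tsc --noEmit
fs.readFile(path)        -&gt; Bun.file(path).text()
jest / vitest            -&gt; bun test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;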

&lt;h2&gt;
  
  
  The TDD Override
&lt;/h2&gt;

&lt;p&gt;Here's a moment that stuck with me. The &lt;code&gt;writing-skills&lt;/code&gt; meta-skill recommended full TDD for new skills: write tests, watch them fail, implement. Claude's response was pragmatic: this content was already empirically validated across hundreds of messages and 240 sessions. The "tests" had already been run, organically, over weeks of real usage. It proceeded directly to writing the specification.&lt;/p&gt;

&lt;p&gt;That felt like the right call. TDD for a skill isn't the same as TDD for code. The validation had already happened in practice. The whole point of this exercise was to &lt;em&gt;capture&lt;/em&gt; what was already working, not to discover new behavior. Sometimes the best test suite is "I did this 50 times and it works."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Structure
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/Crawleyprint/claude-skills" rel="noopener noreferrer"&gt;repo&lt;/a&gt; ended up looking like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/repos/claude-skills/
├── README.md
├── plugin.json
├── marketplace.json
├── skills/
│   ├── openspec-and-beads/
│   │   └── SKILL.md
│   ├── agent-team/
│   │   └── SKILL.md
│   └── use-bun/
│       └── SKILL.md
└── plugins/
    └── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There was a brief moment of confusion about plugin manifests (&lt;code&gt;plugin.json&lt;/code&gt; vs &lt;code&gt;marketplace.json&lt;/code&gt;, and what each needed), but it resolved quickly. The repo was initialized, committed, and pushed to GitHub in the same session.&lt;/p&gt;
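&lt;p&gt;For anyone hitting the same confusion: as I understand it, &lt;code&gt;plugin.json&lt;/code&gt; describes a single plugin, while &lt;code&gt;marketplace.json&lt;/code&gt; indexes the plugins a repo offers for installation. The field names below are from memory and may not match the current schema, so verify against the docs before shipping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plugin.json        { "name": "claude-skills", "description": "...", "version": "0.1.0" }
marketplace.json   { "name": "...", "plugins": [ { "name": "claude-skills", "source": "./" } ] }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;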

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;Your LLM conversations are data. 240 sessions contained patterns I couldn't see because I was living inside them. Having Claude analyze its own session history was like running a profiler on your own workflow: the hot paths become obvious. If you haven't done this yet, do it. Chintan was right.&lt;/p&gt;

&lt;p&gt;Global config eliminates an entire class of wasted tokens. Every "never use Python" I typed was burning context and attention. A single &lt;code&gt;CLAUDE.md&lt;/code&gt; in &lt;code&gt;~/.claude/&lt;/code&gt; fixed that permanently. If you're using Claude Code across multiple projects, this is probably the highest-ROI thing you can do right now.&lt;/p&gt;

&lt;p&gt;Skills are just formalized habits. I didn't settle on the openspec-and-beads workflow during this session. I'd been doing it for weeks. The skill just wrote down what was already true. If you find yourself explaining the same process to Claude more than twice, it belongs in a skill. Not a prompt, not a CLAUDE.md entry, a skill. The distinction matters because skills carry context, structure, and sequencing that a flat config file can't.&lt;/p&gt;

&lt;p&gt;The "cover agent" pattern deserves to be a first-class feature. The idea of monitoring an agent's context utilization and strategically respawning it before it degrades, assigning the remaining work as a new task, is something I'm doing manually. In 2026. While Anthropic ships yet another benchmark blog post. The fact that I have to write a skill to manage context windows because the tool won't do it for me is... a choice. Anthropic, if you're reading this: please steal this idea. I'm begging you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The TODO list from this session is still open. A beads MCP plugin so agents can query and update task status natively instead of shelling out to &lt;code&gt;bd&lt;/code&gt; commands. A review pipeline skill for spawning expert agents and turning their findings into follow-up beads. A branch cleanup skill because the merge-delete-remove dance is identical every single time and I'm tired of typing it.&lt;/p&gt;

&lt;p&gt;The session that produced all of this took about 22 minutes. Three subagents, a lot of pattern recognition, and a pivot from "hmm, I wonder what I actually do" to shipping a plugin repo on GitHub. Not bad for a Saturday morning.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tooling</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Workflow Engineering &gt; Prompt Engineering</title>
      <dc:creator>Mladen Stepanić</dc:creator>
      <pubDate>Sun, 01 Mar 2026 10:42:02 +0000</pubDate>
      <link>https://forem.com/crawleyprint_71/workflow-engineering-prompt-engineering-3pep</link>
      <guid>https://forem.com/crawleyprint_71/workflow-engineering-prompt-engineering-3pep</guid>
      <description>&lt;p&gt;... it's early 2026. Remember when I said AI is a tool? I still believe that. But I've been using Claude Code for a few months now and I need to update the nuance a bit: AI is a tool, but the way you set up the workshop matters more than the tool itself.&lt;/p&gt;

&lt;p&gt;I built a book inventory app. Nothing fancy, it tracks books across households, lets users invite others to share their collections. Hono on the backend, React with Vite on the frontend, Neon Postgres for the database, deployed on Vercel. A boring stack for a boring app. And I mean that as a compliment. Oh, and before you ask - no, it doesn't have any AI-powered features. No "smart recommendations," no "AI-curated reading lists." It's a CRUD app. It stores books. The irony is not lost on me.&lt;/p&gt;

&lt;p&gt;But the way I built it? That part wasn't boring at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Codex Chapter
&lt;/h2&gt;

&lt;p&gt;I started with Codex. I had high hopes. Same skills loaded, same setup, same project. Progress was slow. Not because Codex is bad, it's not, but because the ergonomics didn't work for me. Worktrees felt awkward. Parallel agents were running but I couldn't see what they were doing well enough to react in time. I was spending more energy managing the tool than building the app. That said, this could easily be a skill issue on my part. Codex might click better for someone with a different workflow or habits, I'm not here to tell you it's a bad tool. It just wasn't the right fit for how I work.&lt;/p&gt;

&lt;p&gt;So I switched to Claude Code. And things exploded.&lt;/p&gt;

&lt;p&gt;Same skills. Same project. Different interface. Claude Code's TUI let me see parallel agents running in tmux, let me react fast, let me stay in the flow. That's it. That's the difference. Not smarter AI, not better models, better ergonomics. If you take one thing from this post: capability is table stakes. The interface determines your productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow Is the Skill
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. I didn't just open Claude Code and say "build me a book inventory app." That's how you get a mess.&lt;/p&gt;

&lt;p&gt;Instead, I designed a pipeline. Every feature goes through the same stages:&lt;/p&gt;

&lt;p&gt;First, brainstorming. I use &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;obra's superpowers&lt;/a&gt; skill, specifically the brainstorming mode, for the planning phase. The quality of the definitions jumped noticeably compared to vanilla planning. The output here isn't code, it's clarity about what I'm building and why.&lt;/p&gt;

&lt;p&gt;Then, specification. The planning phase generates an &lt;a href="https://openspec.dev" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt;, a structured definition of what needs to be built. Still no code.&lt;/p&gt;

&lt;p&gt;Then, decomposition. The spec gets broken into &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;beads&lt;/a&gt; (if you haven't seen Steve Yegge's beads repo on GitHub, go look). Each bead is a tight, focused task. This is where the magic starts, because beads keep everything scoped. No context window bloat, no agents wandering off into tangents.&lt;/p&gt;

&lt;p&gt;Then, implementation. This is where I hand the OpenSpec and the beads to Claude Code and say: go. Parallel agents in tmux pick up the tasks and run.&lt;/p&gt;
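&lt;p&gt;If you want to poke at beads yourself, the &lt;code&gt;bd&lt;/code&gt; CLI drives it. From memory, the day-to-day loop looks roughly like the following; the exact subcommands and flags may have drifted, so trust &lt;code&gt;bd --help&lt;/code&gt; over this sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bd create "Add invitation acceptance endpoint"   # file a new bead
bd ready                                         # list beads with no blockers
bd update ISSUE_ID --status in_progress          # claim one
bd close ISSUE_ID                                # mark it done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;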

&lt;h2&gt;
  
  
  The One-Shot That Wasn't
&lt;/h2&gt;

&lt;p&gt;Let me tell you about the household feature. Users living in the same space should have access to all books in that space. And I needed an invitation system so the household creator could invite others.&lt;/p&gt;

&lt;p&gt;I kicked off the pipeline. Brainstorming produced a clean definition. OpenSpec captured the full scope. Beads broke it into tasks. I handed it to Claude Code and the parallel agents basically one-shotted the whole thing, household concept and invitations, implemented and working.&lt;/p&gt;

&lt;p&gt;Sounds impressive, right?&lt;/p&gt;

&lt;p&gt;But was it really a one-shot? The implementation was, sure. But I front-loaded the intelligence into planning, specification, and decomposition. The "shot" landed cleanly because the planning was rigorous. Take away the pipeline and ask Claude Code to "implement household sharing with invitations" cold? You'll get something. Whether you'll get something good is another question.&lt;/p&gt;

&lt;p&gt;This mirrors how experienced developers actually work. Nobody good just starts coding a multi-faceted feature. You think it through, you break it down, then you execute. I just happened to have AI on both sides, doing the thinking and the executing. My job was designing the workflow between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Layers of "Does It Actually Work?"
&lt;/h2&gt;

&lt;p&gt;I can hear the skeptics: "Sure, AI generated the code. But does it work?"&lt;/p&gt;

&lt;p&gt;Fair question. Here's my answer: TDD is in place from the start. Yes, through a skill. Tests exist before implementation, not as an afterthought.&lt;/p&gt;

&lt;p&gt;But tests only tell you the logic is correct. So I also use a Playwright skill with Chrome to watch actual end-to-end runs. I see the app doing what it's supposed to do. No manual clicking through screens, no "I think it works." I watch it work.&lt;/p&gt;

&lt;p&gt;And then, at the end of each meaningful session, I spawn dedicated reviewer agents, one for frontend, one for backend, one for security. Their findings go into a follow-up PR.&lt;/p&gt;

&lt;p&gt;Three layers: TDD catches logic errors. Playwright catches visual and integration errors. Reviewers catch architectural and security issues. None of them manual.&lt;/p&gt;

&lt;p&gt;And then there's the fourth layer: me. I do a thorough manual code review after all of this. AI catches a lot, but I still read the code myself. I need to understand what's in my codebase, I need to know why decisions were made, and I need to catch the things that automated tools miss. The subtle logic that's technically correct but wrong for the product, the naming that'll confuse me in three months, the architectural drift that no linter will flag. If you skip this step, you're not building software, you're accumulating code you don't own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cross-Model Twist
&lt;/h2&gt;

&lt;p&gt;Here's something that might raise eyebrows: I've been experimenting with using Codex as my reviewer.&lt;/p&gt;

&lt;p&gt;Yes, that Codex. The one I moved away from for building. Turns out, Codex 5.3 with maxed out thinking produces genuinely valuable review feedback. And it makes sense when you think about it, review is a different cognitive task than generation. You're evaluating against criteria, not creating from scratch. Codex's deep thinking mode suits that well. The ergonomics that frustrated me during building don't matter for review because it's single-threaded, focused work.&lt;/p&gt;

&lt;p&gt;I'm not loyal to one tool. I'm assembling the best pipeline I can from whatever works. Claude Code plans, specs, decomposes, and implements. Codex reviews. Each playing to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;In my last post I said AI is a tool and that you need to know what you're doing to use it well. I stand by that. But now I'd add: the emerging skill isn't coding, and it isn't prompting either. It's pipeline design.&lt;/p&gt;

&lt;p&gt;Knowing which skills to load, when to brainstorm vs. spec vs. decompose, when to run agents in parallel, when to bring in a different model entirely: that's the craft now. The code is the output. The workflow is the work.&lt;/p&gt;

&lt;p&gt;Will this change again in six months? Probably. But the principle won't: understand what you're building, design a process that keeps quality high, and use whatever tools make that process smooth. Boring? Maybe. But boring apps that work are what people actually need.&lt;/p&gt;

&lt;p&gt;And I still think I'll have a lifetime of work fixing vibe-coded messes. Some things don't change.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>tooling</category>
      <category>devex</category>
    </item>
    <item>
      <title>AI Won't Take Your Job (But Fear Might)</title>
      <dc:creator>Mladen Stepanić</dc:creator>
      <pubDate>Mon, 08 Dec 2025 15:10:50 +0000</pubDate>
      <link>https://forem.com/crawleyprint_71/ai-wont-take-your-job-but-fear-might-5e3o</link>
      <guid>https://forem.com/crawleyprint_71/ai-wont-take-your-job-but-fear-might-5e3o</guid>
      <description>&lt;p&gt;... it's the end of 2025. And the development world as we know it is at turning point. It's been at turning point a lot of times since I've started doing this some 16 years ago, and a lot more times before that. But this one time is special, this one time is different: AI threatens to replace us.&lt;/p&gt;

&lt;p&gt;Relax, I'm dramatic on purpose; we're not going anywhere. I'm writing this post mostly for younger folks, who I don't envy because they're in a position where they are threatened by, scared by, and forced into the AI bubble. People are losing their minds over whether AI will make them obsolete, and they're listening to false prophets who tell them they should be learning a real craft, a physical skill that won't be touched by AI any time soon. But what's the reality? The reality is that AI is a tool (last time I checked). A powerful tool, and it can be a great asset to a developer who knows what they're doing. If you don't know the basics: security, user experience, performance... no amount of AI will help you make a good app. That's the truth, and anyone telling you otherwise is likely fearmongering to inflate their own importance, or to sell you an AI-powered service or one of their dime-a-dozen courses. These people are not your friends!&lt;/p&gt;

&lt;p&gt;I'm not saying you need to avoid AI until it disappears. On the contrary, it's here to stay - just maybe not in the areas AI doom merchants want you to believe it will. You should definitely learn how to work &lt;strong&gt;with AI&lt;/strong&gt;. Large Language Models (LLMs) changed the game for me: I can get to a prototype faster, I can debug faster. I abstracted away boring multi-file edits, tests, and boilerplate, which are generally time-consuming. Do I work more? No. Do I output more? Yes, but probably not as much as my CEO would like me to. I use it to buy myself time to learn core stuff that I'm missing, stuff that AI may know but is unsure how to apply (or applies incorrectly). I'm improving my infrastructure and backend knowledge; I use AI to brainstorm and then have it explain the choices it made. Then I question its choices, which are often either overkill or outright wrong. And that happens more often than any of the doom merchants would like to admit.&lt;/p&gt;

&lt;p&gt;It's easy to pay $200+ a month and have Claude or any agent write the app for you, but what happens when you get a data breach and you don't understand the code well enough to fix it? When you can't explain to your users what went wrong because the AI made decisions you never questioned? Will you be able to sell your service when everyone can use that same $200+ subscription to build their own? Is that sustainable?&lt;/p&gt;

&lt;p&gt;I don't think so.&lt;/p&gt;

&lt;p&gt;Instead, people will still need well-thought-out software that they can use without the hassle of setting up 10 cloud services, or worrying about backups, availability, or disaster recovery.&lt;/p&gt;

&lt;p&gt;Will our lives change?&lt;br&gt;
You bet they will. Which way - that's up to you. You need to decide whether you'll chase the agent-of-the-day or use AI to produce boring solutions that actually work.&lt;/p&gt;

&lt;p&gt;Your call.&lt;/p&gt;

&lt;p&gt;I've got this figured out for myself. I think I'll have a lifetime of work fixing the vibe-coded messes false prophets will inevitably create. It'll likely be boring, but sometimes boring is good.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opinion</category>
    </item>
  </channel>
</rss>
