<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Nash</title>
    <description>The latest articles on Forem by Michael Nash (@michaelnash).</description>
    <link>https://forem.com/michaelnash</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3178852%2F039fc249-0449-446d-a856-f7794d1d74cb.jpg</url>
      <title>Forem: Michael Nash</title>
      <link>https://forem.com/michaelnash</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michaelnash"/>
    <language>en</language>
    <item>
      <title>Claude Code: The Token Tax</title>
      <dc:creator>Michael Nash</dc:creator>
      <pubDate>Thu, 12 Feb 2026 09:00:04 +0000</pubDate>
      <link>https://forem.com/michaelnash/claude-code-the-token-tax-4i01</link>
      <guid>https://forem.com/michaelnash/claude-code-the-token-tax-4i01</guid>
      <description>&lt;p&gt;Part 3 was supposed to be me walking you through building a real project with this workflow. The "build with me" post where I show you how it all comes together on a production app. &lt;/p&gt;

&lt;p&gt;That's not what happened. &lt;/p&gt;

&lt;p&gt;Opus 4.6 dropped last week. I switched to it immediately, and my output tripled – speed and quality, genuinely night and day. But at 3x the cost. That's when I realised: if I'm going to run the most expensive model, I need to make sure I'm not wasting a single token getting there. So, I did what any sensible person would do. I paused everything, tore my entire workflow apart, and rebuilt it from scratch. &lt;/p&gt;

&lt;p&gt;This proved too valuable not to share. So here we are. &lt;/p&gt;

&lt;h2&gt;I Followed My Own Advice (And That Was the Problem)&lt;/h2&gt;

&lt;p&gt;Remember Part 2? I told you: "Every time Claude does something stupid, I add a rule. That's how the file evolves." &lt;/p&gt;

&lt;p&gt;Solid advice. I stand by it. But here's what I didn't anticipate: what happens when you follow that advice religiously – not on a fun little side project, but on something headed for production. &lt;/p&gt;

&lt;p&gt;My GLOBAL_STANDARDS file ballooned past 500 lines and kept growing. Not just from me manually adding rules – I'd also been asking Claude to iterate on it, to learn from what it had done right or wrong, and to update the standards accordingly. Sounds smart, right? Claude updating its own rulebook based on experience? &lt;/p&gt;

&lt;p&gt;In theory, brilliant. In practice, I'd created a document that was teaching Claude things it already knows, repeating itself in slightly different ways, and growing with absolutely no ceiling in sight. The commands got verbose. Design docs loaded into every single task. The workflow was producing exactly what I wanted. &lt;/p&gt;

&lt;p&gt;And then I actually looked at what it was costing me. &lt;/p&gt;

&lt;h2&gt;The Wake-Up Call&lt;/h2&gt;

&lt;p&gt;I'd been hitting my usage limit for the day quite quickly – then for the week. So, I switched to token billing as an experiment, just to see what was actually happening under the hood. &lt;/p&gt;

&lt;p&gt;31,500 tokens consumed before Claude wrote a single line of code. &lt;/p&gt;

&lt;p&gt;That's not the work. That's the startup cost. All the context loading, standards reading, file searching that happens before any actual development begins. Those 31,500 tokens? About a quid. Just to get started. &lt;/p&gt;

&lt;p&gt;Then I ran /address-review to handle some PR feedback – refactoring, package installation, and a few iterations. Three quid. For reviewing comments. Using Opus 4.6 felt like popping down to the shops in a Ferrari – technically impressive, financially idiotic. &lt;/p&gt;

&lt;p&gt;Opus 4.6 had made everything simultaneously better and more expensive, and that forced the issue: I had to optimise what I already had. If I'm going to run the most expensive model available, I need to make absolutely certain I'm not wasting a single token getting there. &lt;/p&gt;

&lt;h2&gt;What Was Actually Happening&lt;/h2&gt;

&lt;p&gt;Every time I ran a /ship command, here's what Claude was doing: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CLAUDE.md gets auto-loaded by Claude Code (319 lines) – can't avoid this, it's how the system works &lt;/li&gt;
&lt;li&gt;My /plan command explicitly tells Claude to read CLAUDE.md again – &lt;strong&gt;completely redundant&lt;/strong&gt;, it's already loaded &lt;/li&gt;
&lt;li&gt;Then it reads GLOBAL_STANDARDS.md (over 500 lines, still growing) – &lt;strong&gt;every single time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Then the design system docs (557 lines) – &lt;strong&gt;even for backend work&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Then design patterns (540 lines) – &lt;strong&gt;even for database tasks&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over 15,000 tokens of documentation loaded before Claude even looks at the GitHub issue. Then add the issue itself, codebase search, related files, past PR analysis – and you're sitting at 31,500 tokens before work begins. &lt;/p&gt;
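&lt;p&gt;As a sanity check, that always-loaded tax can be reproduced with back-of-envelope arithmetic. A quick Python sketch – the line counts are the ones above, but the tokens-per-line ratio is my own illustrative assumption, not a measured figure:&lt;/p&gt;

```python
# Back-of-envelope estimate of the startup tax from always-loaded docs.
# Line counts come from the article; ~8 tokens per line of dense markdown
# is an assumed average, not a measurement.
doc_lines = {
    "CLAUDE.md": 319,
    "GLOBAL_STANDARDS.md": 500,
    "design_system.md": 557,
    "design_patterns.md": 540,
}
TOKENS_PER_LINE = 8  # illustrative assumption

startup_tokens = sum(doc_lines.values()) * TOKENS_PER_LINE
print(startup_tokens)  # roughly 15,000 tokens before the issue is even read
```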

&lt;p&gt;And here's the bit that properly stung: most of those 500+ lines in GLOBAL_STANDARDS were teaching Claude things it already knows. &lt;/p&gt;

&lt;p&gt;SOLID principles? TypeScript strict mode? React hooks patterns? All in the training data. I just didn't trust it at first. I was still treating it like ChatGPT in '23 – spelling out every little thing because I didn't believe it knew what I meant. But Claude doesn't need you to explain what composition over inheritance is. It needs you to tell it that you prefer it. Ten words instead of two hundred. &lt;/p&gt;

&lt;h2&gt;Asking the Engine What It's Doing Wrong&lt;/h2&gt;

&lt;p&gt;First, I read a bunch of articles. Watched a few videos. Dug into Anthropic's documentation and community findings. Built my own agenda. &lt;/p&gt;

&lt;p&gt;But then I thought – what better way to optimise than to ask the nearly-sentient engine itself what the fuck it's doing wrong? &lt;/p&gt;

&lt;p&gt;I fed Claude my entire setup: the commands, the standards file, the project structure. Gave it everything I'd gathered from Anthropic's own best practices, community benchmarks, the lot. Asked it to audit my workflow and build me an optimisation plan. &lt;/p&gt;

&lt;p&gt;What came back was humbling. Not because it was genius – but because the problems were obvious once someone pointed them out. &lt;/p&gt;

&lt;p&gt;The chained commands were re-referencing files that were already loaded. The standards file was lecturing Claude on things in its training data. Design docs were loading for backend work because I'd never added a condition to check whether the task actually involved UI. I'd built a system with circular references, redundant reads, and zero awareness of what was actually needed versus what was always dumped in. &lt;/p&gt;

&lt;p&gt;I'd been so focused on guardrails – so determined to stop this thing going rogue (mainly because of past experience losing my mind at AI tools doing stupid shit) – that I'd pushed it into checking the same things three different ways. Some of it was oversight on my part. Some of it was stuff I only understood later, once I could see the token costs laid bare. &lt;/p&gt;

&lt;h2&gt;Digging Deeper Into What Claude Code Can Actually Do&lt;/h2&gt;

&lt;p&gt;Here's the thing – I'd been using Claude Code's command system since I started, and it worked. But once I started optimising, I dug deeper into what was actually available and realised I'd been ignoring features that solve exactly the problems I was brute-forcing. &lt;/p&gt;

&lt;p&gt;Skills, hooks, sub-agents, model routing – these aren't new; they've been there since late last year. But they seemed either trivial or just not relevant when I first glanced at them. Who needs hooks when you can just tell Claude to run Prettier? Who needs sub-agents for simple tasks? &lt;/p&gt;

&lt;p&gt;You do. The moment you start hitting your context window regularly and watching your credits evaporate, suddenly these features become essential. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills replaced commands.&lt;/strong&gt; Same markdown files, but with frontmatter that controls behaviour – which model to use, whether to fork into a sub-agent, which tools are allowed. My old commands were flat scripts. Skills are configurable. &lt;/p&gt;
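&lt;p&gt;For illustration, a skill file with behaviour-controlling frontmatter might look something like this – the exact frontmatter keys are from my reading of the docs at the time and may differ in current versions, so treat it as a sketch:&lt;/p&gt;

```markdown
---
name: address-review
description: Handle PR review feedback with minimal context
model: sonnet
---

IF the feedback touches UI files, read .claude/standards/ui.md first.
Otherwise, work from the diff and the review comments alone.
```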

&lt;p&gt;&lt;strong&gt;Hooks handle the shit I was wasting tokens on.&lt;/strong&gt; PostToolUse hooks can auto-format after every edit. Claude never sees a Prettier error now – it gets fixed silently before Claude even knows it existed. No more loops where Opus burns thirty pence trying to fix a semicolon. That alone was a game-changer. &lt;/p&gt;
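&lt;p&gt;A sketch of what that hook looks like in .claude/settings.json – the matcher string and the stdin payload fields are from my reading of the hooks docs, so verify the shape against the current reference before copying it:&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|MultiEdit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path // empty' | xargs -r npx prettier --write"
          }
        ]
      }
    ]
  }
}
```

&lt;p&gt;The command reads the hook's JSON payload on stdin, pulls out the edited file path, and runs Prettier against it – all before Claude ever sees a formatting error.&lt;/p&gt;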

&lt;p&gt;&lt;strong&gt;Sub-agents let you route the right brain to the right task.&lt;/strong&gt; Opus for planning and architecture. Sonnet for implementation – roughly 5x cheaper and more than capable of the typing work. Haiku for fast codebase search. Instead of running everything through the most expensive model, you match the brain to the job. &lt;/p&gt;

&lt;p&gt;I built six specialised review agents – dependencies, performance, security, bundle size, git history, and final optimisation pass. They all run on Sonnet or Haiku. They execute in parallel, aggregate their findings, and catch issues before PR creation. Faster, cheaper, more thorough than having Opus handle everything. Cost? Negligible. Value? Caught 50% more bugs pre-PR. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model routing via opusplan&lt;/strong&gt; – Opus handles the thinking; Sonnet handles the execution, automatically. No manual switching. &lt;/p&gt;

&lt;p&gt;These aren't nice-to-haves. They're the features that made a lean architecture possible. Without hooks, I'd still be burning tokens on lint loops. Without skills, I'd still be loading everything upfront. Without model routing, every task would cost Opus prices regardless of complexity. &lt;/p&gt;

&lt;h2&gt;The Rebuild&lt;/h2&gt;

&lt;p&gt;I tore it all down and rebuilt it lean. One principle: &lt;strong&gt;load exactly what's needed, when it's needed.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md went from 319 lines to 61.&lt;/strong&gt; Deleted code examples, verbose patterns, and data model definitions. Replaced with pointers to other files using Claude Code's &lt;code&gt;@import&lt;/code&gt; syntax. Working on data layer stuff? Only architecture.md loads. UI work? Only then does ui.md get pulled in. &lt;/p&gt;
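&lt;p&gt;A condensed sketch of what the lean file ends up looking like – the paths are illustrative, and it's worth checking the memory docs for exactly when referenced files get resolved:&lt;/p&gt;

```markdown
# CLAUDE.md (lean)

Core conventions only. Everything else is pulled in on demand:

- Data layer work: read @.claude/standards/architecture.md
- UI work: read @.claude/standards/ui.md
- Testing: read @.claude/standards/testing.md
```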

&lt;p&gt;&lt;strong&gt;GLOBAL_STANDARDS.md got decomposed and deleted.&lt;/strong&gt; Split into five small, targeted files in .claude/standards/: architecture (28 lines), code quality (29 lines), UI (31 lines), testing (25 lines), git conventions (22 lines). Total: 135 lines across five files, loaded on demand. &lt;/p&gt;

&lt;p&gt;I'd told people that if my standards file ever got too big, I'd split it out. That's exactly what happened. But what I didn't expect was how much I could delete entirely. About 350 lines of things Claude inherently knows – SOLID explanations, DRY explanations, TypeScript strict mode examples. Gone. You're not teaching Claude TypeScript. You're declaring your preferences. Big difference. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learnings system replaced verbose explanations.&lt;/strong&gt; Sounds smart to let Claude update its own rulebook based on experience, and it is – but I was doing it wrong. Instead of verbose explanations about why something's an anti-pattern, I built a compact learnings file: 10-word bullets on what went wrong, not 200 words on why it's wrong. Lives in its own .claude/state/learnings.md, doesn't bloat the global standards, survives compaction. &lt;/p&gt;

&lt;p&gt;The system's smart about it too – automated keyword matching with a 70% similarity threshold catches duplicates before they get added. If a new learning is substantially similar to an existing one, it prompts for a merge rather than creating noise. No manual deduplication faff. And it only captures learnings from meaningful PRs – 100+ lines changed, 5+ files touched, or manually tagged. Trivial PRs get skipped. Result? 70% fewer learning extractions, better signal-to-noise ratio. &lt;/p&gt;
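&lt;p&gt;The dedup check is simple enough to sketch in a few lines of Python – I'm using a SequenceMatcher ratio here as a stand-in for the keyword matching, so treat the details as illustrative:&lt;/p&gt;

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.70  # matches the 70% threshold described above

def is_duplicate(new_learning, existing_learnings):
    """Return the existing entry a new learning should merge into, if any.

    A character-level SequenceMatcher ratio stands in for the real keyword
    matching - the exact scheme is my own sketch.
    """
    for existing in existing_learnings:
        ratio = SequenceMatcher(None, new_learning.lower(), existing.lower()).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            return existing
    return None

learnings = ["Avoid default exports in shared packages"]
print(is_duplicate("avoid default exports in shared packages", learnings))
# prints the existing entry, prompting a merge instead of a new bullet
```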

&lt;p&gt;&lt;strong&gt;Commands became skills with just-in-time context.&lt;/strong&gt; The old /plan command always read 1,100 lines of design docs. Even for backend work. Even for database migrations. New version: "IF this involves UI changes, read design docs. IF not, don't." That alone saves roughly 8,000 tokens per non-UI plan. &lt;/p&gt;

&lt;p&gt;But I took it further. Context thresholds are now dynamic based on what phase you're in and how much work remains. Planning needs more headroom (60% threshold) because you're exploring the codebase. Building can run tighter (70%) because it's more predictable. Review phase? You can push to 80% safely because it's mostly reading. And if you're 7 steps into a 10-step plan, the system adjusts upward because there's less unknown work ahead. Smart scaling instead of arbitrary limits. &lt;/p&gt;
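&lt;p&gt;The threshold logic boils down to something like this – the base numbers match what I described above, but the progress bonus formula is illustrative:&lt;/p&gt;

```python
# Phase-aware compaction thresholds: base numbers mirror the article
# (plan 60%, build 70%, review 80%); the progress bonus is an
# illustrative formula, not the exact one in my setup.
BASE_THRESHOLDS = {"plan": 0.60, "build": 0.70, "review": 0.80}

def compact_threshold(phase, steps_done, steps_total):
    base = BASE_THRESHOLDS[phase]
    progress = steps_done / steps_total if steps_total else 0.0
    # Less unknown work ahead means the context can run a little fuller,
    # capped so there is always headroom for an emergency save.
    bonus = 0.10 * progress
    return min(base + bonus, 0.85)

print(compact_threshold("build", 7, 10))  # 7 steps into a 10-step plan
```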

&lt;p&gt;&lt;strong&gt;PostToolUse hooks for auto-formatting.&lt;/strong&gt; Every file edit gets Prettier run against it silently. Claude never sees formatting errors. No more lint loops where it burns money trying to fix a missing semicolon. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Two-Strike Rule.&lt;/strong&gt; If Claude fails to fix a build error after 2 attempts, it stops and asks me for guidance. No more infinite loops where it burns money trying to fix something it clearly doesn't understand. Token-looping is the new memory leak – catch it early or it'll bleed you dry. &lt;/p&gt;
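&lt;p&gt;The rule itself is a tiny circuit breaker. A toy Python sketch of the idea:&lt;/p&gt;

```python
# Minimal sketch of the two-strike circuit breaker: after MAX_ATTEMPTS
# failed fixes, stop burning tokens and escalate to the human.
MAX_ATTEMPTS = 2

class NeedsHumanGuidance(Exception):
    pass

def fix_with_strikes(attempt_fix, build_passes):
    for attempt in range(MAX_ATTEMPTS):
        attempt_fix()
        if build_passes():
            return attempt + 1  # number of attempts it took
    raise NeedsHumanGuidance("Two failed fixes - stopping for manual input")

# Toy stand-ins for the real fix-and-rebuild loop:
state = {"tries": 0}
def attempt_fix():
    state["tries"] += 1
def build_passes():
    return state["tries"] == 2  # build passes on the second attempt

print(fix_with_strikes(attempt_fix, build_passes))  # 2
```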

&lt;p&gt;&lt;strong&gt;Context monitoring with emergency handlers.&lt;/strong&gt; Three-tier system: warning at 70% (controlled compact with state preservation), emergency save at 90% (everything critical gets backed up – uncommitted changes, merge conflicts, untracked files, stashes, the lot), SessionStart recovery (pick up exactly where you left off if interrupted within the last 2 hours). Never lose work to context overflow again. The emergency handler uses git porcelain to capture everything – even edge cases like active merge conflicts or untracked files that a simple git diff would miss. &lt;/p&gt;
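&lt;p&gt;The triage step is the interesting bit: parsing porcelain output so merge conflicts and untracked files get captured too. A simplified Python sketch – the real handler shells out to git; this just classifies the output:&lt;/p&gt;

```python
# Classify `git status --porcelain` output for the emergency save,
# covering the cases a plain `git diff` would miss (conflicts, untracked).
def classify_porcelain(porcelain_output):
    buckets = {"conflict": [], "untracked": [], "modified": []}
    for line in porcelain_output.splitlines():
        if not line.strip():
            continue
        status, path = line[:2], line[3:]  # two status columns, space, path
        if "U" in status or status in ("AA", "DD"):
            buckets["conflict"].append(path)   # unmerged / both-sides cases
        elif status == "??":
            buckets["untracked"].append(path)
        else:
            buckets["modified"].append(path)
    return buckets

sample = "UU src/auth.ts\n?? notes.md\n M src/app.ts"
print(classify_porcelain(sample))
```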

&lt;p&gt;&lt;strong&gt;Visual QA that doesn't bankrupt you.&lt;/strong&gt; I integrated Claude's Chrome extension for visual regression testing but made it smart. It only runs when UI files change significantly – new components or 10+ lines modified in UI files. When it does run, it has severity levels: critical issues (3+ layout shifts, broken flows) block PR creation and trigger an automated fix agent. High-severity issues (console errors, missing elements) prompt for manual decisions. Low-severity stuff (minor spacing, colours) just gets logged. Reduces the cost fivefold compared to running it on every ship. Value? Catches visual regressions before they hit production. &lt;/p&gt;

&lt;p&gt;And here's something I found in Anthropic's own research that validated the whole exercise: instruction-following quality decreases as instruction count rises. LLMs can reliably follow about 150–200 instructions. Beyond that, compliance drops. My 500+ line standards file? Claude was ignoring half of it. I was paying more for worse results. &lt;/p&gt;

&lt;h2&gt;The Results&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg9uq240oyt084oyfusu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg9uq240oyt084oyfusu.png" alt="Claude Cleanup Results" width="744" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same workflow. Same guardrails. Same checkpoints. But now with six specialised agents catching more bugs, visual regression testing on UI changes, automated learning capture, and dynamic context management. For about 19p per ship – including all the new quality improvements – I'm catching 50% more issues before they hit PR review. &lt;/p&gt;

&lt;p&gt;That's the trade-off that makes sense. Not "how cheap can I make this," but "what's the optimal balance of cost and quality." And now I can run Opus 4.6 – the best model I've used – without wincing at the bill. &lt;/p&gt;

&lt;h2&gt;What I Actually Learned&lt;/h2&gt;

&lt;p&gt;This whole exercise was pure evidence of something I keep banging on about: learning comes first. Scrutinise your own approach. Refactor, refactor, refactor. Especially when new capabilities are available that you haven't explored yet. &lt;/p&gt;

&lt;p&gt;I'd overengineered my own system. Not because the thinking in Parts 1 and 2 was wrong – the principles are sound. You need guardrails. You need a plan-first approach. You need human-in-the-loop approval. All of that still holds. &lt;/p&gt;

&lt;p&gt;But I'd applied those principles when a more precise approach would do. The standards file grew unchecked. The commands duplicated work. The context loading had no awareness of what was actually relevant. Just-in-time context matters – make sure you aren't overfeeding this thing, because it'll eat everything you put in front of it and charge you for the privilege. &lt;/p&gt;

&lt;p&gt;Here's what to take away if you're running any AI coding workflow: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your always-loaded context.&lt;/strong&gt; Run whatever cost-tracking your tool offers and see what you're actually burning just to get started. CLAUDE.md plus any files loaded by every command – that's your tax. Mine was 1,377 lines. Yours is probably similar. I built a simple cost reporter that aggregates logs and shows me actual spend per ship. Visibility matters. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop teaching Claude what it already knows.&lt;/strong&gt; SOLID, DRY, YAGNI, TypeScript patterns, React hooks – it's all in the training data. Declare your preferences in 10 words, not 200. Delete the lectures. You're not onboarding a junior developer who's never seen JavaScript. You're configuring a tool that already knows the fundamentals. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never re-read auto-loaded files.&lt;/strong&gt; CLAUDE.md gets loaded automatically. Every command that says "Read project CLAUDE.md" is double-loading it. Once is enough. Check your commands for redundant reads. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split monolithic standards into domain-specific files.&lt;/strong&gt; Architecture, UI, testing, git – separate concerns, loaded on demand. Same principle as good code: single responsibility. My 500+ line monolith became 135 lines across five files. Easier to maintain, cheaper to load. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add just-in-time guards to everything.&lt;/strong&gt; "IF this involves UI, THEN read design docs." Conditional context loading saves thousands of tokens per task. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use hooks for formatting and linting.&lt;/strong&gt; PostToolUse hooks running Prettier after every edit eliminates an entire class of expensive loops. Never let an LLM waste tokens fixing a semicolon. Handle it at the tooling layer. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route the right model to the right task.&lt;/strong&gt; Opus for planning and architecture. Sonnet for implementation. Haiku for codebase search. Match the tool to the job. My specialised review agents all run on Sonnet or Haiku – cheaper, faster, just as effective for focused tasks. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce the Two-Strike Rule.&lt;/strong&gt; If Claude can't fix a build error in two attempts, it stops and asks you. Token-looping will bleed you dry. Catch it early, intervene manually, move on. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a learnings system that doesn't bloat standards.&lt;/strong&gt; Capture what went wrong in compact bullets (10 words, not 200), store separately from global standards, reference on-demand. Automate deduplication with keyword matching. Only extract learnings from meaningful PRs (100+ lines, 5+ files, or manually tagged). Knowledge retention without context bloat. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement dynamic context monitoring with safety nets.&lt;/strong&gt; Adjust thresholds based on phase complexity and remaining work. Planning needs more headroom than review. Almost done with implementation? You can push closer to the limit safely. Three-tier approach: early warning (60–70% depending on phase), emergency save (90%), SessionStart recovery. Never lose work to context overflow. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use specialised agents for review tasks.&lt;/strong&gt; Six cheap Sonnet/Haiku agents running specific checks (dependencies, performance, security, bundle size, git history, optimisation) catch more bugs than one expensive Opus review. They run in parallel, degrade gracefully if some fail (0–1 failures? Continue. 4+ failures? Something is broken, abort), and aggregate reports with conflict resolution. Faster, cheaper, and more thorough. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make visual QA smart, not wasteful.&lt;/strong&gt; Only run it when UI files change significantly. Use severity levels to decide what's blocking vs. what's informational. Critical regressions trigger automated fix agents. Minor spacing differences just get logged. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a rollback escape hatch.&lt;/strong&gt; When you're rebuilding a working system, back up the old version. Keep all your original versions as variants during the transition. Test the new approach on small issues first and validate before committing fully. If something breaks, you can revert immediately. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat your AI setup like code.&lt;/strong&gt; It needs refactoring. It accumulates tech debt. New capabilities make old patterns obsolete. Review it with the same rigour you'd review a pull request. I optimised mine after seeing the costs. You should probably do the same. &lt;/li&gt;
&lt;/ol&gt;
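&lt;p&gt;To make the first takeaway concrete, here's a toy version of a per-ship cost reporter – the per-model prices are placeholders, not real Anthropic rates, so plug in the current pricing:&lt;/p&gt;

```python
# Toy per-ship cost reporter: aggregate token logs into spend.
# Prices per million tokens are ILLUSTRATIVE PLACEHOLDERS, not real rates.
PRICE_PER_MTOK = {"opus": 15.00, "sonnet": 3.00, "haiku": 0.80}

def ship_cost(log_entries):
    """log_entries: list of (model, tokens) tuples for one /ship run."""
    total = 0.0
    for model, tokens in log_entries:
        total += tokens / 1_000_000 * PRICE_PER_MTOK[model]
    return round(total, 4)

run = [("opus", 12_000), ("sonnet", 90_000), ("haiku", 40_000)]
print(ship_cost(run))  # spend for this run at the placeholder rates
```

&lt;p&gt;Even a crude report like this makes the routing payoff obvious: the Sonnet and Haiku work dominates the token count but barely registers in the bill.&lt;/p&gt;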

&lt;h2&gt;The Uncomfortable Bit&lt;/h2&gt;

&lt;p&gt;If my setup was bloated – and I'd specifically built it to be efficient – yours probably is too. &lt;/p&gt;

&lt;p&gt;Every AI workflow has a startup tax. Most people just pay it without realising. And the cost isn't only money. It's degraded performance. You're loading instructions that Claude can't follow, paying for context it ignores, and wondering why the output isn't consistent. &lt;/p&gt;

&lt;p&gt;The developers winning with AI aren't the ones using it most. They're the ones thinking about how they use it. Lean context. Just-in-time loading. Model routing. Circuit breakers. Specialised agents. Dynamic thresholds. Visual regression testing. Automated learning capture. These aren't advanced optimisations. They're fundamentals. &lt;/p&gt;

&lt;p&gt;If you're burning a quid per command just to get started and accepting that as "the cost of AI," someone else is doing the same work for a fraction of the cost, catching more bugs, and getting better results. &lt;/p&gt;

&lt;p&gt;The question isn't whether you should optimise. It's whether you can afford not to. &lt;/p&gt;




&lt;p&gt;I'm Mike Nash, an Independent Technical Consultant who's spent 10+ years building software, leading engineering teams, and apparently optimising AI workflows because Opus 4.6 made me look at the bill. &lt;/p&gt;

&lt;p&gt;These days I help companies solve gnarly technical problems, modernise legacy systems, and build software that doesn't fall over – or burn money unnecessarily. &lt;/p&gt;

&lt;p&gt;Want to see the full setup, chat about AI optimisation, or swap war stories? Drop a comment or connect on &lt;a href="https://www.linkedin.com/in/michael-nash-504838b3/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>software</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Claude Code: Why Context Is Everything</title>
      <dc:creator>Michael Nash</dc:creator>
      <pubDate>Thu, 05 Feb 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/michaelnash/part-2-why-context-is-everything-562o</link>
      <guid>https://forem.com/michaelnash/part-2-why-context-is-everything-562o</guid>
      <description>&lt;p&gt;In Part 1, I talked about why we need guardrails for AI development. Now let's build them. &lt;/p&gt;

&lt;p&gt;Fair warning: this isn't some comprehensive installation guide. The Claude Code docs cover that well enough. This is about the thinking behind the setup, the decisions, the structure that works, and why I built it this way. &lt;/p&gt;

&lt;p&gt;(Want my complete setup with all the commands? Comment and I'll drop a link to the repo) &lt;/p&gt;

&lt;h2&gt;The Stack (And Why I Didn't Overthink It)&lt;/h2&gt;

&lt;p&gt;Before we get into the Claude Code magic, here's what you'll need: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js - Claude Code runs on it, and most modern dev tooling expects it. &lt;/li&gt;
&lt;li&gt;GitHub CLI - This is non-negotiable. Claude Code integrates perfectly with gh, and it's the difference between "create a PR manually through the web UI like a caveman" and "Claude handles the entire git workflow autonomously." &lt;/li&gt;
&lt;li&gt;Claude Code itself - npm install -g @anthropic-ai/claude-code. That's it. No ceremony. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The installation itself is straightforward - Node.js from nodejs.org, Claude Code via npm. Standard stuff. What's more interesting is why this stack matters for autonomous workflows. &lt;/p&gt;

&lt;p&gt;The GitHub CLI handles all the git plumbing - creating branches, opening PRs, managing issues. Claude Code is the engine that reads your project, executes commands, and makes changes. Your command structure? That's what keeps Claude from deciding to rewrite your entire auth system when you asked it to fix a button colour. &lt;/p&gt;

&lt;p&gt;That's the stack. Simple, effective, keeps Claude on a leash. &lt;/p&gt;

&lt;h2&gt;The GLOBAL_STANDARDS File: Your AI's Rulebook&lt;/h2&gt;

&lt;p&gt;Remember that Lincoln quote about sharpening the axe? This is where you do it. &lt;/p&gt;

&lt;p&gt;The GLOBAL_STANDARDS.md file lives in ~/.claude/ and gets referenced by every project. Think of it as the framework that Claude Code has to follow. My file is about 200 lines, and it's saved me from countless hours of cleaning up Claude's mess.&lt;/p&gt;

&lt;p&gt;I didn’t start with a perfect 200-line standards document. I started with around 20 lines of “don’t be a dickhead” guidance. The rest came from Claude doing stupid shit and me adding rules afterwards. &lt;/p&gt;

&lt;p&gt;The standards file is me dealing with the problems as they happen, instead of hoping Claude magically figures out what I want. Every time something goes sideways, I write it down. Every time I catch myself thinking "why the fuck did it do it that way?" - a new rule. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What goes in it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core principles:&lt;/strong&gt; SOLID, DRY, YAGNI, DDD. Not because these are sacred laws, but because they're MY defaults. Claude needs to know how I think, not guess based on whatever training data it has. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code organisation standards:&lt;/strong&gt; file naming conventions, folder structure, how to handle errors, where configs go. Boring? Yes. Essential? Absolutely. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language-specific patterns:&lt;/strong&gt; For me that's TypeScript strict mode, React functional components with hooks, no class components unless there's a bloody good reason. And yes, I know hooks have been the standard since 2019, but Claude sometimes still tries to generate class components because apparently it learned React from 2017 tutorials. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bit everyone forgets:&lt;/strong&gt; What NOT to do. "Don't introduce new dependencies without asking." "Don't refactor working code unless that's the explicit task." "Don't get clever with abstractions." &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa2v55ze5x8l5nbkiftn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa2v55ze5x8l5nbkiftn.png" alt="Example section of GLOBAL_STANDARDS configuration" width="578" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every time Claude does something stupid, I add a rule. That's how the file evolves. It's not some grand architectural document I spent weeks planning - it's a living record of "here's what went wrong and here's how to not do it again." &lt;/p&gt;

&lt;p&gt;The alternative is doing what I used to do - just accepting that AI tools occasionally generate shit code, shouting at it while trying to understand why, spending hours cleaning it up, and then watching it make the same mistakes next week. That's exhausting. Writing down "don't do this" once is way less effort than fixing it repeatedly. &lt;/p&gt;

&lt;h2&gt;Building Commands That Actually Work&lt;/h2&gt;

&lt;p&gt;Commands in Claude Code are just markdown files. Simple, right? Wrong. &lt;/p&gt;

&lt;p&gt;The difference between a command that works and one that goes wild is in the structure. Let me show you with the command I use most: /SHIP. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotzul74l6ct12e5mdjtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotzul74l6ct12e5mdjtl.png" alt="The /SHIP workflow structure" width="510" height="738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full command is ~400 lines because it's very explicit about what to do at each stage. But that's the point - no ambiguity means fewer "creative interpretations." &lt;/p&gt;

&lt;p&gt;At each stage (PLAN, BUILD, TEST, REVIEW), Claude pauses and shows me what it's done - the plan it created, the commits it made, the test results. I review, approve, or give feedback before it continues. It's less "watch AI code for you" and more "AI does the grunt work, you make the decisions." &lt;/p&gt;

&lt;p&gt;Why it's structured this way: PLAN comes first - always. Force Claude to think before typing. No plan, no build. Each step has explicit checkpoints where I can approve, revise, or cancel. Testing isn't optional - UI changes get Puppeteer tests; business logic gets unit tests, no exceptions. And before any PR gets opened, Claude checks its own work against the standards file, physically ticks off a checklist. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learnt the hard way:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Early versions of this command didn't have the PLAN step. Claude would just start coding immediately. When it worked, it was brilliant. When it didn't, I had no reference point to course-correct because there was no plan to check against. &lt;/p&gt;

&lt;p&gt;The self-review step came after Claude shipped code that technically worked but violated half my standards. You've got to remember that you're basically talking to a very academic student with no real-life experience. Pushing it to constantly check its work, review itself against what we've already discussed, and physically tick off a checklist keeps it in line far more reliably. &lt;/p&gt;

&lt;h2&gt;
  
  
  Context: The Bit That Makes It Actually Work
&lt;/h2&gt;

&lt;p&gt;Here's what most people miss when they talk about autonomous AI workflows: context gathering. This is the difference between Claude blindly generating code and Claude actually understanding your codebase. &lt;/p&gt;

&lt;p&gt;I figured this out the hard way. In early versions of my workflow, Claude would generate perfectly valid React code that had nothing to do with how I'd built the rest of the application. Different naming conventions, different patterns. Technically correct, completely inconsistent. &lt;/p&gt;

&lt;p&gt;It's as if you joined a new team and started writing code before you'd read any of the existing codebase. Sure, you can write working code - but it won't fit. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scratchpads are Claude's working memory.&lt;/strong&gt; Every time you run /plan, it creates a scratchpad file in scratchpads/issue-[number]-plan.md. That file contains the entire thought process - what contracts are needed, what files to create, edge cases to handle, testing strategy. &lt;/p&gt;

&lt;p&gt;But here's the clever bit - those scratchpads don't just disappear after the task is done. They sit there. Next time Claude works on something similar, it searches them. "Have I solved authentication before?" "How did we handle form validation last time?" Instead of starting from zero every single time, it's building up actual knowledge of your codebase. &lt;/p&gt;
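&lt;p&gt;The search itself is nothing exotic - conceptually it's just a grep over the scratchpads directory. A tiny simulation (the issue number and plan content here are made up): &lt;/p&gt;

```shell
# Simulate the scratchpad search: seed one past plan, then ask
# "have I solved authentication before?" File name and contents are invented.
mkdir -p scratchpads
cat > scratchpads/issue-42-plan.md <<'EOF'
## Plan: add JWT authentication to the API
- contract: POST /auth/login returns a signed token
- edge cases: expired token, malformed Authorization header
EOF

# -r recurse, -i case-insensitive, -l list matching files only
grep -ril "authentication" scratchpads/
# → scratchpads/issue-42-plan.md
```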

&lt;p&gt;It's like having documentation that is, for once, kept up to date because it's generated as part of building features, not as an afterthought nobody bothers with. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Past PRs are pattern recognition.&lt;/strong&gt; When Claude starts planning a new feature, it searches through your recent merged PRs. Not to copy code - to understand patterns. "How did we structure the last repository?" "What was the approach for error handling in the last API endpoint?" &lt;/p&gt;

&lt;p&gt;This surprised me. I expected Claude to just look at the current state of the code. Turns out, looking at the history of how features were built is way more useful. It sees the patterns, the decisions, the "we did it this way because of X" context that doesn't exist in the final code. &lt;/p&gt;
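&lt;p&gt;The real workflow reads merged PRs through the GitHub CLI; the closest self-contained equivalent is grepping commit history, which carries much of the same signal. A toy demo - the repo and commit messages are invented for illustration: &lt;/p&gt;

```shell
# Toy demo of mining history for "how did we handle X last time?"
git init -q pattern-demo
cd pattern-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "feat: user repository with centralised error handling"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "feat: profile page"

# "What was the approach for error handling?" - search the messages.
git log --oneline --grep="error handling"
# prints the matching commit, e.g. "abc1234 feat: user repository with centralised error handling"
```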

&lt;p&gt;&lt;strong&gt;Related files are semantic search.&lt;/strong&gt; Claude doesn't just look at files you mention. It searches your codebase for conceptually related files. Working on a user profile feature? It'll find other user-related files even if they're not explicitly mentioned in the issue. &lt;/p&gt;

&lt;p&gt;It understands the domain structure, not just the file names. Which matters because half the time I don't remember every file that touches a particular feature. Claude searches, finds the related bits, and references them in the plan. Saves me having to play "grep detective" through my own codebase. &lt;/p&gt;

&lt;p&gt;This is why the PLAN stage is non-negotiable. That's when all this context gathering happens. Claude reads the issue, searches scratchpads, checks past PRs, finds related files, and THEN creates a plan. By the time it starts building, it's got the full picture. &lt;/p&gt;

&lt;p&gt;Without this, you're just getting generic AI code that might work but doesn't fit your codebase. With it, Claude's writing code that looks like you wrote it, because it's learned from everything you've already built. &lt;/p&gt;

&lt;p&gt;The difference is the same as someone who's worked on your codebase for six months versus someone who just read the README. One of them knows why things are the way they are. The other is guessing. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up a New Project
&lt;/h2&gt;

&lt;p&gt;Once you've got your global config sorted, project setup is straightforward: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the repo (I use GitHub CLI: gh repo create project-name --public --clone) &lt;/li&gt;
&lt;li&gt;Add a project-specific CLAUDE.md - this is where project context lives&lt;/li&gt;
&lt;li&gt;Create .claude/commands/ for any project-specific commands &lt;/li&gt;
&lt;li&gt;Add a .claudeignore - keep Claude out of node_modules, .git, build artifacts &lt;/li&gt;
&lt;li&gt;Create a scratchpads/ directory - this is Claude's working memory&lt;/li&gt;
&lt;/ol&gt;
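&lt;p&gt;Scripted, the five steps look roughly like this. Step 1 is commented out so the sketch runs offline, and the file contents are placeholders rather than my actual templates: &lt;/p&gt;

```shell
# Steps 2-5 of the project setup as a script; contents are placeholders.
# Step 1 needs network and auth, so it stays commented:
# gh repo create project-name --public --clone && cd project-name

# Steps 3 and 5: project-specific commands, and Claude's working memory
mkdir -p .claude/commands scratchpads

# Step 2: project context ("what we're building")
cat > CLAUDE.md <<'EOF'
# Project: <name>

## Brief
## Architecture decisions
## Anything Claude needs to know about THIS project
EOF

# Step 4: keep Claude out of generated and vendored files
cat > .claudeignore <<'EOF'
node_modules/
.git/
dist/
build/
EOF
```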

&lt;p&gt;The project CLAUDE.md is where you put the brief, architecture decisions, anything Claude needs to know about THIS specific project. Think of global config as "how to write code" and project config as "what we're building." &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36aloehksfzmw4opi9o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36aloehksfzmw4opi9o7.png" alt="Project-specific CLAUDE.md template" width="581" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scratchpads directory is where Claude stores all its plans. Over time, this becomes a knowledge base of how features were built, what decisions were made, what problems were encountered. It's like having documentation that actually stays up to date because it's generated as part of the build process. &lt;/p&gt;

&lt;h2&gt;
  
  
  Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;Theory is nice. Let's see it in practice. &lt;/p&gt;

&lt;p&gt;When I run /SHIP issue-123, here's what happens: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude reads the GitHub issue &lt;/li&gt;
&lt;li&gt;Searches scratchpads and past PRs for context &lt;/li&gt;
&lt;li&gt;Creates a detailed plan and waits for my approval &lt;/li&gt;
&lt;li&gt;Creates a feature branch and implements the solution &lt;/li&gt;
&lt;li&gt;Runs tests (generates new ones if needed) &lt;/li&gt;
&lt;li&gt;Self-reviews against the standards checklist &lt;/li&gt;
&lt;li&gt;Opens a PR &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Time breakdown:&lt;/strong&gt; ~10-15 minutes total for a moderate feature. My involvement? About 2 minutes reviewing and approving at each checkpoint. &lt;/p&gt;

&lt;p&gt;If Claude misunderstands the issue, I catch it at the PLAN stage before any code is written. If the implementation goes sideways, the REVIEW step catches standards violations. If tests fail, the workflow stops - no PR gets created with failing tests. At any checkpoint, I can give feedback and Claude iterates. &lt;/p&gt;

&lt;p&gt;Progress, not perfection. It's still way better than "vibe coding" and hoping for the best. &lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;You've got the setup. You've got the commands. You've got the guardrails. More importantly, you understand why context gathering is the secret sauce that makes autonomous workflows actually work. &lt;/p&gt;

&lt;p&gt;In Part 3, I'll walk through building a real project with this workflow - from project brief to working MVP. You'll see where it shines, where it struggles, and how to course-correct when Claude gets ideas above its station. &lt;/p&gt;

&lt;p&gt;For now, set it up. Try it. Break it. Add your own commands. Make it yours. &lt;/p&gt;

&lt;p&gt;And if Claude goes rogue? Well, that's what the approval checkpoints are for. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want the full setup?&lt;/strong&gt; Comment "SETUP" below and I'll share the repo with all my commands, standards file, example project structure, and those scratchpad templates that make context gathering work. &lt;/p&gt;




&lt;p&gt;I'm Mike Nash, an Independent Technical Consultant who's spent the last 10+ years building software, leading engineering teams, and watching the industry evolve from WebForms to AI-powered development. &lt;/p&gt;

&lt;p&gt;These days I help companies solve gnarly technical problems, modernise legacy systems, and build software that doesn't fall over. I also apparently write blog posts about shouting at AI tools. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to chat about AI workflows or consulting?&lt;/strong&gt; &lt;br&gt;
Connect with me on: &lt;a href="https://www.linkedin.com/in/michael-nash-504838b3/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>software</category>
      <category>claudecode</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Claude Code: Why AI Needs Guardrails</title>
      <dc:creator>Michael Nash</dc:creator>
      <pubDate>Wed, 04 Feb 2026 12:29:06 +0000</pubDate>
      <link>https://forem.com/michaelnash/part-1-why-ai-needs-guardrails-1md6</link>
      <guid>https://forem.com/michaelnash/part-1-why-ai-needs-guardrails-1md6</guid>
      <description>&lt;p&gt;This is Part 1 of a 3-part series on building AI coding workflows that don't go rogue.&lt;/p&gt;

&lt;p&gt;I’ve been tinkering with AI, as everybody should be, especially those in the tech space. &lt;/p&gt;

&lt;p&gt;It’s completely disrupted the industry, and I feel like we’ve got three lanes that people are putting themselves into: AI evangelists, anti-AI, and the place in-between. The latter recognises the potency of what’s available for our use, but acts with caution, especially within software development. I’m sitting in that lane. &lt;/p&gt;

&lt;p&gt;We are seeing countless posts and adverts of people “vibe coding” and loving the fact that they can whip up a small application in a weekend. Who doesn’t love that? I think it’s fantastic that people want to play, that they want that quick hit from seeing something come to life, or they just want to prototype something. Turning that into an enterprise-level application, however, that’s where things get hairy. &lt;/p&gt;

&lt;p&gt;Without the right frameworks, guardrails, knowledge, and plans in place, it’s the equivalent of giving a toddler a power tool. Powerful tool, delicate hands. &lt;/p&gt;

&lt;p&gt;I’ve spent over 10 years watching this industry evolve. From WebForms to SPAs, from IIS deployments to cloud infrastructure, through the rise and fall of countless frameworks. AI is just the latest toolset at our disposal. The difference? It’s exponentially sped up the coding stage. That could be disastrous or fantastic, depending on who’s wielding it. &lt;/p&gt;

&lt;p&gt;I’ve recently been experimenting with the latest hype, Claude Code. Prior to this, I’ve delved into ChatGPT (who hasn’t?), OpenAI integration, Copilot, Cursor and various others, and they’ve all been useful in their own ways. Now, I’ve become interested in agentic AI flows through Claude Code’s CLI. &lt;/p&gt;

&lt;p&gt;The Claude Code CLI is Anthropic’s agentic coding tool you run in your terminal. It reads your entire project context, takes natural language commands, and can propose changes, create commits, run tests, and handle GitHub issues. You can impose quality gates with human-in-the-loop approval or let it run wild. &lt;/p&gt;

&lt;p&gt;I’ve made plenty of mistakes in the past, especially with Cursor – outsourcing too much development or trying to tackle humongous changes. It always ends up with the same result: me losing all empathy and shouting and cursing at the AI like it’s a disobedient apprentice for not following exactly what I’ve been saying, going completely rogue and introducing copious amounts of bullshit I didn’t ask for, or breaking everything else in the process. &lt;/p&gt;

&lt;p&gt;I once asked it to refactor a CQRS pattern. It decided that wasn’t ambitious enough and started introducing commands and queries I’d never even thought of, let alone asked for. Cheers for that, mate. &lt;/p&gt;

&lt;p&gt;Here’s the thing: coding itself has never been the most complex part, not after a few years in the industry anyway. Before AI, we had StackOverflow, trawling through documentation and Google Fu. The crucial bit? Understanding whatever weird fix you were about to copy into your codebase, not doing it blindly. We essentially have the same solutions now, but on steroids and faster. Understanding is still crucial. &lt;/p&gt;

&lt;p&gt;So, when I started building with Claude Code, I knew I needed that same discipline – understand first, execute second. &lt;/p&gt;

&lt;p&gt;My challenge here was to attempt to make a flow as autonomous as possible, following the best practices that I have set, and handling large workflows: from planning to breaking down tasks, editing files, testing and creating pull requests with a human-in-the-loop approach in order to minimise risk. &lt;/p&gt;

&lt;p&gt;I’ve found AI most useful for scaffolding, building foundational pieces, handling the mundane tasks. Essentially using it as an assistant that I mentor – not outsourcing my thinking to it. The longer you spend designing and building software, the more you realise: the less code, the better. You start thinking in terms of systems, design patterns, architectural decisions. The same principles apply with AI: you’re still the engineer – it’s just doing the heavy lifting. &lt;/p&gt;

&lt;p&gt;Claude Code operates on command files you create in plain English. You can set global standards and project-specific briefs – chain commands together – and the more detail you give it, the better it performs. The flow I built breaks work down systematically – from project brief to individual tasks. It then uses a /SHIP command with quality gates at each stage: plan, build, test, self-review and PR creation. Human-in-the-loop approval is required at every checkpoint. I’ll walk through the exact setup in Part 2. &lt;/p&gt;

&lt;p&gt;My approach evolved to include a GLOBAL STANDARDS file: my core principles (SOLID, DRY, YAGNI, DDD, and so forth) and standards for code organisation, error handling, security, accessibility – everything I care about when designing software. It’s essentially the popular adage attributed to Abe Lincoln: “If I had six hours to chop down a tree, I would spend the first four hours sharpening the axe.” &lt;/p&gt;
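&lt;p&gt;If you want a starting point, a standards file can begin as nothing more than headed lists. The sections and rules below are illustrative, not my actual file: &lt;/p&gt;

```shell
# Sketch of a minimal global standards file; every section and rule here
# is a placeholder example, not the author's real configuration.
cat > STANDARDS.md <<'EOF'
# Global Standards (sketch)

## Principles
- SOLID, DRY, YAGNI; domain-driven design where it earns its keep

## Code organisation
- One responsibility per file; feature folders over layer folders

## Error handling
- Fail loudly at boundaries; never swallow exceptions silently

## Security and accessibility
- Validate all input; semantic HTML and keyboard navigation by default
EOF

# Count the sections as a quick sanity check
grep -c '^##' STANDARDS.md
# → 4
```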

&lt;p&gt;Preparation and smart work are more efficient and effective than brute force. &lt;/p&gt;

&lt;p&gt;In Part 2, I’ll walk through exactly how to set up this workflow: the commands, the structure, the guardrails. Then, in Part 3, we’ll use this to turn a project brief into a working MVP. Stay tuned. &lt;/p&gt;




&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I'm Mike Nash, an Independent Technical Consultant who's spent the last 10+ years building software, leading engineering teams, and watching the industry evolve from WebForms to AI-powered development. &lt;/p&gt;

&lt;p&gt;These days I help companies solve gnarly technical problems, modernise legacy systems, and build software that doesn't fall over. I also apparently write blog posts about shouting at AI tools. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to chat about AI workflows, technical leadership, or swap war stories?&lt;/strong&gt;&lt;br&gt;
Connect with me on &lt;a href="https://www.linkedin.com/in/michael-nash-504838b3/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop a comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>software</category>
      <category>claudecode</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
