<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Karl Wirth</title>
    <description>The latest articles on Forem by Karl Wirth (@stravukarl).</description>
    <link>https://forem.com/stravukarl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3773173%2Ff1d605ca-2e92-4f44-b6a5-75c04a4a5ac7.png</url>
      <title>Forem: Karl Wirth</title>
      <link>https://forem.com/stravukarl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stravukarl"/>
    <language>en</language>
    <item>
      <title>The Bugs AI Writes: Five Patterns That Show Up in AI-Generated Code</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:17:15 +0000</pubDate>
      <link>https://forem.com/stravukarl/the-bugs-ai-writes-five-patterns-that-show-up-in-ai-generated-code-bl3</link>
      <guid>https://forem.com/stravukarl/the-bugs-ai-writes-five-patterns-that-show-up-in-ai-generated-code-bl3</guid>
      <description>&lt;p&gt;Reviewing AI-generated code has quietly become one of the most time-consuming parts of modern software development. As AI coding tools move from autocomplete to autonomous agents, developers are spending more of their day reading diffs they didn't write.&lt;/p&gt;

&lt;p&gt;VentureBeat recently reported that 43% of AI-generated code changes need debugging in production. ByteIota found AI code produces 1.7x more issues per pull request than human code. And 60% of AI code faults are "silent failures" that compile and pass tests but produce wrong results.&lt;/p&gt;

&lt;p&gt;The stats alone aren't useful unless you know what to look for. Review enough AI-generated diffs, though, and the bug patterns become consistent enough to categorize.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: Plausible but wrong logic
&lt;/h2&gt;

&lt;p&gt;This is the most common pattern and the hardest to catch: AI writes code that looks correct and passes basic tests but handles edge cases incorrectly.&lt;/p&gt;

&lt;p&gt;Example: an agent writes a date parser that handles common formats fine but silently converts ambiguous dates like "04/05/2026" using US formatting when the codebase uses ISO 8601. No error, no crash, just wrong data.&lt;/p&gt;
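&lt;p&gt;A minimal sketch of that failure mode (the parser is invented for illustration, not taken from a real diff):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical AI-written helper: handles ISO dates explicitly,
// then falls back to the Date constructor for everything else.
function parseDate(input: string): Date {
  if (/^\d{4}-\d{2}-\d{2}$/.test(input)) {
    return new Date(input); // ISO 8601: unambiguous
  }
  // Silent trap: the Date constructor reads "04/05/2026" as April 5
  // (US month-first), even if the codebase treats it as May 4.
  return new Date(input);
}

parseDate("2026-05-04"); // May 4, as intended
parseDate("04/05/2026"); // April 5: no error, no crash, just wrong data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;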

&lt;p&gt;AI agents optimize for the happy path. They write code that works for the test cases you'd think to write, but miss implicit conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch it:&lt;/strong&gt; Review AI code like code from a smart contractor who just joined. Check assumptions about data formats, timezone handling, null behavior, and business rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: Confident refactoring that breaks callers
&lt;/h2&gt;

&lt;p&gt;When an agent refactors a module, it makes the module internally cleaner while subtly changing the external contract. Renamed parameters, changed return types, modified defaults.&lt;/p&gt;

&lt;p&gt;TypeScript catches the obvious interface changes. It doesn't catch behavioral changes three files away where code depended on the old behavior.&lt;/p&gt;
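&lt;p&gt;A sketch of the kind of change that slips through (the function name and config store are invented):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const store = new Map&lt;string, string&gt;();

// Before the refactor this threw for missing keys, and callers relied on
// the throw. The agent "simplified" it to return a fallback instead.
function getConfig(key: string, fallback = ""): string {
  return store.get(key) ?? fallback;
}

// Three files away, startup used that throw to fail fast when the webhook
// URL was unset. The signature still type-checks, so TypeScript is happy,
// but the app now boots and registers an empty URL instead of crashing.
const url = getConfig("webhookUrl");
console.log(`registering webhook at "${url}"`); // silently registers ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;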

&lt;p&gt;&lt;strong&gt;Catch it:&lt;/strong&gt; When reviewing a refactor, search the codebase for every caller of the refactored interface. If the agent says "simplified the return type," check whether any caller depended on the complexity that was removed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: Tests that test implementation, not behavior
&lt;/h2&gt;

&lt;p&gt;AI writes tests that pass by construction. A common example: tests where the expected value is literally copied from the function's return value rather than independently calculated.&lt;/p&gt;
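&lt;p&gt;Here's what that looks like in practice (a made-up example; Vitest is assumed as the test runner):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { it, expect } from "vitest";

function applyDiscount(total: number): number {
  return total * 0.9; // bug: the spec said 15% off, the agent wrote 10%
}

it("applies the discount", () =&gt; {
  // 180 was copied from the function's actual output, not computed from
  // the requirement, so the test encodes the bug instead of the spec.
  expect(applyDiscount(200)).toBe(180);
});

it("applies a 15% discount", () =&gt; {
  // Behavioral version: the expected value comes from the spec (15% off
  // 200 is 170), so this test fails until the bug is fixed.
  expect(applyDiscount(200)).toBe(170);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;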

&lt;p&gt;Another variant: mocking everything so the test validates the mocking framework, not the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch it:&lt;/strong&gt; Ask: "Would this test fail if the function returned a hardcoded value?" Favor integration tests over unit tests for AI code. Mocks should be the exception.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 4: Copy-paste drift across similar components
&lt;/h2&gt;

&lt;p&gt;When creating multiple similar components, the agent copies from the first but doesn't copy consistently. One endpoint validates input, another doesn't. One component handles loading states, its sibling doesn't.&lt;/p&gt;

&lt;p&gt;Each component looks fine in isolation. The inconsistency only shows when you compare them.&lt;/p&gt;
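&lt;p&gt;A contrived but representative pair (Zod is assumed as the validation library; the handlers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { z } from "zod";

const projectSchema = z.object({ name: z.string().min(1) });

// First handler, written from scratch: validates its input.
function createProject(body: unknown) {
  const parsed = projectSchema.parse(body); // throws on bad input
  return { id: crypto.randomUUID(), ...parsed };
}

// Sibling handler, copied from the first, minus the validation.
// It looks fine in isolation; only a diff against its sibling shows it
// will happily persist whatever shape the client sends.
function createTask(body: unknown) {
  return { id: crypto.randomUUID(), ...(body as object) };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;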

&lt;p&gt;&lt;strong&gt;Catch it:&lt;/strong&gt; Diff similar components against each other. Any difference should be intentional. Inconsistencies usually mean the pattern should be extracted into a shared abstraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 5: Dependency and import sprawl
&lt;/h2&gt;

&lt;p&gt;AI agents install packages liberally. Asked to add a date picker, they'll pull in a new date library even when one already exists in the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch it:&lt;/strong&gt; Check whether the project already has a library for the same purpose. Document preferred libraries in CLAUDE.md so the agent knows what's available.&lt;/p&gt;

&lt;h2&gt;
  
  
  The review process for AI code
&lt;/h2&gt;

&lt;p&gt;AI code review requires different assumptions than traditional review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assume no institutional knowledge.&lt;/strong&gt; The agent doesn't know your conventions unless documented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review boundaries, not internals.&lt;/strong&gt; Bugs live at interfaces: function signatures, API contracts, error handling, data formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test behavior, not implementation.&lt;/strong&gt; Run the code under real conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check what wasn't changed.&lt;/strong&gt; If the agent added a feature, check whether existing error handling still applies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope tasks tightly.&lt;/strong&gt; A 30-minute, 3-file task is reviewable. A 2-hour, 20-file task is a coin flip.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this scales poorly without process
&lt;/h2&gt;

&lt;p&gt;The 43% debugging rate isn't because AI writes uniquely bad code. It's because traditional review is tuned to catch human mistakes (logic errors, forgotten cases, typos), while AI makes different ones. Teams that handle this well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document everything the agent needs to know (architecture decisions, conventions, preferred libraries)&lt;/li&gt;
&lt;li&gt;Scope tasks small enough to review thoroughly&lt;/li&gt;
&lt;li&gt;Treat review as a first-class activity, not something to rush through&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code quality bar doesn't change because the author isn't human. The failure modes are less familiar, which means the review process needs to be more deliberate.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://nimbalyst.com/blog/bugs-ai-writes-patterns-in-ai-generated-code/" rel="noopener noreferrer"&gt;nimbalyst.com/blog&lt;/a&gt;. Nimbalyst is a visual workspace built on Claude Code for managing AI coding workflows.&lt;/em&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;Karl Wirth is the founder of &lt;a href="https://nimbalyst.com" rel="noopener noreferrer"&gt;Nimbalyst&lt;/a&gt;, a desktop workspace built on top of Claude Code that adds visual editing, multi-agent orchestration, session management, and scheduled automations to AI-assisted development. He writes about AI coding tools, agent orchestration, and running a small company that ships a lot.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Code Routines: A Practical Guide from Someone Already Automating AI Workflows</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:15:03 +0000</pubDate>
      <link>https://forem.com/stravukarl/claude-code-routines-a-practical-guide-from-someone-already-automating-ai-workflows-4dd6</link>
      <guid>https://forem.com/stravukarl/claude-code-routines-a-practical-guide-from-someone-already-automating-ai-workflows-4dd6</guid>
      <description>&lt;p&gt;Three days ago, Anthropic shipped Routines for Claude Code. If you missed it: a routine packages a prompt, repos, and connectors into a configuration that runs on a schedule, responds to API calls, or triggers on GitHub events. Runs on Anthropic's cloud, laptop can be closed.&lt;/p&gt;

&lt;p&gt;I've been building similar automation workflows for months using different tooling (Nimbalyst automations, custom scripts). Routines makes this pattern accessible to every Claude Code user. Here's what I've learned about which automations actually work and where the limits are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Triggers
&lt;/h2&gt;

&lt;p&gt;Routines support scheduled (hourly/daily/weekly), API (HTTP POST with bearer token), and GitHub event triggers. Each run creates a full Claude Code cloud session with shell access, skills, and connectors.&lt;/p&gt;
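&lt;p&gt;The API trigger is the one you'll script against. Here's a minimal sketch of firing a routine from CI or a webhook handler; the endpoint URL and payload shape are placeholders, not Anthropic's documented contract, so check the Routines docs before wiring this up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// ROUTINE_URL and the JSON body are assumptions for illustration,
// not Anthropic's documented API.
const res = await fetch(process.env.ROUTINE_URL!, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ROUTINE_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ reason: "post-deploy verification" }),
});
if (!res.ok) throw new Error(`routine trigger failed: ${res.status}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;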

&lt;h2&gt;
  
  
  Workflows That Deliver Real Value
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue triage (daily/nightly):&lt;/strong&gt; Scan new GitHub issues, cross-reference with your codebase to identify affected modules, apply labels, estimate priority, post a summary to Slack. The AI doesn't just categorize text; it reads the code and makes informed severity assessments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation drift (weekly):&lt;/strong&gt; Scan merged PRs, identify docs referencing changed APIs, open update PRs. This catches staleness before it becomes a support burden. Nobody has time to do this manually, which is exactly why it should be automated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy verification (event-triggered):&lt;/strong&gt; After deploys, run smoke checks, scan error logs, post a go/no-go assessment. Not replacing your test suite, but adding an AI review layer that reads logs with more context than threshold checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Constraints to Know
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usage caps:&lt;/strong&gt; Pro gets 5 runs/day, Max gets 15, Team/Enterprise gets 25. Plan your cadences accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No mid-run interaction:&lt;/strong&gt; Routines run fully autonomously. Best for tasks producing reports, PRs, or messages. Not for work requiring human judgment during execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-only:&lt;/strong&gt; Clones your repo to Anthropic's infrastructure. No access to local tooling or network services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Visit &lt;code&gt;claude.ai/code/routines&lt;/code&gt; or type &lt;code&gt;/schedule&lt;/code&gt; in the CLI.&lt;/p&gt;

&lt;p&gt;Start with weekly documentation drift detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scan PRs merged in the past 7 days. For each, identify docs
referencing modified functions or APIs. If outdated, open a PR
with suggested updates.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review the first few outputs, calibrate the prompt, then let it run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Layered Approach
&lt;/h2&gt;

&lt;p&gt;The most productive setup uses multiple layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven routines&lt;/strong&gt; for immediate responses (PR triggers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled routines&lt;/strong&gt; for periodic maintenance (nightly triage, weekly doc checks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local automations&lt;/strong&gt; for environment-specific work (custom tooling, workspace context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive sessions&lt;/strong&gt; for complex, judgment-heavy work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Trying to push everything through one approach leaves gaps. Routines handle the cloud-native, repetitive layer well. For local environment work, you need complementary tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;In the first two weeks of April, Cursor shipped Cursor 3 (agent-first workspace), Windsurf launched 2.0 (Agent Command Center + Devin), and Anthropic redesigned Claude Code with Routines. All converging on the same idea: the developer's role is shifting toward orchestrating agents, not writing every line.&lt;/p&gt;

&lt;p&gt;Routines are the automation edge of this shift. Start with one, run it for two weeks, calibrate, then add more. Build your automation layer incrementally.&lt;/p&gt;




&lt;p&gt;I'm Karl, building &lt;a href="https://nimbalyst.com" rel="noopener noreferrer"&gt;Nimbalyst&lt;/a&gt;, a visual workspace on top of Claude Code. I write about AI-native development workflows and what actually works in practice.&lt;/p&gt;

&lt;p&gt;Original article on &lt;a href="https://nimbalyst.com/blog/claude-code-routines-practical-guide/" rel="noopener noreferrer"&gt;nimbalyst.com/blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>automation</category>
      <category>development</category>
    </item>
    <item>
      <title>What actually breaks when you run 5+ Claude Code agents in parallel</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:13:31 +0000</pubDate>
      <link>https://forem.com/stravukarl/what-actually-breaks-when-you-run-5-claude-code-agents-in-parallel-1lbd</link>
      <guid>https://forem.com/stravukarl/what-actually-breaks-when-you-run-5-claude-code-agents-in-parallel-1lbd</guid>
      <description>&lt;p&gt;The parallel-agent workflow stopped being a frontier a few weeks ago. Cursor 3's Agents Window, Windsurf 2.0's free parallel agents, and Anthropic's April 14 Claude Code desktop redesign (multi-session sidebar, per-session worktrees, rebuilt diff viewer, Routines for scheduling) all ship the same core idea: run many agents in isolated worktrees from one surface. If you were still juggling raw terminal tabs last week, upgrade this week.&lt;/p&gt;

&lt;p&gt;I've been running four to six parallel sessions every day for the last two months, across and ahead of these releases. Here's what the new tools still don't fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A session list is not the same as knowing what an agent is doing.&lt;/strong&gt; A sidebar with a row per session and a status chip beats &lt;code&gt;zsh&lt;/code&gt;, &lt;code&gt;zsh 2&lt;/code&gt;, &lt;code&gt;zsh 3&lt;/code&gt;. It doesn't tell me that session 3 is stuck reconciling three conflicting test fixtures and that I already have notes about those fixtures in another session from last Tuesday. The session is a row. The work is a richer object than a row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Finding the pinging session is better, not solved.&lt;/strong&gt; "Session 3 needs input" is a real improvement over terminal bells. What's missing is the connective tissue: I want "session 3 (refactor file watcher, linked to tracker #432) is asking whether to keep the old onChange signature" so I can answer in context instead of alt-tabbing into a transcript and reading back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cross-session diff review is still brutal.&lt;/strong&gt; Every new tool rebuilt the diff viewer. None of them handle the common case where three parallel agents touch coupled code (file watcher + new IPC handlers + tests for both) and need a single combined review, not three separate ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Context handoff between agents is still manual.&lt;/strong&gt; Agent A designs a data model. Agent B writes the migration. Agent C writes tests. Each agent can read the code. What they can't recover is the reasoning from the previous session's transcript, which is where most of the important context lives. Every tool has a transcript, and every transcript is trapped in the session that produced it. It's worse when you mix Claude Code and Codex in the same day (I do — they're better at different things), because now the transcripts aren't even in the same tool. I still end up copy-pasting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Scheduled work needs the whole workspace.&lt;/strong&gt; Claude Code Routines and Cursor's scheduled agents are here. What's still missing is scheduling that can read and write across the whole workspace: open sessions, tracker items, notes, decisions, yesterday's transcripts. A stateless scheduled script does 60% of that. A scheduled agent with full workspace context does all of it.&lt;/p&gt;

&lt;p&gt;The parallel-agent layer is now table stakes. Session management with worktrees is shipped by every serious AI coding tool. The workspace around the sessions (work as a first-class object, cross-session review, shared context that travels, scheduled agents with full workspace access) is still mostly empty, and it has to work across the agents you actually use, not just one vendor's. I'm building into that gap with &lt;a href="https://nimbalyst.com" rel="noopener noreferrer"&gt;Nimbalyst&lt;/a&gt;, a desktop workspace that runs sessions across Claude Code and Codex in the same project, treats every session as a card on a kanban, keeps transcripts, notes, diagrams, and data models in the same file tree so any agent can read them, and runs scheduled automations that share that context.&lt;/p&gt;

&lt;p&gt;If you've hit these same gaps, I'd love to hear how you're solving them. Drop a comment or reply with your setup.&lt;/p&gt;




&lt;p&gt;Original article on &lt;a href="https://nimbalyst.com/blog/parallel-claude-code-agents-what-breaks-after-worktrees/" rel="noopener noreferrer"&gt;nimbalyst.com/blog&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Tried Every Claude Code Editor. Here Is What Actually Works</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:47:54 +0000</pubDate>
      <link>https://forem.com/stravukarl/i-tried-every-claude-code-editor-here-is-what-actually-works-2ok</link>
      <guid>https://forem.com/stravukarl/i-tried-every-claude-code-editor-here-is-what-actually-works-2ok</guid>
      <description>&lt;p&gt;Claude Code itself is not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is everything around it: planning the work, tracking multiple sessions, reviewing diffs, and keeping branch state sane once you stop using it like a toy and start using it like part of your real workflow.&lt;/p&gt;

&lt;p&gt;That is what I was optimizing for when I went looking for the best Claude Code interface.&lt;/p&gt;

&lt;p&gt;Two disclosures up front:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I care more about workflow than pretty chat UI&lt;/li&gt;
&lt;li&gt;I am the founder of Nimbalyst, so I am biased and I should say that plainly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tried the common options and kept coming back to one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this tool make Claude Code easier to supervise once the agent is doing serious work?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Raw Terminal
&lt;/h2&gt;

&lt;p&gt;This is still the cleanest starting point.&lt;/p&gt;

&lt;p&gt;Open a terminal, run &lt;code&gt;claude&lt;/code&gt;, and get to work. Nothing is faster for one-off tasks, scripted workflows, or short focused sessions. If you already live in tmux, you can get surprisingly far with this setup.&lt;/p&gt;

&lt;p&gt;Where it breaks is not coding. It is management.&lt;/p&gt;

&lt;p&gt;Once you have multiple sessions, the terminal becomes a memory test. Which tab owns which task? Which session changed what? Which one is waiting for input? Which branch is safe to merge?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; single-session workflows, shell-heavy users, quick tasks&lt;/p&gt;

&lt;h2&gt;
  
  
  3. VS Code + Integrated Terminal
&lt;/h2&gt;

&lt;p&gt;This is probably the most common real-world setup.&lt;/p&gt;

&lt;p&gt;VS Code gives you a file tree, editor, git panel, and diff viewer in the same place where Claude Code is running. That is enough for a lot of people. You get a better review surface than raw terminal without changing your stack.&lt;/p&gt;

&lt;p&gt;The weakness is that Claude Code is still basically "a terminal tab inside VS Code." The editor helps with inspection, but it does not really help with orchestration. If you open three concurrent sessions, you are still juggling tabs manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers who already live in VS Code and usually run one or two sessions&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Zed
&lt;/h2&gt;

&lt;p&gt;Zed is the option I would pick if my main complaint was editor drag.&lt;/p&gt;

&lt;p&gt;It is fast, visually quiet, and better than heavier editors at staying out of the way. Claude Code works well there because a good terminal, quick navigation, and responsive diff inspection already solve a lot of the daily pain.&lt;/p&gt;

&lt;p&gt;The tradeoff is ecosystem depth. If you rely on very specific extensions or highly customized IDE workflows, Zed may feel narrower than VS Code. But if your editor job is mostly "be fast while Claude Code does the heavy lifting," Zed is excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers who want speed and minimal overhead&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Nimbalyst
&lt;/h2&gt;

&lt;p&gt;Nimbalyst is the tool that is designed for the actual bottleneck: agent supervision.&lt;/p&gt;

&lt;p&gt;The useful part is not just that it can run Claude Code. It is that it treats sessions, tasks, plans, mockups, markdown, excalidraw, diffs, and supporting artifacts as part of the same job. You can manage multiple sessions, inspect file changes by session, work from plan documents, and review outputs in a way that feels built for parallel agent work instead of retrofitted after the fact.&lt;/p&gt;

&lt;p&gt;It also matters that Nimbalyst is not just a shell around the agent. It includes a local code editor, document editors, mockups, diagrams, file history, and mobile monitoring. That makes it materially different from tools whose main value is "nicer transcript UI."&lt;/p&gt;

&lt;p&gt;The tradeoff is obvious: it is a bigger system. If you only run one Claude Code session at a time, Nimbalyst may be more workflow than you need. If you run several, the value shows up quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers and teams managing multiple Claude Code sessions, plans, and reviews at once&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Cursor and Windsurf?
&lt;/h2&gt;

&lt;p&gt;They matter, but I think of them differently.&lt;/p&gt;

&lt;p&gt;Cursor and Windsurf are strong AI-native editors. I would absolutely consider them if your primary goal is inline AI editing inside the editor itself. But for Claude Code specifically, they are usually complements, not true wrappers. Claude Code still tends to live in a terminal panel while the editor's own AI system handles the native experience.&lt;/p&gt;

&lt;p&gt;That makes them good choices for mixed workflows and less clean choices if your question is narrowly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What interface is best for Claude Code itself?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best at&lt;/th&gt;
&lt;th&gt;Breaks when&lt;/th&gt;
&lt;th&gt;Right user&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw terminal&lt;/td&gt;
&lt;td&gt;Speed and control&lt;/td&gt;
&lt;td&gt;You run multiple sessions&lt;/td&gt;
&lt;td&gt;tmux and shell-heavy users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VS Code&lt;/td&gt;
&lt;td&gt;Familiar editing + diffs&lt;/td&gt;
&lt;td&gt;Session coordination gets messy&lt;/td&gt;
&lt;td&gt;most developers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zed&lt;/td&gt;
&lt;td&gt;Fast, low-friction editing&lt;/td&gt;
&lt;td&gt;You need a broader ecosystem&lt;/td&gt;
&lt;td&gt;performance-focused users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nimbalyst&lt;/td&gt;
&lt;td&gt;Multi-session supervision&lt;/td&gt;
&lt;td&gt;You only want a lightweight wrapper&lt;/td&gt;
&lt;td&gt;daily Claude Code users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Actually Matters More Than the UI
&lt;/h2&gt;

&lt;p&gt;No interface saves you from a bad workflow.&lt;/p&gt;

&lt;p&gt;The three things that matter most are still:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a plan before starting the agent&lt;/li&gt;
&lt;li&gt;Use git worktrees for concurrent sessions&lt;/li&gt;
&lt;li&gt;Review diffs carefully before merging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you get those right, even the terminal can work.&lt;/p&gt;

&lt;p&gt;If you get those wrong, the nicest GUI in the world will mostly help you fail more comfortably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Take
&lt;/h2&gt;

&lt;p&gt;If you use Claude Code occasionally, stay simple. VS Code is enough.&lt;/p&gt;

&lt;p&gt;If you are using Claude Code as a daily operating system for real work, the problem changes. You stop needing "a better chat box" and start needing a better control plane.&lt;/p&gt;

&lt;p&gt;That is where Nimbalyst is focused.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Karl Wirth is the founder of Nimbalyst, a local workspace for Claude Code and Codex.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>devtool</category>
    </item>
    <item>
      <title>Claude Code Development Workflow: Tools and Setup Guide for 2026</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:44:16 +0000</pubDate>
      <link>https://forem.com/stravukarl/claude-code-development-workflow-tools-and-setup-guide-for-2026-3m5i</link>
      <guid>https://forem.com/stravukarl/claude-code-development-workflow-tools-and-setup-guide-for-2026-3m5i</guid>
      <description>&lt;p&gt;Claude Code gets dramatically better when you stop treating it like "a chatbot in a terminal" and start treating it like part of a repeatable engineering workflow.&lt;/p&gt;

&lt;p&gt;The difference is usually not model quality. It is setup quality. Teams that get strong results do the same few things over and over: give the agent persistent context, write plans before prompting, isolate concurrent work, and review output like it came from a fast junior engineer.&lt;/p&gt;

&lt;p&gt;This is the setup I recommend if you want Claude Code to be useful on real projects, not just impressive in demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Setup Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add a project-level &lt;code&gt;CLAUDE.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create short plan docs before starting implementation&lt;/li&gt;
&lt;li&gt;Use git worktrees for concurrent sessions&lt;/li&gt;
&lt;li&gt;Auto-approve safe operations, but keep risky ones gated&lt;/li&gt;
&lt;li&gt;Pick one review surface and use it consistently&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Add a Project-Level CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Every serious Claude Code workflow starts here.&lt;/p&gt;

&lt;p&gt;Without a root &lt;code&gt;CLAUDE.md&lt;/code&gt;, every session has to rediscover your architecture, commands, conventions, and constraints. That wastes prompt budget and leads to avoidable mistakes. With it, the agent starts from a usable baseline.&lt;/p&gt;

&lt;p&gt;A good &lt;code&gt;CLAUDE.md&lt;/code&gt; is short, concrete, and opinionated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# CLAUDE.md&lt;/span&gt;

&lt;span class="gu"&gt;## Project Overview&lt;/span&gt;
Monorepo for a React frontend, Node API, and PostgreSQL database.

&lt;span class="gu"&gt;## Development Commands&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Start web: &lt;span class="sb"&gt;`npm run dev:web`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Start api: &lt;span class="sb"&gt;`npm run dev:api`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Test: &lt;span class="sb"&gt;`npm run test`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Lint: &lt;span class="sb"&gt;`npm run lint`&lt;/span&gt;

&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`/apps/web`&lt;/span&gt; - React frontend
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`/apps/api`&lt;/span&gt; - HTTP API
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`/packages/ui`&lt;/span&gt; - shared components
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`/packages/db`&lt;/span&gt; - schema and queries

&lt;span class="gu"&gt;## Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; TypeScript strict mode
&lt;span class="p"&gt;-&lt;/span&gt; Zod for request validation
&lt;span class="p"&gt;-&lt;/span&gt; Never query the DB directly from route handlers

&lt;span class="gu"&gt;## Guardrails&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not change auth middleware without explicit review
&lt;span class="p"&gt;-&lt;/span&gt; Keep migrations backward compatible
&lt;span class="p"&gt;-&lt;/span&gt; Prefer existing patterns over new abstractions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is specificity. "Use TypeScript" is weak. "Use TypeScript strict mode and validate requests with Zod" is useful.&lt;/p&gt;
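&lt;p&gt;To see why the specific version earns its keep, here's the pattern that one Zod line points the agent at (a sketch; the schema and function names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { z } from "zod";

const createUserSchema = z.object({
  email: z.string().email(),
  name: z.string().min(1),
});

// With the convention written down, the agent validates at the boundary
// instead of trusting req.body, and strict mode keeps the inferred type
// flowing through the rest of the handler.
export function parseCreateUser(body: unknown) {
  return createUserSchema.parse(body); // throws a ZodError on bad input
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;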

&lt;h2&gt;
  
  
  2. Give the Agent Context Beyond the File Tree
&lt;/h2&gt;

&lt;p&gt;Claude Code gets much better once it can inspect more than the current file tree.&lt;/p&gt;

&lt;p&gt;The three categories that matter most are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository context&lt;/strong&gt;: GitHub or git tooling so the agent can understand issues, PRs, and branch state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data context&lt;/strong&gt;: schema or safe query access for local/dev databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project context&lt;/strong&gt;: any internal tools, docs, or MCP servers your team already relies on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake here is over-configuring. Do not hand the agent ten tools you barely trust. Start with the few that meaningfully improve accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Set Permission Rules So the Agent Is Fast but Not Reckless
&lt;/h2&gt;

&lt;p&gt;The default "ask me for everything" experience is safe and miserable. The opposite extreme, where the agent can do anything without friction, is how you end up approving bad work after the fact.&lt;/p&gt;

&lt;p&gt;A practical default looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-approve reads inside the repo&lt;/li&gt;
&lt;li&gt;Auto-approve writes inside the repo&lt;/li&gt;
&lt;li&gt;Auto-approve test runs and other common safe commands&lt;/li&gt;
&lt;li&gt;Keep approvals for installs, git history rewrites, and anything outside the project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You want Claude Code to move quickly through normal implementation, but you still want a hard pause on operations that change system state or create cleanup work.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Write a Plan Before You Prompt
&lt;/h2&gt;

&lt;p&gt;This is the highest-leverage habit in the entire workflow.&lt;/p&gt;

&lt;p&gt;Do not start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build me user authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with a plan document in markdown that you iterate on with the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Feature: User Authentication&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
Session-based auth with registration, login, and password reset.

&lt;span class="gu"&gt;## Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use bcrypt
&lt;span class="p"&gt;-&lt;/span&gt; Store sessions in PostgreSQL
&lt;span class="p"&gt;-&lt;/span&gt; Rate limit login attempts
&lt;span class="p"&gt;-&lt;/span&gt; All routes under &lt;span class="sb"&gt;`/api/auth/*`&lt;/span&gt;

&lt;span class="gu"&gt;## Acceptance Criteria&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] User can register
&lt;span class="p"&gt;-&lt;/span&gt; [ ] User can log in and get a session cookie
&lt;span class="p"&gt;-&lt;/span&gt; [ ] User can reset password
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Failed logins are rate limited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then point Claude Code at the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="s2"&gt;"implement docs/auth-plan.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better than a long natural-language prompt because the plan becomes reusable. You can refine it, hand it to another agent, or review the finished work against it.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Use Git Worktrees for Parallel Sessions
&lt;/h2&gt;

&lt;p&gt;If you run more than one Claude Code session at a time in the same checkout, you are choosing pain.&lt;/p&gt;

&lt;p&gt;Use worktrees instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree add ../project-auth &lt;span class="nt"&gt;-b&lt;/span&gt; feature/auth
git worktree add ../project-tests &lt;span class="nt"&gt;-b&lt;/span&gt; feature/auth-tests
git worktree list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now each session gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;its own branch&lt;/li&gt;
&lt;li&gt;its own working directory&lt;/li&gt;
&lt;li&gt;no file collisions with the others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the foundation for reliable parallel work.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# API work&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../project-auth
claude &lt;span class="s2"&gt;"implement docs/auth-plan.md"&lt;/span&gt;

&lt;span class="c"&gt;# Test work&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../project-tests
claude &lt;span class="s2"&gt;"write tests for docs/auth-plan.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Worktrees are the single best upgrade for anyone moving from "one agent sometimes" to "multiple agents regularly."&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Pick a Session Management Pattern
&lt;/h2&gt;

&lt;p&gt;Once you have parallel sessions, you need a way to keep them straight.&lt;/p&gt;

&lt;p&gt;Three reasonable options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Terminal tabs&lt;/strong&gt;: fine for one or two sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tmux&lt;/strong&gt;: still the power-user default for keyboard-heavy workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nimbalyst&lt;/strong&gt;: useful if you want a visual board for sessions, file changes, diffs, and plan artifacts in one place&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends on your failure mode.&lt;/p&gt;

&lt;p&gt;If your issue is "I lose track of which session is running where," a visual session surface helps.&lt;/p&gt;

&lt;p&gt;If your issue is "I want everything on one keyboard-first screen," tmux is still excellent.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Review Like the Agent Is Usually 90% Right
&lt;/h2&gt;

&lt;p&gt;Claude Code is often right enough to feel finished before it actually is.&lt;/p&gt;

&lt;p&gt;That is why review matters. The common misses are not obvious syntax errors. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slightly wrong edge-case handling&lt;/li&gt;
&lt;li&gt;assumptions about existing abstractions&lt;/li&gt;
&lt;li&gt;tests that prove the happy path but not the failure path&lt;/li&gt;
&lt;li&gt;code that works but does not match local conventions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small changes, &lt;code&gt;git diff&lt;/code&gt; or your editor is enough.&lt;/p&gt;

&lt;p&gt;For larger changes, use a proper review surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;editor diff view&lt;/li&gt;
&lt;li&gt;draft pull request&lt;/li&gt;
&lt;li&gt;a tool that tracks per-session file changes and inline diffs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more files an agent touches, the less acceptable "quick skim and merge" becomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. My Default Claude Code Loop
&lt;/h2&gt;

&lt;p&gt;This is the loop I would teach a team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write or update a short plan&lt;/li&gt;
&lt;li&gt;Create a worktree&lt;/li&gt;
&lt;li&gt;Start Claude Code against that plan&lt;/li&gt;
&lt;li&gt;Let it run until it blocks or finishes&lt;/li&gt;
&lt;li&gt;Review the diff against the plan&lt;/li&gt;
&lt;li&gt;Fix or redirect&lt;/li&gt;
&lt;li&gt;Merge and remove the worktree&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the whole system. Most of the value comes from doing that loop consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Starting from a blank prompt
&lt;/h3&gt;

&lt;p&gt;If you skip the plan, Claude Code fills in the blanks with its own assumptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-specifying implementation
&lt;/h3&gt;

&lt;p&gt;Tell the agent what good looks like, what constraints matter, and what must not break. Do not micromanage every function unless there is a real reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviewing too late
&lt;/h3&gt;

&lt;p&gt;Do not wait until a giant session is "done" before looking. Review earlier on bigger tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Take
&lt;/h2&gt;

&lt;p&gt;The best Claude Code workflow is not complicated. It is disciplined.&lt;/p&gt;

&lt;p&gt;Persistent context, short plans, worktree isolation, and careful review beat almost every fancy trick. Once those pieces are in place, you can decide whether you want to stay in a terminal, live in tmux, or use a more visual workspace around the agent.&lt;/p&gt;

&lt;p&gt;Without that foundation, the tool choice barely matters.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Karl Wirth is the founder of Nimbalyst, a local visual workspace for building with Claude Code and Codex.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Best Practices for Coding with Agents</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:58:45 +0000</pubDate>
      <link>https://forem.com/stravukarl/best-practices-for-coding-with-agents-mho</link>
      <guid>https://forem.com/stravukarl/best-practices-for-coding-with-agents-mho</guid>
      <description>&lt;p&gt;You already know coding agents can write code. The interesting question is what happens when you stop thinking of them as code generators and start treating them as junior developers who need good specs, clear test criteria, and visual references — just like a human would.&lt;/p&gt;

&lt;p&gt;We build Nimbalyst this way every day. Here’s our workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a plan in markdown. Edit this. Iterate.&lt;/li&gt;
&lt;li&gt;Have the agent enrich it with architecture diagrams and data models. Edit this. Iterate.&lt;/li&gt;
&lt;li&gt;Iterate on mockups until the UI is right&lt;/li&gt;
&lt;li&gt;Have the agent write tests from the acceptance criteria. Edit this. Iterate.&lt;/li&gt;
&lt;li&gt;Tell it to implement until tests pass&lt;/li&gt;
&lt;li&gt;Walk away. Check in from your phone.&lt;/li&gt;
&lt;li&gt;Review the work. Suggest changes.&lt;/li&gt;
&lt;li&gt;Commit&lt;/li&gt;
&lt;li&gt;Update plan document, documentation, website&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step produces context that the next step consumes. By the time the agent starts writing code, it has the spec, the architecture diagram, the database schema, the mockup, and the test suite — all in one workspace, all visible to it. That’s why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plan First
&lt;/h2&gt;

&lt;p&gt;Every feature starts as a markdown file with YAML frontmatter: status, priority, owner, acceptance criteria. We type /plan and iterate on the document with the agent until the goals and implementation approach are solid.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Plan document with YAML frontmatter, status bar, and inline AI diff in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The plan isn’t a throwaway note. It tracks status as work progresses (draft -&amp;gt; in-development -&amp;gt; in-review -&amp;gt; completed), versions with git alongside the code, and serves as the single source of truth. When the agent later implements, it reads this document. When we review the work, we compare against it.&lt;/p&gt;

&lt;p&gt;The difference between this and a Jira ticket: the plan is a rich markdown document that the agent can actually parse and act on, not a text field nobody reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagrams and Data Models in the Same Workspace
&lt;/h2&gt;

&lt;p&gt;A text plan can only communicate so much. We ask the agent to add visual context:&lt;/p&gt;

&lt;p&gt;“Add an architecture diagram showing the WebSocket connection flow between the client, server, and notification service.”&lt;/p&gt;

&lt;p&gt;It creates an Excalidraw diagram in the workspace. We see it rendered, drag things around, tell the agent to adjust (“move the queue between the API and the notification service”), and iterate until the architecture is clear.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Excalidraw architecture diagram with AI chat sidebar in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For database work, we ask for a data model:&lt;/p&gt;

&lt;p&gt;“Create a data model for the notifications schema.”&lt;/p&gt;

&lt;p&gt;The agent generates a .datamodel file that renders as a visual ERD. Tables, foreign keys, field types, all editable. We review it, ask for changes, the agent updates it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Data model rendered as visual ERD with AI chat in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The critical thing: these artifacts live alongside the plan and the code. When the agent later implements, it doesn’t need us to re-explain the architecture or the schema. It reads them directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mockup, Annotate, Iterate
&lt;/h2&gt;

&lt;p&gt;For anything with a UI, we create mockups before touching code. The agent generates .mockup.html files — real HTML/CSS that renders live in the workspace.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: HTML mockup with before/after diff slider and AI chat in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We review visually. Use annotation tools to circle what needs changing. The agent sees the annotations, understands the spatial context, and regenerates. Three or four rounds and the mockup matches what’s in our heads.&lt;/p&gt;

&lt;p&gt;This replaces the Figma-to-engineering handoff entirely. The mockup is already in the workspace. When the agent implements the UI, it already knows what it should look like. No exporting, no describing screenshots in words, no “make it look like the design.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests Before Implementation
&lt;/h2&gt;

&lt;p&gt;Before writing any implementation code, we have the agent write tests. This is where the earlier context pays off.&lt;/p&gt;

&lt;p&gt;“Write Playwright E2E tests for the notification center based on the plan and mockup.”&lt;/p&gt;

&lt;p&gt;The agent reads the acceptance criteria from the plan, references the mockup for expected UI behavior, and generates test cases. We review them, add edge cases, and now we have an executable definition of “done.”&lt;/p&gt;
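&lt;p&gt;The output looks roughly like this (a sketch using Playwright’s API; the routes, test IDs, and copy are invented from the hypothetical plan):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { test, expect } from "@playwright/test";

// From an acceptance criterion in the plan:
// "Unread notifications show a badge count in the header."
test("shows unread badge count", async ({ page }) =&gt; {
  await page.goto("/app");
  await expect(page.getByTestId("notification-badge")).toHaveText("3");
});

// Edge case added during review, straight from the mockup's empty state.
test("shows empty state with no notifications", async ({ page }) =&gt; {
  await page.goto("/app?fixture=no-notifications");
  await page.getByTestId("notification-bell").click();
  await expect(page.getByText("You're all caught up")).toBeVisible();
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;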

&lt;p&gt;Every test fails. That’s the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement Until Green
&lt;/h2&gt;

&lt;p&gt;Now we tell the agent:&lt;/p&gt;

&lt;p&gt;“Implement the notification system. Run tests after each major change. Keep going until all tests pass.”&lt;/p&gt;

&lt;p&gt;The agent works iteratively. Implements the database migration from the data model. Runs tests — schema tests pass. Builds the WebSocket server. Runs tests — connection tests go green. Implements the frontend. Runs Playwright — catches a CSS issue from the screenshot, fixes it, reruns. Eventually: all green.&lt;/p&gt;

&lt;p&gt;This isn’t prompt-and-pray. The agent has the plan for architecture guidance, the data model for the schema, the mockup for the UI, and the test suite for verification. It loops through code-test-fix cycles autonomously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walk Away, Check In From Your Phone
&lt;/h2&gt;

&lt;p&gt;Once the plan is solid, the tests are reviewed, and the agent is pointed in the right direction, we don’t sit and watch. We go to lunch. Take a meeting. Go for a walk.&lt;/p&gt;

&lt;p&gt;The agent keeps working.&lt;/p&gt;

&lt;p&gt;When it finishes or needs input, we get a notification on our phones. The Nimbalyst mobile app shows session status, the full transcript of what the agent did, and file diffs. If the agent needs a decision, we tap our answer and it continues. If all tests pass, we review the changes from wherever we are.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Nimbalyst mobile app showing active agent sessions and status]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Nimbalyst mobile app showing file diff review on phone]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is not “set it and forget it.” We stay engaged. But the engagement happens on our terms — on the train, at the coffee shop, between meetings. The agent’s work doesn’t stall because we’re not at our desk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Review the Work
&lt;/h2&gt;

&lt;p&gt;When the agent finishes, we review. Nimbalyst shows every file the agent touched in a sidebar, with full diffs for each one. We click through the changes, see exactly what was added, modified, or removed, and compare it against the plan and mockup.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Diff review showing file changes with red/green inline diff in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn’t reading a pull request cold. We wrote the plan, reviewed the tests, and approved the mockup. The review is checking whether the agent followed through on decisions we already made. It usually did. When it didn’t, we tell it what to fix and it iterates.&lt;/p&gt;

&lt;p&gt;The files sidebar makes this fast. We see the full scope of changes at a glance — no scrolling through a massive diff. Click a file, review it, move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit
&lt;/h2&gt;

&lt;p&gt;Once the review looks good, we commit directly from the workspace. Nimbalyst has a built-in git commit flow — the agent proposes a commit message based on the changes, we review or edit it, and commit.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Git commit dialog with AI-proposed commit message in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No switching to a terminal. No copying file lists. The commit happens in context, right after the review, while everything is still fresh. The agent’s proposed message is usually accurate because it knows what it did and why — it read the plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  Update the Plan and Docs
&lt;/h2&gt;

&lt;p&gt;After committing, we close the loop. The plan document gets updated: status moves from in-development to completed, acceptance criteria get checked off, and any implementation notes get added for future reference.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Session kanban board showing work items across phases in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We also update documentation, CHANGELOG entries, and website content if the feature is user-facing. Because the agent has full context of what was built, it can draft these updates too. We review and merge.&lt;/p&gt;

&lt;p&gt;This step matters more than it seems. Without it, plans drift from reality, docs go stale, and the next person (or agent) working in the area starts from incomplete context. Closing the loop keeps the workspace honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Context Continuity Is the Real Unlock
&lt;/h2&gt;

&lt;p&gt;Coding agents are good at writing code. What limits them is context fragmentation.&lt;/p&gt;

&lt;p&gt;In a typical setup, the spec lives in Confluence, the mockup in Figma, the tasks in Jira, the tests in your IDE, and the agent runs in the terminal. The agent gets fragments through MCP calls or copy-pasted text. It’s reading a book one sentence at a time through an API.&lt;/p&gt;

&lt;p&gt;This workflow works because every artifact — plan, diagram, data model, mockup, test, code — lives in the same workspace and is directly readable by the agent. No handoffs. No context translation. No “let me describe what the Figma mockup looks like.”&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plans give the agent architectural direction it can actually follow&lt;/li&gt;
&lt;li&gt;Diagrams show how pieces connect without verbal explanation&lt;/li&gt;
&lt;li&gt;Data models define the exact schema to implement&lt;/li&gt;
&lt;li&gt;Mockups provide a pixel-accurate UI target&lt;/li&gt;
&lt;li&gt;Tests give a machine-verifiable definition of done&lt;/li&gt;
&lt;li&gt;One workspace means none of this is lost in translation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent isn’t smarter in this workflow. It just has everything it needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Plan. Enrich with visuals. Mockup the UI. Write tests. Implement until green. Review from anywhere. Commit. Close the loop.&lt;/p&gt;

&lt;p&gt;We ship features this way every day. The agent handles the mechanical iteration. We focus on the decisions that actually require a human: what to build, what the architecture should look like, whether the tests cover the right scenarios, and whether the final result is what we wanted.&lt;/p&gt;

&lt;p&gt;That’s how we build Nimbalyst, and it’s how Nimbalyst is designed to let you build too.&lt;/p&gt;

</description>
      <category>development</category>
      <category>coding</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Context Is the New Code</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:54:24 +0000</pubDate>
      <link>https://forem.com/stravukarl/context-is-the-new-code-50n5</link>
      <guid>https://forem.com/stravukarl/context-is-the-new-code-50n5</guid>
      <description>&lt;h2&gt;
  
  
  The Prompt Isn’t the Problem
&lt;/h2&gt;

&lt;p&gt;The prompt is the last mile. The tip of the iceberg. What actually determines the quality of AI-generated code is everything you give the agent before the prompt: your spec, your architecture diagram, your mockup, your data model, your existing codebase, your test suite, your acceptance criteria.&lt;/p&gt;

&lt;p&gt;That’s context. And context is the new code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence
&lt;/h2&gt;

&lt;p&gt;Consider two approaches to the same task. Same Claude Code model. Same feature request: “Build a team management settings page.” Same developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Prompt only.&lt;/strong&gt; The developer opens a Claude Code session and types: “Build a team management settings page where admins can invite members, assign roles, and remove people.”&lt;/p&gt;

&lt;p&gt;The result is a generic CRUD page. Hardcoded role values. No error handling for edge cases. No loading states. No confirmation dialogs. No empty state. Functional but shallow. The AI had to guess at every design decision, and it shows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 2: Context-rich session.&lt;/strong&gt; The developer writes the same prompt, but the Claude Code session also has access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 2-page spec with acceptance criteria, edge cases, and error states&lt;/li&gt;
&lt;li&gt;A visual wireframe showing the exact layout, including empty state and delete confirmation&lt;/li&gt;
&lt;li&gt;An entity diagram showing Team, TeamMember, and Invitation entities with relationships&lt;/li&gt;
&lt;li&gt;The existing codebase’s component library and style patterns&lt;/li&gt;
&lt;li&gt;Three relevant test files showing the project’s testing patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a production-quality page matching the existing design system. Proper role-based access control. Loading skeletons. Error boundaries. Confirmation dialogs. Empty state. Tests following project conventions. Practically shippable.&lt;/p&gt;

&lt;p&gt;The second approach doesn’t just produce marginally better code — it produces categorically different output. The AI stops guessing and starts executing. Same model, same prompt, radically different results. The difference is context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Context Beats Prompts
&lt;/h2&gt;

&lt;p&gt;A prompt tells the AI what to do. Context tells the AI what you know, what you’ve decided, and what good looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context provides constraints&lt;/strong&gt;&lt;br&gt;
“Build a settings page” has infinite solutions. A mockup showing the exact layout has one. Constraints improve output because they reduce the solution space to the right answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context provides patterns&lt;/strong&gt;&lt;br&gt;
When Claude Code can see your existing components, it follows your patterns. When it can see your test suite, it writes tests the same way. Without context, it invents patterns. With context, it matches yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context provides decisions&lt;/strong&gt;&lt;br&gt;
A spec with edge cases represents decisions you’ve already made. “When you remove the last admin, show an error” is a decision. Without that context, the AI either ignores the edge case or makes a different decision. Your context encodes your judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context provides standards&lt;/strong&gt;&lt;br&gt;
A data model with specific field names, types, and constraints is a standard. It eliminates the “what should I name this column?” decisions that produce inconsistent code when made ad hoc.&lt;/p&gt;
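&lt;p&gt;Even a few lines of schema eliminate whole categories of improvisation. A sketch (entities borrowed from the team-management example above; the exact shape is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// The data model decides names, types, and constraints once, so the
// agent never has to improvise a column name mid-implementation.
interface TeamMember {
  id: string;             // uuid, primary key
  teamId: string;         // fk -&gt; Team.id, indexed
  role: "admin" | "member";
  invitedAt: Date;
  removedAt: Date | null; // soft delete; never hard-delete members
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;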

&lt;h2&gt;
  
  
  The Craft of Context
&lt;/h2&gt;

&lt;p&gt;If context is the new code, then crafting context is the new programming.&lt;/p&gt;

&lt;p&gt;Here’s what good context looks like for a feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plan/Spec (markdown): What the feature does, who it’s for, acceptance criteria, edge cases, error states, non-requirements (what it explicitly doesn’t do)&lt;/li&gt;
&lt;li&gt;Architecture (Mermaid or Excalidraw diagram): How the feature fits into the existing system. Services, databases, APIs involved.&lt;/li&gt;
&lt;li&gt;Mockup (HTML/CSS): What the UI should look like. Layout, components, interactions, states (loading, error, empty).&lt;/li&gt;
&lt;li&gt;Data model (visual schema): Entities, relationships, fields, constraints, indexes.&lt;/li&gt;
&lt;li&gt;Existing code (codebase access): Component library, style conventions, test patterns, API conventions.&lt;/li&gt;
&lt;li&gt;Test cases (markdown): What success looks like. Expected behaviors, user flows, and validation criteria the AI can code against.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here’s the key: you’re not building all this context by hand. You’re building it with your AI agents. Claude Code helps you write the spec, generate the architecture diagram, scaffold the mockup, draft the data model. The AI is your co-author for context, not just for code. You bring the judgment and decisions; the agent helps you capture and structure them.&lt;/p&gt;

&lt;p&gt;Each piece of context eliminates a category of decisions the AI would otherwise make randomly (or badly). The more context, the less variance. The less variance, the higher quality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Image: Excalidraw architecture diagram providing visual context for AI agents in Nimbalyst]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Workspace
&lt;/h2&gt;

&lt;p&gt;This is why we built Nimbalyst as an integrated workspace rather than a standalone AI chat.&lt;/p&gt;

&lt;p&gt;If context matters most, then the tool should optimize for context creation and delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WYSIWYG markdown for writing specs that Claude Code can read and edit&lt;/li&gt;
&lt;li&gt;Mermaid and Excalidraw diagrams embedded in specs for architecture context&lt;/li&gt;
&lt;li&gt;MockupLM for visual mockups that Claude Code can see and reference&lt;/li&gt;
&lt;li&gt;Excalidraw for data model and architecture diagrams that Claude Code uses as a blueprint&lt;/li&gt;
&lt;li&gt;Session linking so context from previous sessions carries forward&lt;/li&gt;
&lt;li&gt;File-to-session tracking so the AI knows which sessions touched which files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything in one workspace. Everything visible to Claude Code. No copy-pasting context from one tool to another. No “let me describe what the mockup looks like.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Implication for Teams
&lt;/h2&gt;

&lt;p&gt;If context is the new code, then context quality is the new code quality.&lt;/p&gt;

&lt;p&gt;Code review becomes context review. “Did you provide enough context before asking the agent to code?” “Is the spec missing edge cases?” “Does the mockup match the design system?” “Is the data model normalized?”&lt;/p&gt;

&lt;p&gt;The best developers won’t be the best coders. They’ll be the best context crafters — people who can build complete, precise, well-structured context that produces excellent AI-generated code on the first try.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: This sounds like you’re just saying “write better specs.”&lt;/strong&gt;&lt;br&gt;
A: Partly, yes. But it’s more than specs. It’s the combination of specs + visual mockups + data models + existing code patterns all being visible to the AI simultaneously. That integration is the key — not any single piece of context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Doesn’t this make the PM role more important than the developer role?&lt;/strong&gt;&lt;br&gt;
A: It makes context creation more important than code writing. Whether that’s the PM’s job or the developer’s job depends on your team structure. Many developers write their own specs. The point is that whoever crafts the context determines the output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about complex code where context isn’t enough?&lt;/strong&gt;&lt;br&gt;
A: Context doesn’t replace engineering expertise. For performance optimization, distributed systems, complex algorithms — you need deep technical knowledge. Context gets you 80% of features. The other 20% still needs human engineering judgment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>development</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Future is Small Teams</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:13:26 +0000</pubDate>
      <link>https://forem.com/stravukarl/the-future-is-small-teams-29pp</link>
      <guid>https://forem.com/stravukarl/the-future-is-small-teams-29pp</guid>
<description>&lt;p&gt;We’re a small team building software for small teams. That’s not an accident; it’s our thesis.&lt;/p&gt;

&lt;p&gt;The future of building software, and honestly the future of building anything, belongs to small teams of 3 to 7 people. Not solo founders grinding alone. Not 500-person departments or 10,000-person companies with layers of coordination. Small, tight groups where everyone knows what everyone else is doing, where decisions happen in minutes instead of meetings, and where each person brings real leverage because they’re each commanding their own teams of AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Small Teams Win Now
&lt;/h2&gt;

&lt;p&gt;Small teams have always had advantages: speed, trust, low communication overhead. But those advantages used to hit a ceiling. A 4-person team simply couldn’t produce the output of a 40-person team, no matter how efficient they were.&lt;/p&gt;

&lt;p&gt;AI agents change that math completely.&lt;/p&gt;

&lt;p&gt;Today, a single person working with AI agents can do the work that used to require a team of 5-10 people. They can write code, generate documentation, analyze data, create marketing assets, and manage workflows, all with AI handling the execution while they provide direction and judgment.&lt;/p&gt;

&lt;p&gt;Multiply that by 3-5 people, and you have a unit that can match or exceed the output of teams many times their size. Each person becomes a force multiplier, not just a contributor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Coordination Cliff
&lt;/h2&gt;

&lt;p&gt;Here’s the thing that doesn’t scale: coordination.&lt;/p&gt;

&lt;p&gt;Brooks’s Law told us decades ago that adding people to a late software project makes it later. The underlying reason hasn’t changed — communication overhead grows quadratically with team size. With n people, you have n(n-1)/2 possible communication channels. At 5 people, that’s 10 channels. At 10 people, it’s 45. At 20, it’s 190.&lt;/p&gt;
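
&lt;p&gt;As a quick sketch of that growth (just the formula above, nothing more):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Possible communication channels among n people: n(n-1)/2.
// Growth is quadratic, which is why small teams stay coherent.
const channels = (n: number): number =&gt; (n * (n - 1)) / 2;

for (const n of [5, 10, 20, 40]) {
  console.log(`${n} people -&gt; ${channels(n)} channels`);
}
// 5 -&gt; 10, 10 -&gt; 45, 20 -&gt; 190, 40 -&gt; 780
&lt;/code&gt;&lt;/pre&gt;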

&lt;p&gt;AI agents amplify this problem in a new way. When each person is running multiple agent sessions, each producing code, documents, and decisions, the surface area of what needs to be coordinated explodes. It’s not just keeping 5 people aligned — it’s keeping 5 people and their dozens of concurrent workstreams aligned.&lt;/p&gt;

&lt;p&gt;Small teams navigate this naturally. At 3-5 people, you can maintain a shared mental model of the entire project. You know what everyone is working on. You can make decisions in a quick conversation rather than a formal process. You can review each other’s AI-generated output without it becoming a full-time job.&lt;/p&gt;

&lt;p&gt;Beyond that size, the coordination cost starts eating the productivity gains from AI. You need project managers, alignment meetings, documentation standards, approval workflows — all the overhead that slows large organizations down. The AI makes each individual more productive, but the organization can’t absorb that productivity because it gets lost in coordination.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Small Teams Actually Need
&lt;/h2&gt;

&lt;p&gt;If small teams are the future, what do they need from their tools?&lt;/p&gt;

&lt;p&gt;Individual leverage. Each person needs to be maximally effective on their own. That means deep AI integration — not a chatbot in a sidebar, but an AI agent that can read your codebase, execute tasks, create artifacts, and maintain context across long working sessions. The individual unit of work needs to be powerful.&lt;/p&gt;

&lt;p&gt;Shared context without overhead. The team needs to stay aligned without status meetings and Slack threads. That means shared workspaces where you can see what others are working on, what their AI agents have produced, and what decisions have been made — all without anyone having to write a status update.&lt;/p&gt;

&lt;p&gt;Lightweight collaboration. When you do need to coordinate, it should be fast and direct. Review someone’s AI-generated code. Comment on a plan document. See what changed in the last session. Not through a heavyweight process, but through tools that make collaboration feel as natural as working alone.&lt;/p&gt;

&lt;p&gt;Unified workspace. Small teams can’t afford to juggle 10 different tools. They need their documents, code, diagrams, plans, and conversations in one place — not because “all-in-one” is a feature, but because context switching is the enemy of the tight feedback loops that make small teams fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’re Building
&lt;/h2&gt;

&lt;p&gt;This is exactly what we’re building with Nimbalyst.&lt;/p&gt;

&lt;p&gt;For individual work, Nimbalyst gives you an AI-native workspace where Claude Code agents run directly in your editor. You can write code, create documents, draw diagrams, manage plans, and execute complex multi-step tasks all with an AI agent that has full context on your project. You’re not copy-pasting between a chat window and your tools. The agent works where you work.&lt;/p&gt;

&lt;p&gt;For collaboration, we’re building the connective tissue that keeps small teams aligned. Shared workspaces where your teammates’ work is visible. Session histories that let you understand what happened while you were away. The goal is to give small teams the shared awareness that large organizations try to achieve with meetings and managers — but without the overhead.&lt;/p&gt;

&lt;p&gt;We use Nimbalyst to build Nimbalyst. We’re a small team. We run agent sessions for everything from feature development to marketing content to bug fixes. And every day we learn what works and what’s missing when a handful of people are trying to move fast with AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bet
&lt;/h2&gt;

&lt;p&gt;Our bet is simple: the teams that will build the best products over the next decade will be small. Not because small is trendy, but because the economics have fundamentally shifted. AI agents give individuals leverage that used to require headcount. And the coordination costs of large teams haven’t gotten any cheaper — if anything, they’ve gotten worse as the pace of AI-assisted work has accelerated.&lt;/p&gt;

&lt;p&gt;3-7 people, each with their own teams of agents, sharing a workspace, moving fast, staying aligned. That’s the unit we’re building for. That’s the future we see.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developer</category>
      <category>workplace</category>
    </item>
    <item>
      <title>No More Marketing Bottleneck: How We Automated the Software-to-Marketing Pipeline</title>
      <dc:creator>Karl Wirth</dc:creator>
      <pubDate>Mon, 02 Mar 2026 15:34:08 +0000</pubDate>
      <link>https://forem.com/stravukarl/no-more-marketing-bottleneck-how-we-automated-the-software-to-marketing-pipeline-441f</link>
      <guid>https://forem.com/stravukarl/no-more-marketing-bottleneck-how-we-automated-the-software-to-marketing-pipeline-441f</guid>
      <description>&lt;p&gt;Most companies have a gap between shipping software and telling the world about it. A feature is built. Then someone writes release notes. Someone else updates the docs. Someone asks for screenshots. A marketer rewrites the changelog for the website. A designer records a demo video. Weeks pass. The marketing site still describes the old version.  &lt;/p&gt;

&lt;p&gt;This problem might have been tolerable pre-AI, but at the speed at which we now release features with coding agents, it is unacceptable.&lt;/p&gt;

&lt;p&gt;We closed that gap. At Nimbalyst, when we release a feature, we run a coding-agent-powered pipeline to update our documentation, sync it to GitBook, generate our website content for Cloudflare, cut product videos and screenshots with Playwright, and update the website with those videos.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Every Feature Ships Twice
&lt;/h2&gt;

&lt;p&gt;Building a feature is one job. Telling people about it is a second job that's just as much work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Release notes&lt;/strong&gt; need to be written and formatted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; needs new pages or updated sections with accompanying videos and screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website copy&lt;/strong&gt; needs to reflect the new capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots&lt;/strong&gt; need to be captured in both light and dark mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product videos&lt;/strong&gt; need to show the feature in action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changelog entries&lt;/strong&gt; need to be added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In most teams, each of these is a manual handoff. Developer finishes the feature, writes a Slack message, product manager drafts docs, marketer updates the website, designer records a video. The chain is slow, lossy, and nobody owns the whole thing.&lt;/p&gt;

&lt;p&gt;We decided the whole thing should be a coding-agent-powered pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Automated Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sp4nnx210x98fii866g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sp4nnx210x98fii866g.png" alt="Feature to Marketing Pipeline" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what happens when we ship a feature for Nimbalyst:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Feature ships → Content gets written
&lt;/h3&gt;

&lt;p&gt;When a feature is released, we ask our coding agent in Nimbalyst (we use both Claude Code and Codex) to write a short markdown document describing the feature for our website and our documentation. It leverages the feature plan document, the code itself, and the release notes to do so. We edit what it wrote and iterate on it with the coding agent.&lt;/p&gt;

&lt;p&gt;When we are satisfied with the result, we instruct the coding agent to update the documentation in GitHub, update the relevant YAML data files that drive our website copy, and write a blog post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lm2gb26f3rdv1453c6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lm2gb26f3rdv1453c6h.png" alt="Markdown" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Content → YAML Files → Website deploy → Cloudflare Pages
&lt;/h3&gt;

&lt;p&gt;Our website copy lives in YAML files in a git repo, not a CMS. That means Claude Code can edit marketing content the same way it edits source code. There's no proprietary editor to navigate, no API to call, no format to translate. A feature description becomes website copy in the same git repo, in the same session, reviewed with the same diff tools.  &lt;/p&gt;
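
&lt;p&gt;As a rough sketch of why that works (the file path and field names below are hypothetical, not our actual schema), the site build can read the copy like any other source file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch: loading YAML-driven website copy at build time.
// The file path and field names are illustrative, not our real schema.
import { readFileSync } from "node:fs";
import { parse } from "yaml"; // the "yaml" npm package

interface FeatureCopy {
  name: string;
  headline: string;
  description: string;
}

const features = parse(
  readFileSync("src/data/features.yaml", "utf8"),
) as FeatureCopy[];

for (const f of features) {
  console.log(`${f.name}: ${f.headline}`);
}
&lt;/code&gt;&lt;/pre&gt;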

&lt;p&gt;When needed, we open the changed YAML files in Nimbalyst, either to spot-check or to make manual edits that are easier to do by hand than to describe. If we don't like what the coding agent did, we ask for a rework or an addition.&lt;/p&gt;

&lt;p&gt;We run our staging site locally with &lt;code&gt;npm run dev&lt;/code&gt; and can review the updated page there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj9vdwf99l85tx5ze2u3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj9vdwf99l85tx5ze2u3.png" alt="YAML" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our marketing site is Astro on Cloudflare Pages. Git push to &lt;code&gt;main&lt;/code&gt; and it deploys. Updated YAML copy, new blog posts, revised feature descriptions — they all go live with a git push. You can see the result at &lt;a href="https://nimbalyst.com" rel="noopener noreferrer"&gt;nimbalyst.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As noted, we follow this same process for our blog, though we have a dedicated &lt;code&gt;/blog&lt;/code&gt; command for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Docs update → GitBook syncs
&lt;/h3&gt;

&lt;p&gt;Our documentation lives in markdown files that sync to GitBook. When the coding agent writes or updates a doc, it is again just a markdown file in a git repo. Push to &lt;code&gt;main&lt;/code&gt;, and GitBook picks it up. No manual copy-paste into a docs platform, no separate editing workflow.&lt;/p&gt;

&lt;p&gt;Again, we review the changes to the markdown file in Nimbalyst and iterate on it with the coding agent, for example asking for a Mermaid diagram or an Excalidraw sketch to illustrate a key point.&lt;/p&gt;

&lt;p&gt;Our coding agents write docs that are accurate because they have the full context — they can read the source code and the plan, understand the implementation, and write documentation that matches what the feature actually does, not what someone remembers it doing. See the finished product at &lt;a href="https://docs.nimbalyst.com" rel="noopener noreferrer"&gt;docs.nimbalyst.com&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Text → Screenshots and Videos → Live on our Docs and Website
&lt;/h3&gt;

&lt;p&gt;So far so good, but text is the easy part. What about the images and videos needed to explain the feature? We use Playwright to capture product videos and screenshots directly from our running application. Not screen recordings where someone clicks through a demo — automated, scripted captures that show exactly the workflow we want, every time.&lt;/p&gt;

&lt;p&gt;Our YAML data files include structured &lt;code&gt;screenshot&lt;/code&gt; comment blocks that describe what each image or video should show (editable by human, updatable by agent).  Playwright reads these specs, sets up the application state, and captures the assets. Light mode and dark mode variants. Specific crop regions — full window, editor pane, sidebar, toolbar, zoom.  &lt;/p&gt;
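
&lt;p&gt;The shape of one of those spec blocks looks roughly like this (purely illustrative, not our exact format):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical shape of one screenshot/video spec entry, mirroring
// the comment blocks described above. Field names are illustrative.
interface CaptureSpec {
  id: string;                       // e.g. "new-session"
  kind: "screenshot" | "video";
  description: string;              // what the asset should show
  themes: Array&lt;"light" | "dark"&gt;;  // variants to capture
  crop: "full" | "editor" | "sidebar" | "toolbar" | "zoom";
}
&lt;/code&gt;&lt;/pre&gt;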

&lt;p&gt;When the product changes, we re-run the Playwright pipeline and get updated visuals that match the current UI. No re-record. No "the screenshots are from three versions ago" problem.&lt;/p&gt;

&lt;p&gt;Each Playwright spec choreographs a complete sequence: launch the app, set up realistic data, navigate to the right screen, trigger AI interactions, and capture at the exact moment the UI tells the story. The specs capture both light and dark theme variants automatically, and video specs include a DOM-injected cursor that shows natural mouse movement — making the footage look like a real person using the product, not a robotic test run.&lt;/p&gt;
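
&lt;p&gt;To give a feel for what a spec involves, here is a stripped-down sketch of a capture (the URL, selector, and output paths are illustrative, not our actual pipeline code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of a Playwright capture spec. The URL, selector,
// and output paths are illustrative, not our real pipeline.
import { chromium } from "playwright";

async function captureFeature(theme: "light" | "dark"): Promise&lt;void&gt; {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    colorScheme: theme, // drives the light/dark variants
    viewport: { width: 1440, height: 900 },
  });
  const page = await context.newPage();

  // Set up state and navigate to the screen that tells the story.
  await page.goto("http://localhost:3000"); // assumed local dev URL
  await page.getByRole("button", { name: "New Session" }).click();

  // Capture a specific crop region rather than the full window.
  await page.screenshot({
    path: `assets/new-session-${theme}.png`,
    clip: { x: 0, y: 0, width: 1440, height: 900 },
  });

  await browser.close();
}

(async () =&gt; {
  for (const theme of ["light", "dark"] as const) {
    await captureFeature(theme);
  }
})();
&lt;/code&gt;&lt;/pre&gt;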

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwy7mpuedhdbmbl62962.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwy7mpuedhdbmbl62962.png" alt="Playwright Spec" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The screenshots and videos follow the same pipeline as the text content. They are stored in git and pushed to the website and/or the documentation site, or included as part of a blog post. We use the same screenshot and video creation process to make assets for social when we need them.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Everything is connected and reviewable
&lt;/h3&gt;

&lt;p&gt;Every piece of this pipeline produces artifacts in a git repo: changelog entries, documentation, website copy, screenshot specs, images, videos. We can have our coding agent read, leverage, and update every aspect of the pipeline and website. And we humans can do the same: reviewing and approving the agent's changes with red/green diffs, and editing and updating everything in markdown, code files, HTML mockups, and sessions. We see exactly what changed, can revert anything, and have a complete history of every marketing asset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp3t8ryjxsemppjlbypv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp3t8ryjxsemppjlbypv.png" alt="Features YAML" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speed compounds
&lt;/h3&gt;

&lt;p&gt;A single feature update might touch five different marketing surfaces. If each takes an hour of manual work, that's half a day per feature. Ship ten features a month and you've lost a full week to marketing maintenance. Ship ten features every few days and you simply cannot keep up manually.&lt;/p&gt;

&lt;p&gt;Automated, the same updates take minutes. The content is drafted, reviewed as a diff, and deployed in one session. &lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy by default
&lt;/h3&gt;

&lt;p&gt;When Claude Code or Codex writes docs and marketing copy, it is reading the actual source code, plan files, specs, and diagrams. The feature description on the website matches what the feature does because they're derived from the same context. No telephone game between developer, PM, and marketer where details get lost or simplified incorrectly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visuals stay current
&lt;/h3&gt;

&lt;p&gt;The Playwright pipeline means our screenshots and videos always show the current product. Every time the UI changes, we regenerate assets. The marketing site never shows an outdated interface — a problem that plagues every fast-moving product.&lt;/p&gt;

&lt;h3&gt;
  
  
  One repo, one workflow
&lt;/h3&gt;

&lt;p&gt;Documentation, marketing copy, blog posts, changelog entries, screenshot specs — they all live in git repos. We use the same tools for all of them: Nimbalyst for editing and review, Claude Code for drafting and updates, git for version control, Cloudflare and GitBook for deployment.&lt;/p&gt;

&lt;p&gt;There's no context-switching between a CMS, a docs platform, a design tool, and a video editor. It's all code and content in the same workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Nimbalyst Difference
&lt;/h2&gt;

&lt;p&gt;You could build this pipeline with any coding agent in a terminal. But the full loop works better in a visual workspace like Nimbalyst:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual review matters.&lt;/strong&gt; When Claude Code or Codex updates website copy, we see the rendered result in the same workspace. When it writes a blog post, we see formatted markdown. When Playwright captures a screenshot, we can view it immediately. And we don't just see and review these files: we can edit them visually in Nimbalyst and collaborate iteratively with the coding agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgbwowmyxpywyjj27m9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgbwowmyxpywyjj27m9v.png" alt="Red Green Diff Review" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel sessions keep the pipeline moving.&lt;/strong&gt; We run Codex and Claude Code sessions in parallel. Nimbalyst manages these sessions so we can coordinate across them without losing context. You can easily see which files each session changed and open and edit them, check the status of each session, find and resume sessions, and commit from a session with AI assist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeh93nii7lx1zblvqad7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeh93nii7lx1zblvqad7.png" alt="Parallel Sessions" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diffs are the review mechanism.&lt;/strong&gt; Every change Claude Code makes, whether it's source code, YAML copy, or markdown docs, shows up as a reviewable diff. We approve marketing changes with the same rigor we approve code changes. That's only practical in a workspace designed for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Treat marketing content as code.&lt;/strong&gt; We moved our website copy into YAML files, our website into GitHub with deploys to Cloudflare, and our docs into markdown, which made it possible to automate the entire marketing surface. If your content is trapped in a proprietary SaaS editor or CMS, your coding agent will have a harder time with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in structured specs.&lt;/strong&gt; Our screenshot comment blocks in YAML files seem like overhead but they let Playwright regenerate every marketing visual automatically. The upfront structure pays for itself every time the product changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The loop matters more than any single step.&lt;/strong&gt; Automating docs alone is useful. Automating website deploys alone is useful. But automating the full loop, from feature to docs to website to videos, is qualitatively different. Nothing falls through the cracks because there are no cracks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ship features and ship the story simultaneously.&lt;/strong&gt; When the marketing pipeline is automated, you don't have a backlog of "features we shipped but haven't told anyone about." The story ships with the feature. Your website is always current. Your docs are always accurate. Your videos always show the real product.&lt;/p&gt;

&lt;p&gt;We went from a world where marketing was a separate project that lagged behind engineering to one where they're using the same tools at the same speed. See the finished product on our &lt;a href="https://nimbalyst.com" rel="noopener noreferrer"&gt;website&lt;/a&gt; and &lt;a href="https://docs.nimbalyst.com" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tooling</category>
      <category>agents</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
