<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shane Wilkey</title>
    <description>The latest articles on Forem by Shane Wilkey (@southwestmogrown).</description>
    <link>https://forem.com/southwestmogrown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F622425%2F51a9cb88-1c7d-4a26-b01e-b07ffd8b5028.jpeg</url>
      <title>Forem: Shane Wilkey</title>
      <link>https://forem.com/southwestmogrown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/southwestmogrown"/>
    <language>en</language>
    <item>
      <title>Your RAG Chatbot Lies Because You're Chunking Wrong</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Sun, 12 Apr 2026 03:38:27 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/your-rag-chatbot-lies-because-youre-chunking-wrong-9ii</link>
      <guid>https://forem.com/southwestmogrown/your-rag-chatbot-lies-because-youre-chunking-wrong-9ii</guid>
      <description>&lt;p&gt;Three weeks into building FolioChat — a chatbot that lets portfolio visitors talk to my GitHub — I had a system that gave answers like a drunk Wikipedia editor.&lt;/p&gt;

&lt;p&gt;"Tell me about Shane's React work," someone would ask. It would respond with a rambling paragraph about my Python scripts, a random commit message, and somehow my college graduation year. The context was drifting so badly I was getting ready to scrap the whole thing.&lt;/p&gt;

&lt;p&gt;The fix turned out to be a single architectural decision that most RAG tutorials skip entirely: chunking by meaning instead of by size.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: You're Splitting Text, Not Meaning
&lt;/h2&gt;

&lt;p&gt;RAG — Retrieval Augmented Generation — sounds simple in tutorials. Chunk your documents, embed them, store them in a vector database, retrieve relevant chunks, answer questions.&lt;/p&gt;

&lt;p&gt;Most RAG implementations break at step one: chunking. Token-based splitting, the default in every tutorial, treats your content like a deck of cards to be shuffled. It doesn't care if it splits a function definition from its docstring, or cuts a project description in half. Here's what I started with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Seemed reasonable. It was not.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simple_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When someone asked "What Python projects has Shane built?", the system would retrieve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Half of a README from a React project&lt;/li&gt;
&lt;li&gt;The middle of a commit message about fixing tests&lt;/li&gt;
&lt;li&gt;A random slice of documentation that mentioned Python once&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Token-based splitting fragments meaning across chunks. The model never sees a complete thought — just shards of context that fit a size budget. So it synthesizes these fragments into something grammatically correct and factually useless.&lt;/p&gt;
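&lt;p&gt;You can watch this failure happen in miniature with the toy splitter above (the README text here is illustrative, not from a real repo):&lt;/p&gt;

```python
def simple_chunk(text, chunk_size=500):
    # Same naive splitter as above: fixed-size windows of whitespace tokens.
    tokens = text.split()
    for i in range(0, len(tokens), chunk_size):
        yield " ".join(tokens[i:i + chunk_size])

readme = ("FolioChat is a RAG chatbot for portfolios. "
          "Built with FastAPI and ChromaDB. Deployed on Railway.")

chunks = list(simple_chunk(readme, chunk_size=4))
# chunks[0] == 'FolioChat is a RAG'
# chunks[1] == 'chatbot for portfolios. Built'
# The first two chunks each cut a sentence in half; neither
# carries a complete thought for the retriever to match on.
```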

&lt;h2&gt;
  
  
  The Fix: Chunk by Meaning, Not by Size
&lt;/h2&gt;

&lt;p&gt;I almost killed the project before a frustrated Claude Desktop session saved it. The suggestion was simple: chunk by &lt;em&gt;meaning&lt;/em&gt; instead of by &lt;em&gt;size&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;GitHub repositories have natural semantic boundaries. A README has sections. Projects have distinct characteristics. Code has logical groupings. The data has shape — I was just ignoring it.&lt;/p&gt;

&lt;p&gt;The real &lt;code&gt;Chunker&lt;/code&gt; in FolioChat uses a plain dataclass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# identity | project_overview | project_detail | project_tech | project_story
&lt;/span&gt;    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__post_init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five chunk types, each serving a specific retrieval purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;identity&lt;/code&gt; — one chunk per portfolio, who this developer is at the highest level&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project_overview&lt;/code&gt; — elevator pitch per repo, answers "tell me about X"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project_tech&lt;/code&gt; — pure stack signal per repo, answers "do you know PostgreSQL?"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project_story&lt;/code&gt; — the why, built from README intro and commit history&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project_detail&lt;/code&gt; — one chunk per README section, architecture stays with architecture&lt;/li&gt;
&lt;/ul&gt;
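&lt;p&gt;As a rough illustration, the body of a &lt;code&gt;project_tech&lt;/code&gt; chunk might be assembled like this. This is a hypothetical helper, not FolioChat's actual code; the field names are assumptions:&lt;/p&gt;

```python
def tech_chunk_content(repo: dict, username: str) -> str:
    # Hypothetical sketch of a _tech_chunk body: pure stack signal,
    # nothing narrative, so "do you know X?" questions match cleanly.
    languages = ", ".join(repo.get("languages", []))
    topics = ", ".join(repo.get("topics", []))
    return (f"{username}'s project {repo['name']} is built with: {languages}. "
            f"Related topics: {topics}.")

repo = {"name": "foliochat",
        "languages": ["Python", "TypeScript"],
        "topics": ["rag", "chromadb"]}
tech_chunk_content(repo, "southwestmogrown")
```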

&lt;p&gt;The &lt;code&gt;Chunker&lt;/code&gt; class builds all of these from a single &lt;code&gt;portfolio_data&lt;/code&gt; dict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chunker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_identity_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_overview_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_tech_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_story_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_detail_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;portfolio_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When someone asks "What technologies does Shane use?", retrieval targets &lt;code&gt;project_tech&lt;/code&gt; chunks instead of gambling on arbitrary text fragments. The question shape matches the chunk shape.&lt;/p&gt;
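&lt;p&gt;Matching question shape to chunk shape can start as something as crude as a keyword heuristic. This is a toy sketch, not FolioChat's retrieval logic; in practice embedding similarity does most of the work, and a metadata filter on chunk type just narrows the candidates:&lt;/p&gt;

```python
def route_query(question: str) -> str:
    # Toy router: map a visitor question to the chunk type most
    # likely to answer it. Purely illustrative keyword matching.
    q = question.lower()
    if any(w in q for w in ("stack", "technolog", "language", "framework")):
        return "project_tech"
    if any(w in q for w in ("why", "story", "history")):
        return "project_story"
    return "project_overview"

route_query("What technologies does Shane use?")   # 'project_tech'
```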

&lt;h2&gt;
  
  
  Three Embedding Backends, One Interface
&lt;/h2&gt;

&lt;p&gt;Semantic chunking is the hard part. Embedding, by comparison, is a config choice. FolioChat supports three backends — local (free, no API key), OpenAI, and Voyage — all behind a common interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseEmbedder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;MODEL_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;convert_to_numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local embedder runs entirely on your machine — 384-dimensional vectors, no API key, no cost. Swap backends with one config change — the rest of your code doesn't know the difference.&lt;/p&gt;
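&lt;p&gt;The shared interface itself can be tiny. Here is a sketch of what a &lt;code&gt;BaseEmbedder&lt;/code&gt; base class might look like (an assumed shape, not FolioChat's exact definition):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class BaseEmbedder(ABC):
    """Minimal contract every embedding backend implements."""

    @abstractmethod
    def embed(self, texts: list) -> list:
        """Embed a batch of documents; returns one vector per text."""

    def embed_query(self, text: str) -> list:
        # Default: queries embed the same way documents do. API backends
        # can override this (some providers distinguish query vs document).
        return self.embed([text])[0]

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Vector width, needed when creating the vector store collection."""
```

A trivial fake subclass is enough to exercise the contract in tests, which is another payoff of keeping the interface this small.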

&lt;h2&gt;
  
  
  The ChromaDB Gotcha Everyone Hits
&lt;/h2&gt;

&lt;p&gt;ChromaDB only accepts scalar metadata values (strings, numbers, booleans). That sounds fine until you try to store lists or datetimes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This breaks ChromaDB
&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;languages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JavaScript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# lists don't serialize
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;            &lt;span class="c1"&gt;# datetimes don't serialize
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# This works
&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;languages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JavaScript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FolioChat serializes lists to JSON strings before storing and deserializes on the way out. Annoying the first time, invisible after that.&lt;/p&gt;
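&lt;p&gt;The shim is about a dozen lines in each direction. A sketch of the approach, assuming lists, dicts, and datetimes are the only offenders you need to handle:&lt;/p&gt;

```python
import json
from datetime import datetime

def sanitize_metadata(meta: dict) -> dict:
    # Flatten values ChromaDB can't store as-is.
    out = {}
    for key, value in meta.items():
        if isinstance(value, (list, dict)):
            out[key] = json.dumps(value)
        elif isinstance(value, datetime):
            out[key] = value.isoformat()
        else:
            out[key] = value
    return out

def restore_metadata(meta: dict) -> dict:
    # Best-effort inverse: parse anything that looks like JSON.
    out = {}
    for key, value in meta.items():
        if isinstance(value, str) and value[:1] in ("[", "{"):
            try:
                out[key] = json.loads(value)
            except json.JSONDecodeError:
                out[key] = value
        else:
            out[key] = value
    return out
```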

&lt;h2&gt;
  
  
  Before and After: What Semantic Chunking Actually Fixed
&lt;/h2&gt;

&lt;p&gt;The drunk Wikipedia editor became a chatbot that gives accurate, specific answers. The difference between "Shane has worked on various projects involving different technologies over time" and "Shane's FolioChat project is a RAG pipeline using FastAPI, ChromaDB, and sentence-transformers, deployed on Railway" is the RAG chunking strategy — nothing else.&lt;/p&gt;

&lt;p&gt;The full implementation — chunker, embedder, ChromaDB layer, FastAPI server, and React widget — is &lt;a href="https://github.com/southwestmogrown/foliochat" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Install it, point it at your GitHub username, and have a working portfolio chatbot in about five minutes.&lt;/p&gt;

&lt;p&gt;Next week: the complete FolioChat architecture walkthrough, and why most developer portfolios are effectively invisible to the people looking at them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: rag, python, chromadb, embeddings, llm&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Series: Building With AI Agents — Article 5 of 12&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>From GitHub Issue to Merged PR: My Complete AI-Powered Development Workflow</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Thu, 09 Apr 2026 04:04:25 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/from-github-issue-to-merged-pr-my-complete-ai-powered-development-workflow-3jo</link>
      <guid>https://forem.com/southwestmogrown/from-github-issue-to-merged-pr-my-complete-ai-powered-development-workflow-3jo</guid>
      <description>&lt;p&gt;A few months ago, I was burning through an embarrassing amount of Claude tokens monthly while shipping maybe two meaningful features. The wake-up call came during a debugging session where I spent the better part of an hour having Claude explain why the same function worked differently in different contexts—only to discover I'd been feeding it outdated code samples from three iterations ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost of Wing-It Development
&lt;/h2&gt;

&lt;p&gt;Here's what broken AI development looks like: you start a conversation with Claude about adding user authentication, thirty messages later you're discussing database migrations, then somehow you're refactoring your entire error handling system. Each context switch burns tokens, and worse, the AI starts hallucinating solutions based on code it thinks exists but doesn't.&lt;/p&gt;

&lt;p&gt;I was treating Claude like Stack Overflow with a conversation interface—throwing problems at it reactively instead of systematically. The result? Half-implemented features, context drift that made debugging impossible, and token bills that made me question whether AI development was worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planning Is Non-Negotiable
&lt;/h2&gt;

&lt;p&gt;The breakthrough came when I stopped coding first and started planning first. Every feature now begins with a structured conversation where Claude and I define three things before touching any code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context boundaries&lt;/strong&gt; — exactly what codebase state we're working from, what's been changed recently, and what assumptions we can make. No "you know that auth system we built" — explicit file references and current state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool constraints&lt;/strong&gt; — which APIs we're using, what our deployment pipeline expects, and what our testing setup can handle. Claude needs to know it's writing for GitHub Actions, not generic CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success criteria&lt;/strong&gt; — what "done" looks like, not just functionally but how we'll know the code is maintainable six months from now.&lt;/p&gt;

&lt;p&gt;This planning phase costs relatively little but saves a lot downstream. Claude generates much better code when it knows exactly what problem it's solving within defined constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Everything, Start Fresh Sessions
&lt;/h2&gt;

&lt;p&gt;The temptation is to keep one long conversation going because it feels efficient. It's not. After enough back-and-forth, Claude starts making assumptions based on earlier context that may no longer be accurate. I learned to recognize the warning signs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude references code I never shared in the current session&lt;/li&gt;
&lt;li&gt;Solutions become overly complex for simple problems&lt;/li&gt;
&lt;li&gt;The AI starts suggesting fixes for problems that don't exist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of fighting context drift, I document everything mid-conversation. When planning a feature, I have Claude generate a structured project brief that includes the technical approach, file structure, and key implementation notes. Then I save that brief and start a fresh session when it's time to code.&lt;/p&gt;

&lt;p&gt;The brief becomes the handoff document between sessions. New Claude session gets the brief, current codebase state, and the specific task. No context drift, no hallucinated solutions.&lt;/p&gt;
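&lt;p&gt;The brief doesn't need to be elaborate. The skeleton looks roughly like this (section names illustrative):&lt;/p&gt;

```markdown
## Project Brief: {feature name}
**Current state:** files touched recently, known constraints
**Technical approach:** the design agreed on in the planning session
**File structure:** modules to create or modify
**Implementation notes:** gotchas, API quirks, test expectations
**Task for this session:** one issue, with its acceptance criteria
```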

&lt;h2&gt;
  
  
  Adversarial Code Review Catches Complacency
&lt;/h2&gt;

&lt;p&gt;Here's something I discovered by accident: Claude gets comfortable with its own code. If I ask it to review code it just wrote, it tends to approve its own patterns even when they're problematic. The AI develops blind spots around its own work.&lt;/p&gt;

&lt;p&gt;The solution is adversarial review using different models. After Claude writes a feature, I paste the code into a conversation with a different model and ask for a critical review without mentioning Claude wrote it. Different models catch different issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Claude wrote this authentication decorator
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;require_auth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorated_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/login&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorated_function&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude reviewed its own code: "Looks good, handles the authentication check properly."&lt;/p&gt;

&lt;p&gt;The adversarial review: "This loses the function metadata and doesn't handle POST data properly on redirect. Also no CSRF protection."&lt;/p&gt;

&lt;p&gt;Both criticisms were valid. Claude had gotten comfortable with a pattern that worked but wasn't robust. Cross-model review catches these blind spots before they become production issues.&lt;/p&gt;
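&lt;p&gt;The metadata half of that critique has a one-line fix: &lt;code&gt;functools.wraps&lt;/code&gt;. A framework-agnostic sketch, with the session check elided since it stays the same as above:&lt;/p&gt;

```python
from functools import wraps

def require_auth(f):
    @wraps(f)  # preserves __name__, __doc__, and other function metadata
    def decorated_function(*args, **kwargs):
        # session check and redirect elided; same logic as above
        return f(*args, **kwargs)
    return decorated_function

@require_auth
def dashboard():
    """Render the user dashboard."""

dashboard.__name__   # 'dashboard', not 'decorated_function'
```

Without `@wraps`, every decorated view reports the same name, which breaks route debugging, introspection, and some test tooling.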

&lt;h2&gt;
  
  
  The Complete Workflow in Action
&lt;/h2&gt;

&lt;p&gt;Here's how a typical feature development cycle works now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brainstorming phase&lt;/strong&gt; — I describe the feature need to Claude conversationally. We explore approaches, discuss tradeoffs, and identify the core complexity. This generates understanding, not code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning phase&lt;/strong&gt; — Claude creates a detailed project plan with milestones, breaking the feature into discrete GitHub issues. Each issue gets acceptance criteria, technical notes, and priority ranking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Issue: Add password reset functionality&lt;/span&gt;
&lt;span class="gs"&gt;**Priority:**&lt;/span&gt; P1
&lt;span class="gs"&gt;**Acceptance Criteria:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; User can request password reset via email
&lt;span class="p"&gt;-&lt;/span&gt; Reset links expire after 24 hours  
&lt;span class="p"&gt;-&lt;/span&gt; Process works with existing auth system
&lt;span class="gs"&gt;**Technical Notes:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Extend User model with reset_token field
&lt;span class="p"&gt;-&lt;/span&gt; Add email service integration
&lt;span class="p"&gt;-&lt;/span&gt; Update auth middleware to handle reset flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Project setup&lt;/strong&gt; — Claude generates GitHub Projects configuration, complete with automation rules. I can literally copy-paste its output into GitHub's project setup interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development&lt;/strong&gt; — Each issue becomes its own focused Claude session using GitHub Codespaces. Claude gets the project brief, current codebase, and single issue context. Write code, test locally, commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review&lt;/strong&gt; — Different AI model reviews the code adversarially before I create the pull request.&lt;/p&gt;

&lt;p&gt;The workflow scales because each step is self-contained and documented. I can pick up development from my phone using Codespaces, spend fifteen minutes with the Claude VS Code extension getting an issue implemented, then put it down again.&lt;/p&gt;

&lt;h2&gt;
  
  
  When It All Clicked
&lt;/h2&gt;

&lt;p&gt;The system clicked during a weekend when I was traveling but wanted to knock out a few small issues. Using just my laptop and GitHub's web interface, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created three new issues from user feedback&lt;/li&gt;
&lt;li&gt;Had Claude prioritize them within the existing project&lt;/li&gt;
&lt;li&gt;Used Codespaces to implement the highest priority fix&lt;/li&gt;
&lt;li&gt;Got adversarial review from a different model&lt;/li&gt;
&lt;li&gt;Merged the PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What used to take half a day and leave me second-guessing every decision took under an hour — and I was confident in the output because the process held. That's the real payoff.&lt;/p&gt;

&lt;p&gt;The key insight is that sustainable AI development isn't about finding the perfect prompt — it's about building repeatable systems that prevent the expensive mistakes. Planning upfront, documenting everything, and using fresh context prevents the token-burning death spirals that make AI development feel unsustainable.&lt;/p&gt;

&lt;p&gt;Every system has overhead, but this overhead earns its keep. The planning documents become project knowledge. The adversarial reviews surface the patterns I need to improve. The structured workflow means I can develop from anywhere without losing momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next up, we're dipping our toes into &lt;a href="https://github.com/southwestmogrown/foliochat" rel="noopener noreferrer"&gt;FolioChat&lt;/a&gt; to explore how RAG actually works for developers who build things. We'll see if retrieval-augmented generation can maintain context better than conversation history, and whether it's worth the complexity for real development workflows.&lt;/p&gt;




&lt;p&gt;The full toolkit lives on GitHub: &lt;a href="https://github.com/southwestmogrown/ai-workflow-toolkit" rel="noopener noreferrer"&gt;complete-ai-workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: ai-development, claude, github, workflow, productivity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Series: Building With AI Agents — Article 4 of 12&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>github</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>From Chaos to Shipped — A Practical Workflow for Solo Developers</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Sat, 14 Mar 2026 06:13:16 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/from-chaos-to-shipped-a-practical-workflow-for-solo-developers-1h9n</link>
      <guid>https://forem.com/southwestmogrown/from-chaos-to-shipped-a-practical-workflow-for-solo-developers-1h9n</guid>
      <description>&lt;p&gt;If you've ever ended a coding session unsure what you actually finished, switched between projects and lost your train of thought, or deployed something only to realize you forgot to set an environment variable — this guide is for you.&lt;/p&gt;

&lt;p&gt;I built this &lt;a href="https://github.com/southwestmogrown/ai-workflow-toolkit/tree/main/workflow-resources" rel="noopener noreferrer"&gt;workflow&lt;/a&gt; to solve my own problems. Context-switching between projects was killing my focus, and half my best ideas were disappearing before I could act on them. I needed a system that would protect my creative momentum — not just track tasks, but give me enough structure to stay loose and keep building. What came out of that is a four-phase workflow I now use daily.&lt;/p&gt;

&lt;p&gt;Solo development is uniquely hard. There's no standup to force clarity, no PR reviewer to catch the obvious, no project manager tracking what's in flight. The chaos isn't a character flaw. It's a systems problem, and systems problems have systems solutions.&lt;/p&gt;

&lt;p&gt;This guide walks through the four phases — &lt;strong&gt;Capture → Plan → Build → Ship&lt;/strong&gt; — designed around tools you're probably already using: a text editor, GitHub, and a bit of Markdown. No new apps required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four phases at a glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkor3yhlr9e9ziavxhthy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkor3yhlr9e9ziavxhthy.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Core habit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Drop ideas in &lt;code&gt;IDEA.md&lt;/code&gt;; triage weekly; every real item becomes a GitHub issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write a feature brief (goal, scope, non-goals, done-when) before touching a branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conventional commits + a breadcrumb note before every context switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ship&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-review the diff, run the deploy checklist, write the retro note&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Phase 1: Capture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F121p65mbo9cwv3u9udae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F121p65mbo9cwv3u9udae.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first place chaos enters is the moment an idea strikes. Most developers either ignore it (gone by morning) or immediately start building (hello, scope creep). The Capture phase gives you a third option: a frictionless landing spot that doesn't interrupt what you're currently working on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The IDEA.md file
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F237irzd0fk76vndmzltp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F237irzd0fk76vndmzltp.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep a single &lt;code&gt;IDEA.md&lt;/code&gt; per project — or one global backlog. When a thought surfaces, drop a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; [2025-03-14] [QuizQuest]  add breadcrumb nav to lesson view
&lt;span class="p"&gt;-&lt;/span&gt; [2025-03-14] [BAK EOS]    auto-fill shift date from system clock
&lt;span class="p"&gt;-&lt;/span&gt; [2025-03-15] [QuizQuest]  dark mode toggle in settings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No polish. No planning. Just capture. The goal is to stop losing ideas to context, not to evaluate them in the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekly triage
&lt;/h3&gt;

&lt;p&gt;Once a week — end of Friday works well — go through &lt;code&gt;IDEA.md&lt;/code&gt; and run each item through three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which project does this belong to?&lt;/li&gt;
&lt;li&gt;What's the rough size — an hour, a day, or a sprint?&lt;/li&gt;
&lt;li&gt;Is this worth a GitHub issue? If yes, create one and delete the line. If no, delete it anyway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An idea that can't survive a 10-second triage wasn't worth building.&lt;/p&gt;

&lt;h3&gt;
  
  
  Turning an idea into a GitHub issue
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6sn7n2pq4dd4s3xlalu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6sn7n2pq4dd4s3xlalu.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every real work item lives in GitHub. When something survives triage, open an issue using a consistent template. At minimum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What&lt;/strong&gt; you're building and &lt;strong&gt;why&lt;/strong&gt; (one or two sentences)&lt;/li&gt;
&lt;li&gt;What's explicitly &lt;strong&gt;in scope&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;What's explicitly &lt;strong&gt;out of scope&lt;/strong&gt; — this is the most important line&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;definition of done&lt;/strong&gt;: how will you know it's complete?&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;task checklist&lt;/strong&gt;: the individual steps you'll take&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Label the issue (&lt;code&gt;feat&lt;/code&gt;, &lt;code&gt;fix&lt;/code&gt;, &lt;code&gt;chore&lt;/code&gt;, &lt;code&gt;docs&lt;/code&gt;), assign it to a milestone if timing matters, and link it to your project board. The issue is now the single source of truth for that work item — your planning doc, task list, and history all in one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Most solo dev chaos traces back to work that exists only in someone's head. GitHub issues externalize your intent. When you context-switch, get interrupted, or return to a project after two weeks away, the issue tells you exactly what you were building and why.&lt;/p&gt;
&lt;/blockquote&gt;
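&lt;p&gt;If you live in the terminal, the promote step can be a single command with the GitHub CLI. A minimal sketch; the function name, label, and placeholder body are illustrative, not part of the toolkit:&lt;/p&gt;

```shell
# Hypothetical helper: promote a triaged IDEA.md line into a GitHub issue.
# Requires the GitHub CLI (gh) installed and authenticated; defined, not run here.
promote_idea() {
  gh issue create \
    --title "$1" \
    --label "$2" \
    --body "What/why: $3

In scope:
Out of scope:
Done when:

Tasks:
- [ ] "
}
```

&lt;p&gt;Usage would look like &lt;code&gt;promote_idea "Add breadcrumb nav to lesson view" feat "Visitors lose their place in nested lessons"&lt;/code&gt;, leaving the scope and done-when lines to fill in on GitHub.&lt;/p&gt;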




&lt;h2&gt;
  
  
  Phase 2: Plan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmomnyl0k8l4jc6cegmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmomnyl0k8l4jc6cegmw.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Planning feels like overhead until you've spent three hours building the wrong thing. The Plan phase is short by design: answer the questions that will otherwise derail you mid-build — before you start coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write a feature brief
&lt;/h3&gt;

&lt;p&gt;Before touching a branch, write two to four sentences directly on the GitHub issue:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; what the feature accomplishes, from the user's perspective&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope:&lt;/strong&gt; what's included in this issue, specifically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-goals:&lt;/strong&gt; what's explicitly deferred to later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Done when:&lt;/strong&gt; a concrete, testable condition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The non-goals line is the most important. Writing "not touching the notification system yet" prevents the 2am realization that you've rebuilt half the app to support something that wasn't in scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build a task checklist on the issue
&lt;/h3&gt;

&lt;p&gt;GitHub issues support Markdown checkboxes. Use them. Break the feature into discrete tasks, each completable in a single session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## tasks&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [ ] design Supabase schema for coach_sessions table
&lt;span class="p"&gt;-&lt;/span&gt; [ ] write pgvector similarity search query
&lt;span class="p"&gt;-&lt;/span&gt; [ ] build useCoach hook with API call
&lt;span class="p"&gt;-&lt;/span&gt; [ ] wire hook to CoachPanel component
&lt;span class="p"&gt;-&lt;/span&gt; [ ] add loading/error states
&lt;span class="p"&gt;-&lt;/span&gt; [ ] smoke test with 3 real quiz attempts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This checklist serves three purposes: it's your to-do list while building, a progress indicator visible on the issue, and a record of what actually shipped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cut the branch
&lt;/h3&gt;

&lt;p&gt;Name branches with a consistent pattern: &lt;code&gt;feat/42-socratic-coach-ui&lt;/code&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Type: &lt;code&gt;feat&lt;/code&gt;, &lt;code&gt;fix&lt;/code&gt;, &lt;code&gt;chore&lt;/code&gt;, &lt;code&gt;docs&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Issue number — ties the branch to your planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;socratic-coach-ui&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Short slug describing the work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Never work directly on &lt;code&gt;main&lt;/code&gt;. The branch name alone should tell you what you were building and where to find the context.&lt;/p&gt;
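&lt;p&gt;Cutting the branch is then one command (git 2.23 or newer; older versions use &lt;code&gt;git checkout -b&lt;/code&gt;). The throwaway repo setup below is only so the snippet runs anywhere:&lt;/p&gt;

```shell
# Demo setup so the snippet is self-contained; in a real project you're already in the repo.
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit --allow-empty -q -m "init"

# Cut the branch with the type/issue-number/slug pattern from the table above.
git switch -c feat/42-socratic-coach-ui
git branch --show-current   # prints feat/42-socratic-coach-ui
```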




&lt;h2&gt;
  
  
  Phase 3: Build
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyc1v2vyfzz3inpc2509.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyc1v2vyfzz3inpc2509.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Build phase is where the actual work happens, and where context gets lost. Two habits make the difference: disciplined commits and breadcrumb notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conventional commit messages
&lt;/h3&gt;

&lt;p&gt;Conventional commits make your git log readable, your diffs reviewable, and automated tooling possible. The format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type(scope): short description
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;feat&lt;span class="o"&gt;(&lt;/span&gt;coach&lt;span class="o"&gt;)&lt;/span&gt;: add pgvector similarity search
fix&lt;span class="o"&gt;(&lt;/span&gt;auth&lt;span class="o"&gt;)&lt;/span&gt;: redirect loop on expired session token
chore: upgrade next to 15.2.1
docs&lt;span class="o"&gt;(&lt;/span&gt;api&lt;span class="o"&gt;)&lt;/span&gt;: document /modules endpoint
feat&lt;span class="o"&gt;(&lt;/span&gt;api&lt;span class="o"&gt;)!&lt;/span&gt;: rename /lessons to /modules    &lt;span class="c"&gt;# breaking change&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Use when...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;adding new functionality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fix&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;correcting a bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chore&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;tooling, config, dependency updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;docs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;documentation only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;refactor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;restructuring without behavior change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;adding or updating tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;perf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;performance improvements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imperative mood: "add" not "added"&lt;/li&gt;
&lt;li&gt;72 characters or fewer in the subject line&lt;/li&gt;
&lt;li&gt;One logical change per commit&lt;/li&gt;
&lt;li&gt;Reference the issue in the body: &lt;code&gt;Closes #42&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
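&lt;p&gt;These rules are mechanical enough to enforce with a hook. A sketch of the check; in a real &lt;code&gt;.git/hooks/commit-msg&lt;/code&gt; hook you'd read the subject from the message file git passes as &lt;code&gt;$1&lt;/code&gt;, and the regex here is an approximation to tune per project:&lt;/p&gt;

```shell
# Sketch: validate a commit subject against the format above.
# Type list and 72-character cap mirror the rules; tune the regex to taste.
check_subject() {
  if test "${#1}" -gt 72; then return 1; fi
  printf '%s' "$1" | grep -qE '^(feat|fix|chore|docs|refactor|test|perf)(\([a-z0-9-]+\))?!?: .+$'
}

check_subject "feat(coach): add pgvector similarity search"; echo "exit=$?"   # exit=0
check_subject "added some stuff"; echo "exit=$?"                              # exit=1
```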

&lt;h3&gt;
  
  
  The breadcrumb habit
&lt;/h3&gt;

&lt;p&gt;This is the single highest-leverage habit for multi-project solo development. Before closing your editor or switching projects, leave a note indicating exactly where you stopped and what comes next.&lt;/p&gt;

&lt;p&gt;In the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TODO: next — wire Supabase call in useCoach.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Left off: schema works, hook needs error boundary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or as a comment on the GitHub issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pausing here. Completed: schema, pgvector query.
Next: build useCoach hook, then CoachPanel wiring.
Blocker: need to check Supabase RLS policy for coach_sessions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The real cost of skipping this:&lt;/strong&gt; Without a breadcrumb, returning to a project after even two days means 20–30 minutes of re-orienting: re-reading diffs, re-opening tabs, re-figuring out intent. Multiply that across three active projects and it's a real tax on every session.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Draft PRs as checkpoints
&lt;/h3&gt;

&lt;p&gt;For features spanning multiple sessions, open a draft pull request early. It gives you a visual diff of everything committed so far, a comment thread for notes and decisions, and a psychological anchor — the work is in progress, not in limbo.&lt;/p&gt;

&lt;p&gt;Prefix the title with &lt;code&gt;[WIP]&lt;/code&gt; or use GitHub's native Draft PR feature. Convert to ready when all checklist items on the issue are ticked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Ship
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajhtvucmn45htup4mbhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajhtvucmn45htup4mbhj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shipping is where most solo projects fumble — not because the code is wrong, but because the surrounding process gets skipped. A five-minute pre-deploy check prevents the 11pm "why is production broken" investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  PR self-review before merging
&lt;/h3&gt;

&lt;p&gt;Read your own diff as if it were someone else's. Before you hit merge, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The linked GitHub issue will be closed by this merge&lt;/li&gt;
&lt;li&gt;All checklist items on the issue are ticked&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;console.log&lt;/code&gt;, debug code, or commented-out blocks remain&lt;/li&gt;
&lt;li&gt;Any changed behavior is reflected in docs or CHANGELOG&lt;/li&gt;
&lt;li&gt;The PR description explains the change for future reference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Merge via squash to keep &lt;code&gt;main&lt;/code&gt; history clean. One feature, one commit on &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deploy checklist
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgiqgnokb76usfxzxl4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgiqgnokb76usfxzxl4y.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run through a checklist before every production push. For a Next.js + Supabase stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-push&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;next build&lt;/code&gt; passes locally with zero errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-push&lt;/td&gt;
&lt;td&gt;TypeScript clean: &lt;code&gt;tsc --noEmit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-push&lt;/td&gt;
&lt;td&gt;No debug code remaining&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment&lt;/td&gt;
&lt;td&gt;All new &lt;code&gt;.env&lt;/code&gt; vars added to Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment&lt;/td&gt;
&lt;td&gt;No secrets in source code or commit history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;Migrations written and tested locally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Migrations applied to production BEFORE deploying app&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-deploy&lt;/td&gt;
&lt;td&gt;Vercel build logs green&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-deploy&lt;/td&gt;
&lt;td&gt;Smoke test the happy path in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-deploy&lt;/td&gt;
&lt;td&gt;GitHub issue closed and PR merged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Critical order:&lt;/strong&gt; Always run database migrations &lt;em&gt;before&lt;/em&gt; deploying the application code that depends on them. Deploying first creates a window where new code runs against the old schema — this is the most common source of production incidents in small Next.js + Supabase projects.&lt;/p&gt;
&lt;/blockquote&gt;
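&lt;p&gt;The pre-push rows are scriptable. A sketch assuming a Next.js + TypeScript repo with sources under &lt;code&gt;src/&lt;/code&gt;; the commands are illustrative and the function is defined, not invoked:&lt;/p&gt;

```shell
# Sketch of the pre-push checks as one gate (Next.js + TypeScript assumed).
# Wire it into a pre-push hook or CI job; adjust the commands to your stack.
predeploy_checks() {
  npx tsc --noEmit || { echo "FAIL: TypeScript errors"; return 1; }
  npx next build   || { echo "FAIL: build broken"; return 1; }
  if grep -rqn "console.log" src --include="*.ts" --include="*.tsx"; then
    echo "FAIL: debug logging left in src/"
    return 1
  fi
  echo "pre-push checks passed"
}
```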

&lt;h3&gt;
  
  
  The retro note
&lt;/h3&gt;

&lt;p&gt;After every deploy, write one sentence in &lt;code&gt;RETRO.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;2025-03-14 QuizQuest — deploy failed: migrations not run before push. Add to checklist.
2025-03-10 BAK EOS   — .env var missing in Vercel; caught it in smoke test.
2025-03-05 QuizQuest — feature took 3x longer: scope wasn't locked. Write the brief first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review &lt;code&gt;RETRO.md&lt;/code&gt; monthly. When the same note appears three times, you have a process problem worth fixing at the root. These compound into real improvements over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Files to keep in every project
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;IDEA.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Idea backlog. Triage weekly. Promote to GitHub issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RETRO.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One line per deploy. Review monthly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.github/ISSUE_TEMPLATE/feature.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Issue template. Auto-populates when you open a new issue.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
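&lt;p&gt;The template file itself can be as small as the Capture-phase fields. A sketch using GitHub's issue-template frontmatter; the name and label are illustrative:&lt;/p&gt;

```markdown
---
name: Feature
about: A work item that survived triage
labels: feat
---

**What / why:**

**In scope:**

**Out of scope:**

**Done when:**

## Tasks

- [ ] 
```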




&lt;h2&gt;
  
  
  The quick-reference cheat sheets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6brnfy61tmhaa4ieu2q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6brnfy61tmhaa4ieu2q2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the commit message reference card — keep it open in a tab until the format is second nature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat(scope): add thing          # new feature
fix(scope): correct thing       # bug fix
chore: update thing             # no prod code change
docs(scope): document thing     # docs only
refactor(scope): reshape thing  # no behavior change
feat(scope)!: rename thing      # breaking change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And branch naming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat/42-socratic-coach-ui
fix/38-auth-redirect-loop
chore/51-upgrade-next
docs/55-api-endpoint-reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;This workflow is intentionally minimal. It lives entirely in GitHub and a couple of Markdown files you can drop into any project in under a minute. No new tools, no subscriptions, no onboarding.&lt;/p&gt;

&lt;p&gt;I built it because I was tired of losing momentum between sessions and watching good ideas evaporate. Using it consistently gave me back creative space I didn't realize I'd lost — fewer "where was I?" mornings, more time actually building.&lt;/p&gt;

&lt;p&gt;The habits that move the needle most, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Breadcrumb notes&lt;/strong&gt; — the fastest fix for context-switch pain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature briefs with non-goals&lt;/strong&gt; — the fastest fix for scope creep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The deploy checklist&lt;/strong&gt; — the fastest fix for production incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RETRO.md&lt;/strong&gt; — the slowest fix, but the one that compounds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with one. Add the others as the pain points they solve become real.&lt;/p&gt;

&lt;p&gt;Grab my free interactive workflow diagram, commit cheat sheet, deploy checklist, and other helpful resources &lt;a href="https://github.com/southwestmogrown/ai-workflow-toolkit/tree/main/workflow-resources" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tooling</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Turned a Bespoke Code Reviewer Into a Skill Any Project Can Use</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Fri, 13 Mar 2026 22:00:00 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/how-i-turned-a-bespoke-code-reviewer-into-a-skill-any-project-can-use-1npo</link>
      <guid>https://forem.com/southwestmogrown/how-i-turned-a-bespoke-code-reviewer-into-a-skill-any-project-can-use-1npo</guid>
      <description>&lt;p&gt;The review layer in Code Genie works. The problem is it's hardwired — the criteria, the persona, the output format are all written specifically for that project. Every new project that wants the same quality gate has to rebuild it from scratch. That's not a system, that's a copy-paste habit. Here's how I extracted it into a reusable Skill.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Bespoke Logic
&lt;/h2&gt;

&lt;p&gt;Here's what the &lt;a href="https://github.com/southwestmogrown/code_genie" rel="noopener noreferrer"&gt;Code Genie&lt;/a&gt; reviewer looks like in its original form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Code Reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Review code critically and objectively. Identify bugs, edge cases,
    convention violations, and anything that would fail in production.
    Do not give the benefit of the doubt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior engineer with high standards and low tolerance
    for sloppy code. You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve seen too many production incidents caused by code
    that looked fine in review. You are thorough, specific, and direct.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;review_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Review the following Python code for a crewAI agent workflow.
    Check for: correct use of crewAI Agent and Task APIs, proper error handling
    for LLM failures, retry logic that won&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t loop infinitely, and clear separation
    of agent responsibilities. Return PASS with brief justification, or FAIL with
    specific actionable feedback.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS or FAIL with specific feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the task description. It's reviewing Python code for a crewAI agent workflow. It's checking for correct use of crewAI APIs. It knows the retry logic is there and checks it specifically. Every line of that task is written for Code Genie and Code Genie only.&lt;/p&gt;

&lt;p&gt;Drop this into a Next.js project and it'll review TypeScript like it's looking for crewAI API misuse. Drop it into a data pipeline and it'll flag missing retry logic that was never supposed to be there. The reviewer works — inside the project it was written for.&lt;/p&gt;

&lt;p&gt;That's the cost of bespoke logic. It's not wrong. It's just not portable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "Reusable" Actually Means
&lt;/h2&gt;

&lt;p&gt;Before extracting anything, it helps to define what you're extracting toward. A reusable Skill needs to do three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Work without project-specific assumptions baked in.&lt;/strong&gt; The Skill can't know in advance what language you're using, what framework you're on, or what your conventions look like. That information has to come from outside — from CLAUDE.md, from the task prompt, from context passed in at runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accept context from whatever project it's dropped into.&lt;/strong&gt; The Skill should have explicit slots for project-specific information. Not hardcoded defaults — actual inputs it expects to receive and knows how to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Produce output in a consistent format.&lt;/strong&gt; Whatever calls the Skill — a crewAI agent, a GitHub Actions workflow, a human reading the result — should be able to rely on the output looking the same every time. PASS or FAIL, specific feedback, actionable items. No surprises.&lt;/p&gt;

&lt;p&gt;That's the design spec. If the extracted Skill satisfies all three, it's genuinely reusable. If it satisfies two out of three, it's still bespoke with better formatting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Identifying What's Generic vs What's Project-Specific
&lt;/h2&gt;

&lt;p&gt;The extraction exercise is simple in theory: go through the original reviewer and put everything into one of two buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generic&lt;/strong&gt; — applies to any code review, regardless of project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reviewer persona (skeptical, specific, direct)&lt;/li&gt;
&lt;li&gt;The output format (PASS or FAIL with actionable feedback)&lt;/li&gt;
&lt;li&gt;The core review criteria (correctness, edge cases, readability)&lt;/li&gt;
&lt;li&gt;The instruction not to give the benefit of the doubt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project-specific&lt;/strong&gt; — belongs in CLAUDE.md or the task prompt, not the Skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The language and framework being reviewed&lt;/li&gt;
&lt;li&gt;The specific APIs being checked&lt;/li&gt;
&lt;li&gt;The architectural patterns to enforce&lt;/li&gt;
&lt;li&gt;The conventions that count as violations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the same reviewer split into those two buckets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GENERIC — goes into the Skill
&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Code Reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review code critically. Identify bugs, edge cases, and anything that would fail in production.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior engineer with high standards. You are thorough, specific, and direct.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS with brief justification, or FAIL with specific actionable feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# PROJECT-SPECIFIC — goes into the task prompt, sourced from CLAUDE.md
&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crewAI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;specific_checks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crewAI Agent and Task API usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent separation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;conventions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[from CLAUDE.md]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's split this cleanly, the Skill writes itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing the Skill
&lt;/h2&gt;

&lt;p&gt;Here's the full &lt;code&gt;SKILL_code_review.md&lt;/code&gt;, built from the generic bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code-review&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Review code critically and return a structured PASS or FAIL with&lt;/span&gt;
&lt;span class="s"&gt;specific, actionable feedback. Use this skill whenever an agent or workflow needs&lt;/span&gt;
&lt;span class="s"&gt;a quality gate on generated or submitted code. Trigger when a task involves&lt;/span&gt;
&lt;span class="s"&gt;reviewing, evaluating, or checking code before it proceeds to the next step.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Code Review Skill&lt;/span&gt;

A reusable code review layer for agent workflows and manual review tasks.

&lt;span class="gu"&gt;## Reviewer Persona&lt;/span&gt;

You are a senior software engineer with high standards and low tolerance for
code that merely looks correct. You have seen too many production incidents
caused by plausible-looking code that was never properly scrutinized. You are
thorough, specific, and direct. You do not give the benefit of the doubt.

&lt;span class="gu"&gt;## What You Always Check&lt;/span&gt;

Regardless of language or framework, every review covers:
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Correctness**&lt;/span&gt; — Does the code do what it's supposed to do?
   Does it handle the happy path? Does it handle failure?
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Edge cases**&lt;/span&gt; — What happens at the boundaries? Empty input,
   null values, unexpected types, off-by-one conditions.
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Readability**&lt;/span&gt; — Can another developer understand this code
   without asking questions?
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Side effects**&lt;/span&gt; — Does this code do anything it shouldn't?
   Modify state it doesn't own? Make calls it wasn't asked to make?

&lt;span class="gu"&gt;## What You Check From Project Context&lt;/span&gt;

The following criteria come from the project's CLAUDE.md and task prompt.
Apply them in addition to the universal criteria above:
&lt;span class="p"&gt;
-&lt;/span&gt; Language and framework conventions
&lt;span class="p"&gt;-&lt;/span&gt; Architectural patterns to enforce or avoid
&lt;span class="p"&gt;-&lt;/span&gt; Specific APIs or libraries in use
&lt;span class="p"&gt;-&lt;/span&gt; Anti-patterns explicitly called out in CLAUDE.md

&lt;span class="gu"&gt;## Output Format&lt;/span&gt;

Return exactly one of the following:

&lt;span class="gs"&gt;**PASS**&lt;/span&gt;
Brief justification (1-2 sentences). What makes this code acceptable.

&lt;span class="gs"&gt;**FAIL**&lt;/span&gt;
Specific feedback only. For each issue:
&lt;span class="p"&gt;-&lt;/span&gt; What the problem is
&lt;span class="p"&gt;-&lt;/span&gt; Where it is (function name, line reference, or code snippet)
&lt;span class="p"&gt;-&lt;/span&gt; What the fix should be

Do not return vague feedback. "This could be improved" is not actionable.
"The retry loop on line 24 has no exit condition — add a max_retries check
before the recursive call" is actionable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the complete Skill. Generic enough to work anywhere, structured enough to produce consistent output, with explicit slots for project context to plug into.&lt;/p&gt;
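&lt;p&gt;How the Skill gets into a prompt is up to the caller. One minimal approach is to read the file and strip the YAML front matter before injecting the body into an agent's context. This loader is a sketch of mine, not part of crewAI or the toolkit, and the file path is illustrative:&lt;/p&gt;

```python
from pathlib import Path

def load_skill(skill_path: str) -> str:
    """Read a Skill markdown file and return its body, minus the YAML front matter."""
    text = Path(skill_path).read_text(encoding="utf-8")
    if text.startswith("---"):
        parts = text.split("---", 2)  # "", front matter, body
        if len(parts) == 3:
            return parts[2].strip()
    return text.strip()
```

&lt;p&gt;The front matter still earns its keep at discovery time (the &lt;code&gt;description&lt;/code&gt; field tells an agent when to trigger the Skill), but the body is what belongs in the review prompt itself.&lt;/p&gt;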




&lt;h2&gt;
  
  
  Plugging It Back Into Code Genie
&lt;/h2&gt;

&lt;p&gt;With the Skill written, the original Code Genie reviewer slims down considerably. The agent definition drops to its essentials and delegates to the Skill. The project-specific criteria move into the task prompt where they belong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Code Reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Apply the code-review Skill to evaluate the submitted code.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You follow the code-review Skill process precisely.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;review_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Using the code-review Skill, review the submitted Python code.

    Project context (from CLAUDE.md):
    - Language: Python 3.11
    - Framework: crewAI
    - Conventions: agents have single responsibilities, tasks return structured output
    - Anti-patterns: no raw exception swallowing, no infinite retry loops

    Apply both the universal criteria from the Skill and the project context above.
    Return PASS or FAIL with specific feedback.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS or FAIL with specific feedback per the code-review Skill format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop still works exactly as before. The output format is identical. But the review layer is now a dependency rather than a hardcoded implementation — and that distinction matters the next time you need a quality gate somewhere else.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plugging It Into a Different Project
&lt;/h2&gt;

&lt;p&gt;Here's the same Skill wired into a completely different context — a Next.js TypeScript project with no connection to crewAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;review_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Using the code-review Skill, review the submitted TypeScript code.

    Project context (from CLAUDE.md):
    - Language: TypeScript strict
    - Framework: Next.js 16, React 19
    - Conventions: named exports, Result pattern for error handling,
      no raw throws in business logic
    - Anti-patterns: no default exports, no useEffect for data fetching,
      do not mutate props

    Apply both the universal criteria from the Skill and the project context above.
    Return PASS or FAIL with specific feedback.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS or FAIL with specific feedback per the code-review Skill format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same Skill. Different language, different framework, different conventions. The Skill adapts because the project-specific context is coming from the task prompt — sourced from CLAUDE.md — not baked into the Skill itself. The reviewer persona, the universal criteria, the output format are all unchanged.&lt;/p&gt;

&lt;p&gt;That's portability in practice.&lt;/p&gt;
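&lt;p&gt;Since the two task prompts differ only in their context block, the duplication can be factored out. A small helper, sketched under the assumption that project context arrives as plain strings sourced from CLAUDE.md (the function name is mine, not part of the toolkit):&lt;/p&gt;

```python
def build_review_description(language, framework, conventions, anti_patterns):
    """Format a code-review task description from project context."""
    return (
        f"Using the code-review Skill, review the submitted {language} code.\n\n"
        "Project context (from CLAUDE.md):\n"
        f"- Language: {language}\n"
        f"- Framework: {framework}\n"
        f"- Conventions: {conventions}\n"
        f"- Anti-patterns: {anti_patterns}\n\n"
        "Apply both the universal criteria from the Skill and the project context above.\n"
        "Return PASS or FAIL with specific feedback."
    )

# Same Skill, two projects: only the context arguments change.
crewai_desc = build_review_description(
    "Python 3.11", "crewAI",
    "agents have single responsibilities, tasks return structured output",
    "no raw exception swallowing, no infinite retry loops",
)
nextjs_desc = build_review_description(
    "TypeScript strict", "Next.js 16, React 19",
    "named exports, Result pattern for error handling",
    "no default exports, no useEffect for data fetching",
)
```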




&lt;h2&gt;
  
  
  When to Extract vs When to Leave It Bespoke
&lt;/h2&gt;

&lt;p&gt;Not everything should be a Skill. The extraction is worth the effort when all three of the following are true:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The logic is genuinely reusable.&lt;/strong&gt; If you can't imagine using it in at least two different projects, it probably belongs where it is. Extraction for its own sake adds maintenance overhead without payoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The project-specific parts can be cleanly separated.&lt;/strong&gt; If you spend more time trying to untangle what's generic from what's specific than you would just rewriting it for the next project, leave it bespoke. The extraction exercise should feel like sorting, not surgery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You'll actually use it again.&lt;/strong&gt; This one sounds obvious but it's easy to miss. Building a reusable Skill you never reach for is just premature abstraction with extra steps.&lt;/p&gt;

&lt;p&gt;If all three are true, extract it. If one or two are true, extract the concept but not necessarily the implementation — document what worked and use it as a reference when the next similar problem comes up.&lt;/p&gt;

&lt;p&gt;The code review Skill cleared all three bars easily. The reviewer persona and output format are the same everywhere. The project-specific criteria separated cleanly into the task prompt. And a quality gate on generated code is useful in every project that generates code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The code review Skill is now in the &lt;a href="https://github.com/southwestmogrown/ai-workflow-toolkit" rel="noopener noreferrer"&gt;&lt;code&gt;ai-workflow-toolkit&lt;/code&gt;&lt;/a&gt; repo alongside the CLAUDE.md generator and the README writer. Three skills, one repo, all portable.&lt;/p&gt;

&lt;p&gt;The pattern behind all three is the same: build it bespoke first, extract what's generic, make it portable. Resist the urge to abstract before you've built something real — you can't know what's generic until you've seen the specific version work.&lt;/p&gt;

&lt;p&gt;Next up: how I'm using these building blocks to wire up a full agent workflow on GitHub issues — from writing the issue to merging the PR, with Claude handling every step in between.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: python, ai, crewai, webdev&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Series: Building With AI Agents — Article 3 of 12&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>crewai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Code Genie: I Built a Self-Reviewing Code Generator with CrewAI</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Fri, 13 Mar 2026 03:56:17 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/-code-genie-i-built-a-self-reviewing-code-generator-with-crewai-1hi</link>
      <guid>https://forem.com/southwestmogrown/-code-genie-i-built-a-self-reviewing-code-generator-with-crewai-1hi</guid>
      <description>&lt;p&gt;Most AI coding tools are one-shot: you ask, they answer, you decide if it's good. That's not an agent — that's autocomplete with better vocabulary. Code Genie works differently. It writes code, reviews it, finds its own problems, and iterates until it's satisfied. I built it by hand while learning crewAI, and the process taught me more about agentic systems than any tutorial ever could.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With One-Shot Code Generation
&lt;/h2&gt;

&lt;p&gt;If you've used Claude, ChatGPT, or any AI coding tool for more than a week, you've hit the wall.&lt;/p&gt;

&lt;p&gt;You write a prompt. You get back code that looks right. You drop it in, run it, and something breaks — not catastrophically, just quietly. An edge case that wasn't handled. An assumption baked in that doesn't match your data. A function that works in isolation but fails in context.&lt;/p&gt;

&lt;p&gt;The problem isn't that the model is bad. The problem is that nobody checked the output before it reached you.&lt;/p&gt;

&lt;p&gt;Human code review exists for exactly this reason. A second set of eyes catches things the original author missed — not because the reviewer is smarter, but because they're reading it fresh, looking for problems instead of assuming it works. One-shot AI tools skip that step entirely. They generate, hand off, and move on. The gap between "looks right" and "is right" is yours to close.&lt;/p&gt;

&lt;p&gt;I wanted to close it automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an Agentic Loop Actually Is
&lt;/h2&gt;

&lt;p&gt;An agentic loop is: generate → evaluate → decide → repeat.&lt;/p&gt;

&lt;p&gt;That's it. Three steps and a condition. What makes it agentic is the decide step — something has to determine whether the output is good enough to stop, or whether it needs another pass. Without that decision layer, you just have automation. With it, you have something that can improve its own output without a human in the loop.&lt;/p&gt;

&lt;p&gt;Applied to code generation, the loop looks like this: a writer produces a snippet, a reviewer evaluates it against a set of criteria, the reviewer's verdict either passes the output or sends it back with specific feedback, and the writer tries again with that feedback in hand. The loop runs until the output passes review or hits a retry limit.&lt;/p&gt;

&lt;p&gt;Simple pattern. Surprisingly powerful in practice.&lt;/p&gt;
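&lt;p&gt;Stripped of any framework, the pattern fits in a few lines. A sketch in plain Python, where &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; stand in for the writer and reviewer calls (the names are mine, not crewAI's):&lt;/p&gt;

```python
def agentic_loop(generate, evaluate, max_retries=3):
    """generate -> evaluate -> decide -> repeat, with a hard retry cap."""
    feedback = None
    best = None
    for attempt in range(max_retries):
        output = generate(feedback)           # writer: produce, or revise with feedback
        verdict, feedback = evaluate(output)  # reviewer: PASS/FAIL plus feedback
        best = output
        if verdict == "PASS":                 # decide: good enough to stop
            return output, attempt + 1
    return best, max_retries                  # limit hit: surface the best attempt
```

&lt;p&gt;Everything the rest of this article does is fill in &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; with real agents.&lt;/p&gt;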




&lt;h2&gt;
  
  
  Why crewAI
&lt;/h2&gt;

&lt;p&gt;crewAI is a Python framework for building multi-agent systems. The core abstraction is exactly what it sounds like: you define a crew of agents, give each one a role, a goal, and a backstory, assign them tasks, and let the framework handle the orchestration.&lt;/p&gt;

&lt;p&gt;crewAI manages the execution order, passes output between agents, and keeps the loop running. You don't write the plumbing. You write the agents and the tasks, and crewAI connects them.&lt;/p&gt;

&lt;p&gt;I chose it because I wanted to learn agentic patterns hands-on, and crewAI's abstractions are close enough to how you'd think about the problem naturally that the learning curve doesn't get in the way of actually building. A writer and a reviewer are intuitive roles. Defining them as crewAI agents felt like a natural translation of the mental model.&lt;/p&gt;

&lt;p&gt;What crewAI doesn't handle — and this is important — is your judgment about what good code means. The framework will run your loop faithfully. It won't tell you whether your review criteria are any good. That part you have to figure out yourself, and that's where most of the real work lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Writer Agent
&lt;/h2&gt;

&lt;p&gt;The writer agent is the simpler of the two — its job is straightforward. Given a task, produce code that accomplishes it.&lt;/p&gt;

&lt;p&gt;In crewAI terms, the writer looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Software Engineer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write clean, correct, production-ready code that solves the given task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an experienced software engineer who writes clear,
    well-structured code. You follow established conventions, handle edge cases,
    and write code that other developers can read and maintain.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task it receives includes the prompt — what to build — plus any context about the project: language, conventions, constraints. This is where the system from article one plugs directly in. If the repo has a &lt;code&gt;CLAUDE.md&lt;/code&gt;, that context gets passed to the writer before it generates a single line. If there's a code-writing Skill, that gets included too. The writer doesn't start cold — it starts knowing your project the same way it would in any other session.&lt;/p&gt;
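&lt;p&gt;Gathering that context can be as simple as reading the files before the crew runs. A sketch, assuming a &lt;code&gt;CLAUDE.md&lt;/code&gt; at the repo root and Skills under a &lt;code&gt;skills/&lt;/code&gt; folder; both locations are my convention for illustration, not anything crewAI prescribes:&lt;/p&gt;

```python
from pathlib import Path

def project_context(repo_root: str) -> str:
    """Collect CLAUDE.md and any Skill files so the writer starts warm, not cold."""
    root = Path(repo_root)
    parts = []
    claude_md = root / "CLAUDE.md"
    if claude_md.exists():
        parts.append(claude_md.read_text(encoding="utf-8"))
    for skill in sorted(root.glob("skills/SKILL_*.md")):
        parts.append(skill.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```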

&lt;p&gt;The first pass isn't expected to be perfect. It's expected to be a serious attempt — something substantive enough for the reviewer to actually evaluate. That framing matters. You're not asking for perfection on the first try. You're asking for something worth reviewing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reviewer Agent
&lt;/h2&gt;

&lt;p&gt;The reviewer is where the real design decisions live.&lt;/p&gt;

&lt;p&gt;The naive approach is to have Claude re-read its own output and ask "is this good?" That doesn't work well. The model has too much context about what it was trying to do — it reads charitably, filling in gaps with intent rather than scrutinizing what's actually there.&lt;/p&gt;

&lt;p&gt;The fix is persona separation. The reviewer is defined as a different agent with a different goal — not "did this accomplish what was intended" but "what's wrong with this code."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Code Reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Review code critically and objectively. Identify bugs, edge cases,
    convention violations, and anything that would fail in production.
    Do not give the benefit of the doubt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior engineer with high standards and low tolerance
    for sloppy code. You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve seen too many production incidents caused by code
    that looked fine in review. You are thorough, specific, and direct.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backstory isn't decoration — it shapes behavior. A reviewer told to be skeptical and specific produces meaningfully different feedback than one told to be helpful. Same model, different instruction set, genuinely different output.&lt;/p&gt;

&lt;p&gt;The reviewer's task is to return one of two things: a pass with brief justification, or a fail with specific, actionable feedback. Vague feedback — "this could be improved" — is useless in a loop. The writer needs to know exactly what to fix.&lt;/p&gt;
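&lt;p&gt;Whatever drives the loop also has to act on that verdict, so it pays to parse it defensively and treat anything ambiguous as a failure. A sketch of how that might look; the verdict layout follows the task's &lt;code&gt;expected_output&lt;/code&gt;, and the function itself is mine:&lt;/p&gt;

```python
def parse_verdict(review: str):
    """Split a review into (passed, feedback). Anything that isn't a clear PASS fails."""
    head, _sep, rest = review.strip().partition("\n")
    token = head.strip().strip("*").upper()  # tolerate "**PASS**" style markdown bold
    if token.startswith("PASS"):
        return True, rest.strip()
    return False, rest.strip() or review.strip()
```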




&lt;h2&gt;
  
  
  The Loop
&lt;/h2&gt;

&lt;p&gt;With the writer and reviewer defined, the loop itself is surprisingly clean in crewAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;write_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;review_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sequential process means the writer runs first, hands off to the reviewer, and the reviewer's output determines what happens next. If the review passes, the loop exits and returns the code. If it fails, the feedback gets passed back to the writer as context for the next attempt.&lt;/p&gt;

&lt;p&gt;The retry limit is non-negotiable. Without it, a loop that keeps failing will keep running — and Claude API calls are not free. In practice, three iterations is the sweet spot. If the code hasn't passed review by the third attempt, something is wrong with either the task definition or the review criteria, and a human should look at it. The loop surfaces the best attempt and stops.&lt;/p&gt;
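&lt;p&gt;crewAI's sequential process runs the two tasks once per kickoff; the outer retry lives in your own code. A minimal wrapper, assuming the reviewer's verdict is the crew's final output and that the task templates accept &lt;code&gt;prompt&lt;/code&gt; and &lt;code&gt;feedback&lt;/code&gt; inputs (both keys are illustrative, not fixed by the framework):&lt;/p&gt;

```python
MAX_RETRIES = 3

def run_with_retries(crew, prompt: str) -> str:
    """Re-run the write/review crew until the reviewer passes or the cap is hit."""
    feedback = ""
    result = ""
    for _ in range(MAX_RETRIES):
        result = str(crew.kickoff(inputs={"prompt": prompt, "feedback": feedback}))
        if result.strip().upper().startswith("PASS"):
            return result
        feedback = result  # the reviewer's FAIL notes become context for the next attempt
    return result  # retry limit hit: surface the last attempt for a human
```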

&lt;p&gt;What you learn quickly is that the loop is only as good as the reviewer's criteria. A reviewer that's too strict loops forever. A reviewer that's too lenient rubber-stamps bad code. Calibrating that bar — specific enough to catch real problems, reasonable enough to actually pass good code — is where most of the iteration happens when you're building this.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Building It By Hand Taught Me
&lt;/h2&gt;

&lt;p&gt;The honest answer is that the first version was too strict.&lt;/p&gt;

&lt;p&gt;The reviewer's criteria were thorough — maybe too thorough. It flagged missing docstrings on every function, objected to variable names it considered insufficiently descriptive, and refused to pass anything that didn't include explicit error handling regardless of whether the task called for it. Almost nothing passed on the first review. The loop ran to the retry limit constantly, burning tokens on debates about naming conventions instead of catching real bugs.&lt;/p&gt;

&lt;p&gt;The fix was humbling: I had to think harder about what "good code" actually means in context. A reviewer that holds a utility snippet to production application standards isn't useful — it's just expensive. Tightening the review criteria to focus on correctness and edge cases, and leaving style to the &lt;code&gt;CLAUDE.md&lt;/code&gt; conventions layer, is what made the loop productive rather than just busy.&lt;/p&gt;

&lt;p&gt;The second thing I learned is that verbose output is worth the cost while you're building. crewAI's &lt;code&gt;verbose=True&lt;/code&gt; flag logs every agent action, every task handoff, every decision point. It's noisy. It's also the only way to understand why the loop is behaving the way it is. I left it on far longer than I needed to, and I don't regret it. You can't debug a loop you can't see.&lt;/p&gt;

&lt;p&gt;The third thing — and this one surprised me — is that the reviewer persona genuinely changes the output. I expected the separation to be mostly cosmetic, a way of organizing prompts cleanly. It's not. A reviewer told to be skeptical and specific catches things that a neutral re-read misses. The backstory isn't decoration. It's instruction.&lt;/p&gt;

&lt;p&gt;The moment it clicked was a loop run where the reviewer caught an off-by-one error the writer had introduced in a retry — an error that wasn't in the first attempt, only appeared in the revision, and would have been invisible in a one-shot workflow. That's the whole point, in one example.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/southwestmogrown/code_genie" rel="noopener noreferrer"&gt;Code Genie&lt;/a&gt; is still early. The loop works, the output is measurably better than one-shot generation for the cases it's designed for, and the crewAI abstractions held up well enough that I'd use the framework again.&lt;/p&gt;

&lt;p&gt;But the review layer is still bespoke — written specifically for Code Genie, not portable to other projects. The next logical step is extracting it into a reusable Skill: a code review Skill that any project can plug into its own workflow without building the full loop from scratch. That's the next article.&lt;/p&gt;

&lt;p&gt;If you're thinking about building something similar, start with the reviewer, not the writer. The writer is easy — Claude already knows how to write code. The hard part is defining what "good enough" means precisely enough that a loop can act on it. Get that right first and the rest follows.&lt;/p&gt;
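
&lt;p&gt;As a starting point, "good enough" might be pinned down in a short review Skill like this (the categories and wording here are illustrative, not Code Genie's actual rubric):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# SKILL: Code Review&lt;/span&gt;

&lt;span class="gu"&gt;## Review for&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Correctness: does the change do what the task asked?
&lt;span class="p"&gt;2.&lt;/span&gt; Edge cases: empty inputs, boundary values, error paths
&lt;span class="p"&gt;3.&lt;/span&gt; Scope: flag anything modified outside the task's scope

&lt;span class="gu"&gt;## Do not review for&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Style or formatting (the conventions layer owns that)
&lt;span class="p"&gt;-&lt;/span&gt; Hypothetical future requirements

&lt;span class="gu"&gt;## Verdict&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; APPROVE, or REVISE with a numbered list of specific, actionable issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Criteria a loop can act on are binary or enumerable; "make it better" is neither.&lt;/p&gt;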




&lt;p&gt;&lt;em&gt;Tags: crewai, agentic-ai, code-generation, ai-code-review, python&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Series: Building With AI Agents — Article 2 of 12&lt;/em&gt;&lt;/p&gt;

</description>
      <category>crewai</category>
      <category>agents</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How I Made Claude Actually Understand My Codebase</title>
      <dc:creator>Shane Wilkey</dc:creator>
      <pubDate>Thu, 12 Mar 2026 08:06:18 +0000</pubDate>
      <link>https://forem.com/southwestmogrown/how-i-made-claude-actually-understand-my-codebase-436c</link>
      <guid>https://forem.com/southwestmogrown/how-i-made-claude-actually-understand-my-codebase-436c</guid>
      <description>&lt;p&gt;Most developers use Claude like a smart search engine — paste code, get answer, move on. That works until you realize how much you're paying for it. Token costs add up fast when Claude has to rediscover your codebase from scratch every single session. A &lt;code&gt;CLAUDE.md&lt;/code&gt; file and a &lt;code&gt;/skills&lt;/code&gt; folder fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you've been using Claude as a development tool, you've probably noticed a pattern. You open a new conversation, paste in some code, ask your question, get a decent answer, and close the tab. Next time you need help, you start over. Paste the code again. Re-explain the context again. Watch Claude make the same reasonable-sounding wrong assumptions about your stack again.&lt;/p&gt;

&lt;p&gt;That cycle has a cost — and I don't just mean the token bill, though that's real. Every time Claude starts cold, you're paying to re-establish context that should already be there. You're burning tokens on boilerplate explanation instead of actual problem solving. And the suggestions you get back are generic, because Claude is pattern-matching against codebases it's seen before — not yours.&lt;/p&gt;

&lt;p&gt;This isn't a Claude problem. It's a workflow problem. Claude has no memory between sessions by design. Without a system to feed it context, you'll keep getting capable-but-wrong answers no matter how good the model gets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;The first piece of the system is a file called &lt;code&gt;CLAUDE.md&lt;/code&gt;. It lives in the root of your repository, and Claude reads it automatically at the start of every session. No prompting required. No copy-pasting context into the chat. It's just there.&lt;/p&gt;

&lt;p&gt;Think of it as an onboarding document — except instead of writing it for a new team member, you're writing it for your AI collaborator. The goal is the same: get them up to speed on what this project is, how it's structured, and what the rules of the road are before they touch a single line of code.&lt;/p&gt;

&lt;p&gt;A good &lt;code&gt;CLAUDE.md&lt;/code&gt; answers the questions Claude would otherwise have to guess at. What's the stack? What's the architecture? What patterns do we follow here? What should never happen in this codebase? Where does business logic live? What does a good commit message look like?&lt;/p&gt;

&lt;p&gt;Without it, Claude fills those gaps with assumptions. With it, Claude walks into every session already knowing your project the way a senior dev would after a solid onboarding week.&lt;/p&gt;

&lt;p&gt;The investment is maybe an hour to write it the first time. The return is every future session starting from the right place instead of zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Goes In It
&lt;/h2&gt;

&lt;p&gt;There's no rigid spec for a &lt;code&gt;CLAUDE.md&lt;/code&gt; — but after building several of them across different projects, a few sections show up every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack and environment.&lt;/strong&gt; List your languages, frameworks, major libraries, and any tooling that isn't obvious from the file structure. Include versions where they matter. Claude should never have to guess whether you're on React 17 or 19, or whether you're using Prisma or raw SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture overview.&lt;/strong&gt; A short description of how the project is structured at a high level. Is it a monolith or microservices? Where does the frontend live relative to the backend? Are there multiple services that talk to each other? One paragraph is usually enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conventions.&lt;/strong&gt; This is where you save the most tokens over time. Naming conventions, file organization patterns, how you handle errors, how state is managed, what a component is expected to look like. The more explicit you are here, the less Claude improvises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-patterns.&lt;/strong&gt; Equally important — what should never happen. Don't split this component. Don't add a new dependency without flagging it. Don't touch this file. Explicit guardrails prevent a whole category of suggestions you'd just have to reject anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent instructions.&lt;/strong&gt; If you're using Claude as a coding agent assigned to GitHub issues, this section matters a lot. Tell it exactly what a completed issue looks like. Scope what it's allowed to do. My biggest lesson here: agents that do full file audits on every issue are expensive and slow. A single line in &lt;code&gt;CLAUDE.md&lt;/code&gt; that says "do not audit files unrelated to the current issue" cuts that waste immediately.&lt;/p&gt;

&lt;p&gt;Here's a minimal template to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# CLAUDE.md&lt;/span&gt;

&lt;span class="gu"&gt;## Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Framework: Next.js 16, React 19
&lt;span class="p"&gt;-&lt;/span&gt; Language: TypeScript (strict)
&lt;span class="p"&gt;-&lt;/span&gt; Database: PostgreSQL via Prisma 7
&lt;span class="p"&gt;-&lt;/span&gt; Styling: Tailwind CSS v4
&lt;span class="p"&gt;-&lt;/span&gt; Testing: Vitest, Playwright
&lt;span class="p"&gt;-&lt;/span&gt; Package manager: pnpm

&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
Monolithic Next.js App Router application. All DB routes are force-dynamic.
Business logic lives in /lib. Components in /components. API routes in /app/api.

&lt;span class="gu"&gt;## Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use named exports for components
&lt;span class="p"&gt;-&lt;/span&gt; Co-locate tests with the files they test
&lt;span class="p"&gt;-&lt;/span&gt; Error handling via Result pattern — no raw throws in business logic

&lt;span class="gu"&gt;## Anti-patterns&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not split single-file components unless explicitly asked
&lt;span class="p"&gt;-&lt;/span&gt; Do not install new dependencies without flagging them first
&lt;span class="p"&gt;-&lt;/span&gt; Do not modify /prisma/schema.prisma without explicit instruction

&lt;span class="gu"&gt;## Agent Instructions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Work only within the scope of the assigned issue
&lt;span class="p"&gt;-&lt;/span&gt; Do not audit files unrelated to the current task
&lt;span class="p"&gt;-&lt;/span&gt; A completed issue includes passing lint, typecheck, and tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adapt it to your project. The goal isn't completeness — it's giving Claude enough signal to stop guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Skill System
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; solves the context problem. But there's a second problem it doesn't address: repetitive procedures.&lt;/p&gt;

&lt;p&gt;Every project has tasks that follow the same pattern every time. Fixing a bug. Writing a README. Building a new component. Reviewing a pull request. Without any guidance, Claude approaches each of these from scratch — and the output varies based on how well you happened to phrase the prompt that day.&lt;/p&gt;

&lt;p&gt;Skills fix that.&lt;/p&gt;

&lt;p&gt;A Skill is a Markdown file that teaches Claude a specific, repeatable process. It lives in a &lt;code&gt;/skills&lt;/code&gt; folder in your repo, and you reference it when you need it — either in a prompt, or in an agent's instructions. Where &lt;code&gt;CLAUDE.md&lt;/code&gt; gives Claude permanent knowledge about your project, a Skill gives it a procedure to follow for a specific type of task.&lt;/p&gt;

&lt;p&gt;Here's what a bug fix Skill might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# SKILL: Bug Fix&lt;/span&gt;

&lt;span class="gu"&gt;## Process&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Read the issue description and identify the expected vs actual behavior
&lt;span class="p"&gt;2.&lt;/span&gt; Locate the relevant file(s) — do not audit unrelated files
&lt;span class="p"&gt;3.&lt;/span&gt; Identify the root cause before writing any code
&lt;span class="p"&gt;4.&lt;/span&gt; Write the minimal fix that resolves the issue without side effects
&lt;span class="p"&gt;5.&lt;/span&gt; Add or update tests to cover the fixed behavior
&lt;span class="p"&gt;6.&lt;/span&gt; Verify lint and typecheck pass before marking complete

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not refactor code outside the scope of the bug
&lt;span class="p"&gt;-&lt;/span&gt; Do not change function signatures unless the bug requires it
&lt;span class="p"&gt;-&lt;/span&gt; If the fix requires touching more than 3 files, flag it before proceeding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No magic. Just a documented process in a format Claude can read and follow consistently.&lt;/p&gt;

&lt;p&gt;The real power is that Skills are modular and reusable. Write a README Skill once and every project gets the same quality README. Write a component Skill and every new component follows your conventions without you having to re-explain them. Stack Skills together and you're not just giving Claude context — you're giving it a repeatable workflow.&lt;/p&gt;
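
&lt;p&gt;A README Skill, for example, can be just as small as the bug fix one (the section names here are one reasonable choice, not a required format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# SKILL: README&lt;/span&gt;

&lt;span class="gu"&gt;## Process&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Identify the project name, scripts, and entry points from the manifest
&lt;span class="p"&gt;2.&lt;/span&gt; Summarize what the project does in two sentences, without marketing language
&lt;span class="p"&gt;3.&lt;/span&gt; Document install, run, and test commands exactly as the scripts define them
&lt;span class="p"&gt;4.&lt;/span&gt; List the environment variables the code actually reads

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not invent features that are not in the code
&lt;span class="p"&gt;-&lt;/span&gt; Keep the README under 150 lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;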

&lt;p&gt;Here's where it gets interesting: Skills can also generate other parts of the system. I built a &lt;code&gt;CLAUDE.md&lt;/code&gt; generator Skill — you point it at a new project, answer a few questions, and it produces a complete, properly structured &lt;code&gt;CLAUDE.md&lt;/code&gt; ready to drop in. No remembering the syntax. No staring at a blank file. The link is in the closing section.&lt;/p&gt;
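
&lt;p&gt;Referencing a Skill is as simple as naming the file in your prompt, along these lines (the file path is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use skills/claude-md-generator.md to build a CLAUDE.md for this repo.
Ask me your questions one at a time, then write the file.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;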

&lt;p&gt;And at the meta level, there's a skill-builder Skill — a Skill that helps you write new Skills. You describe what you want Claude to be able to do repeatedly, and it walks you through drafting, testing, and refining the Skill until it's solid. The system is fully composable. Once you're in it, you're building your own AI workflow toolkit.&lt;/p&gt;

&lt;p&gt;This is where the system starts to feel less like prompting and more like onboarding a reliable collaborator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;On their own, &lt;code&gt;CLAUDE.md&lt;/code&gt; and Skills are both useful. Together, they cover the two failure modes that make Claude frustrating: starting cold, and improvising the process.&lt;/p&gt;

&lt;p&gt;Here's the mental model: &lt;code&gt;CLAUDE.md&lt;/code&gt; is the onboarding. Skills are the SOPs, the standard operating procedures.&lt;/p&gt;

&lt;p&gt;When a new employee joins a team, you don't just hand them a list of procedures and send them off. You first get them up to speed on what the company does, how it's structured, and what the culture is. Then you hand them the playbooks for the specific tasks they'll be doing. One without the other leaves gaps.&lt;/p&gt;

&lt;p&gt;Claude works the same way. A Skill without &lt;code&gt;CLAUDE.md&lt;/code&gt; means Claude knows how to fix a bug generically, but not how your project specifically handles errors, or what files are off limits. A &lt;code&gt;CLAUDE.md&lt;/code&gt; without Skills means Claude knows your codebase but still improvises the process every time.&lt;/p&gt;

&lt;p&gt;Together, every session starts with Claude already knowing your project — and every task starts with Claude already knowing how to approach it.&lt;/p&gt;

&lt;p&gt;In practice, my workflow looks like this. Each repo has a &lt;code&gt;CLAUDE.md&lt;/code&gt; in the root. A &lt;code&gt;/skills&lt;/code&gt; folder holds Skills for the recurring task types on that project. When I open a session or assign an issue to an agent, both are already there. I don't re-explain anything. I don't re-establish context. I just describe the problem and get to work.&lt;/p&gt;
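
&lt;p&gt;The resulting layout is nothing exotic. Sketched out, with illustrative Skill file names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-project/
├── CLAUDE.md            # read automatically at session start
├── skills/
│   ├── bug-fix.md
│   ├── readme.md
│   └── new-component.md
├── app/
├── components/
└── lib/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;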

&lt;p&gt;The token savings are real. The consistency is real. But the bigger shift is psychological — Claude stops feeling like a tool you have to wrangle and starts feeling like a collaborator who already knows the codebase.&lt;/p&gt;

&lt;p&gt;That's the whole system. And the best part is you can start with just the &lt;code&gt;CLAUDE.md&lt;/code&gt; today, in under an hour, and feel the difference immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start Here
&lt;/h2&gt;

&lt;p&gt;Most developers using Claude daily are paying — in tokens, in time, in frustration — for a workflow problem with a straightforward fix.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;CLAUDE.md&lt;/code&gt; in your current project root. The template above is a starting point. Or use the &lt;code&gt;CLAUDE.md&lt;/code&gt; generator Skill and let Claude build it from your answers. An imperfect one written today beats a perfect one you never get around to.&lt;/p&gt;

&lt;p&gt;Pick one recurring task and write a Skill for it. Bug fixing is the obvious first choice. README generation is another easy win. If you're stuck, the skill-builder Skill walks you through the process.&lt;/p&gt;

&lt;p&gt;The first session after the setup is usually when it clicks.&lt;/p&gt;

&lt;p&gt;Both the &lt;code&gt;CLAUDE.md&lt;/code&gt; template and the generator Skill are available in my &lt;a href="https://github.com/southwestmogrown/ai-workflow-toolkit" rel="noopener noreferrer"&gt;ai-workflow-toolkit&lt;/a&gt;. Fork them, adapt them, make them yours. If you build something interesting on top of this system, I'd genuinely like to see it.&lt;/p&gt;

&lt;p&gt;This is the first article in a series on building with AI agents. Next up: I built an agentic loop that writes code, reviews it, and iterates on its own output — here's how Code Genie works and what it gets right that most AI coding tools get wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: claude-md, claude, ai-coding-tools, developer-productivity, llm-context-management&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Series: Building With AI Agents — Article 1 of 12&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
