<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: yukihiro amadatsu</title>
    <description>The latest articles on Forem by yukihiro amadatsu (@suruseas).</description>
    <link>https://forem.com/suruseas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F852203%2Fe8e2c8e4-a9f6-4726-9878-22f90505518c.jpeg</url>
      <title>Forem: yukihiro amadatsu</title>
      <link>https://forem.com/suruseas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/suruseas"/>
    <language>en</language>
    <item>
      <title>Stop Counting Prompts — Start Reflecting on AI Fluency</title>
      <dc:creator>yukihiro amadatsu</dc:creator>
      <pubDate>Sat, 28 Mar 2026 05:03:07 +0000</pubDate>
      <link>https://forem.com/suruseas/stop-counting-prompts-start-reflecting-on-ai-fluency-32n4</link>
      <guid>https://forem.com/suruseas/stop-counting-prompts-start-reflecting-on-ai-fluency-32n4</guid>
      <description>&lt;h2&gt;
  
  
  "I'm the best at piloting this thing!"
&lt;/h2&gt;

&lt;p&gt;There's a famous line from a Japanese mecha anime — the protagonist screams: &lt;strong&gt;「僕が一番ガンダムをうまく使えるんだ！」&lt;/strong&gt; — &lt;em&gt;"I'm the one who can pilot this Gundam the best!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you use AI coding tools every day, you've probably felt something similar. That sense of clicking with the AI. Knowing you're getting more out of it than most people around you.&lt;/p&gt;

&lt;p&gt;But how do you show that?&lt;/p&gt;

&lt;p&gt;"Look at my PR count"? "Check how many lines I generated"? That's not it. Those numbers don't capture the &lt;em&gt;feel&lt;/em&gt; of working well with AI. That nagging gap between what you &lt;em&gt;know&lt;/em&gt; and what you can &lt;em&gt;prove&lt;/em&gt; is what got me started.&lt;/p&gt;

&lt;h2&gt;
  
  
  My own question, answered
&lt;/h2&gt;

&lt;p&gt;In my previous post, I asked:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://dev.to/suruseas/i-deleted-66-of-my-ai-coding-guide-heres-what-survived-55i6" rel="noopener noreferrer"&gt;I Deleted 66% of My AI Coding Guide — Here's What Survived&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Is your team measuring AI coding productivity by any of these?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Common metric&lt;/th&gt;
&lt;th&gt;What it actually rewards&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code generated&lt;/td&gt;
&lt;td&gt;Volume targets promote bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of prompts per session&lt;/td&gt;
&lt;td&gt;High count may signal poor instructions, not hard work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response speed&lt;/td&gt;
&lt;td&gt;Penalizes people who think before they ask&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit count&lt;/td&gt;
&lt;td&gt;Easily inflated by splitting work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of AI tools adopted&lt;/td&gt;
&lt;td&gt;Using ≠ using well&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;They all measure &lt;em&gt;quantity&lt;/em&gt;. But whether you're actually good at working with AI never shows up in volume metrics.&lt;/p&gt;

&lt;p&gt;That article laid out three enduring principles: &lt;strong&gt;keep things reversible, make intent explicit, verify outputs&lt;/strong&gt;. AI Fluency is my attempt to turn those into a structured self-reflection — not a score that ranks you, but a mirror that shows you &lt;em&gt;how&lt;/em&gt; you collaborate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Fluency"?
&lt;/h2&gt;

&lt;p&gt;Fluency — like language fluency. When you're fluent, conversation flows naturally. You don't stumble, backtrack, or struggle to express what you mean.&lt;/p&gt;

&lt;p&gt;Working with AI has a similar feel. When it's going well, your instructions and the AI's output click, and the work just &lt;em&gt;flows&lt;/em&gt;. When it's not, you're stuck in loops of corrections and rework.&lt;/p&gt;

&lt;p&gt;AI Fluency tries to visualize that — &lt;em&gt;how naturally you collaborate with AI&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's what I built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsar5mxi2rwmlzujgco2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsar5mxi2rwmlzujgco2o.png" alt="AI Fluency scorecard showing a 5-axis radar chart with style type, rank, and axis scores" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My result: &lt;strong&gt;The Explorer&lt;/strong&gt; — the type that explores new ways of using AI. Strong in Breadth, with room to grow in Precision.&lt;/p&gt;

&lt;p&gt;Full ability sheet: &lt;a href="https://github.com/suruseas/ai-fluency/blob/main/output/profile.md" rel="noopener noreferrer"&gt;profile.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/suruseas/ai-fluency" rel="noopener noreferrer"&gt;github.com/suruseas/ai-fluency&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Also available in Japanese on &lt;a href="https://qiita.com/suruseas/items/1f4f701f439fded0cb40" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A Different Yardstick: 5 Axes
&lt;/h2&gt;

&lt;p&gt;By breaking down what "fluency" actually means in practice, I landed on five dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Setting up the environment so the AI can do its best work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Communicating intent clearly with minimal back-and-forth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Steering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Guiding AI output in the right direction; judging quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actually delivering value through AI collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Breadth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Using AI's capabilities across diverse tasks, not just one pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first three map to the enduring principles from the previous article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Make intent explicit" → &lt;strong&gt;Context Design&lt;/strong&gt; + &lt;strong&gt;Precision&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;"Verify outputs" → &lt;strong&gt;Steering&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The remaining two go further — asking whether the collaboration actually produces results (&lt;strong&gt;Output&lt;/strong&gt;) and whether you're using AI's full range or stuck in a single pattern (&lt;strong&gt;Breadth&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Not "fewer prompts is better" — but "can you get it right with fewer exchanges?" Not "more commits" — but "are you actually achieving your goals?" That shift — from quantity to quality — is the whole point.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on methodology
&lt;/h3&gt;

&lt;p&gt;These five axes weren't derived from a literature review or formal research. They emerged from iterating with AI itself — breaking down what "good collaboration" felt like across dozens of my own sessions, then pressure-testing the categories until they stopped overlapping. It's an opinionated framework, not a scientific instrument. I think that's okay for a self-reflection tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  16 Style Types — A Bit of Fun That Stuck
&lt;/h2&gt;

&lt;p&gt;After generating the 5-axis scores, I realized raw numbers are hard to talk about. "My Context Design is 76 and Breadth is 96" — not exactly cocktail-party material.&lt;/p&gt;

&lt;p&gt;So I put together a personality-type system on a whim — classifying people by the &lt;em&gt;shape&lt;/em&gt; of their radar chart. It turned out to be surprisingly intuitive, so it stuck.&lt;/p&gt;

&lt;p&gt;The types are determined by &lt;strong&gt;which axes stand out&lt;/strong&gt;, not by how high your scores are. It's about style, not rank. Here are a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Sniper&lt;/strong&gt; (Precision) — Minimum input, maximum output. One-shot instructions that just work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Architect&lt;/strong&gt; (Context Design) — Master of setting the stage. The AI barely needs to ask questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Explorer&lt;/strong&gt; (Breadth) — Always finding new ways to use AI. First to try MCP, plugins, sub-agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Surgeon&lt;/strong&gt; (Precision + Steering) — Precision and finesse for tough problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Virtuoso&lt;/strong&gt; (Balanced) — Well-rounded across all axes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 16 types in total (1 balanced + 5 primary + 10 hybrid). &lt;a href="https://github.com/suruseas/ai-fluency" rel="noopener noreferrer"&gt;See the full list in the repo.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High scores across the board aren't the goal — every shape has meaning.&lt;/p&gt;
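&lt;p&gt;To make "which axes stand out" a little more concrete, here is a minimal sketch. It assumes &lt;code&gt;output/scores.json&lt;/code&gt; exposes the five axis scores as a flat object of numbers, which is purely an illustrative assumption; the real schema is whatever the repo's scoring step writes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Purely illustrative: assumes a flat shape like {"context_design": 76, ..., "breadth": 96}.
# The actual scores.json schema is defined by the repo's scoring step.
jq -r 'to_entries | sort_by(-.value) | .[0] | "dominant axis: \(.key) (\(.value))"' output/scores.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The ten hybrid types pair two standout axes (like The Surgeon's Precision + Steering), so the same idea extends to the top two entries rather than just the first.&lt;/p&gt;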

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;This is a &lt;strong&gt;self-reflection tool&lt;/strong&gt;, not a performance metric. A couple of things to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scores vary between runs.&lt;/strong&gt; The qualitative assessment uses an LLM, so results aren't deterministic. That's a tradeoff of using AI-based evaluation — I leaned into it by emphasizing &lt;em&gt;shape&lt;/em&gt; over absolute numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Style is personal, not comparable.&lt;/strong&gt; Common rubric, but not an identical scale. Your "72" and someone else's "72" don't mean the same thing. The radar chart shape is what matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also an inherent circularity worth naming: the tool uses an LLM to evaluate how well you work with an LLM. It may have blind spots — for example, favoring verbose sessions over terse-but-expert ones. I don't have a fix for that yet, but I think the transparency of the framework (all scoring logic is in the repo) helps.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;It currently supports &lt;strong&gt;Claude Code&lt;/strong&gt; session data. The five axes themselves are agent-agnostic by design — support for other agents is planned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone &amp;amp; setup&lt;/span&gt;
git clone https://github.com/suruseas/ai-fluency.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-fluency
npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# 2. Generate session analysis in Claude Code&lt;/span&gt;
claude&amp;gt; /insights

&lt;span class="c"&gt;# 3. Generate your scorecard (English output)&lt;/span&gt;
npm run score           &lt;span class="c"&gt;# → output/scores.json&lt;/span&gt;
npm run card            &lt;span class="c"&gt;# → output/card-dark.svg, card-light.svg&lt;/span&gt;
npm run profile         &lt;span class="c"&gt;# → output/profile.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Or even simpler:&lt;/strong&gt; if you're already in a Claude Code session, just type &lt;strong&gt;&lt;code&gt;/ai-fluency&lt;/code&gt;&lt;/strong&gt; — it handles everything in one shot.&lt;/p&gt;

&lt;p&gt;This produces SVG cards (dark/light themes) and a Markdown ability sheet in &lt;code&gt;output/&lt;/code&gt;. To embed the card in your GitHub README, &lt;a href="https://github.com/suruseas/ai-fluency#embedding-in-your-readme" rel="noopener noreferrer"&gt;see the instructions in the repo&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;This started as my answer to a question I posed in my previous article: if lines of code and prompt counts are the wrong metrics, what &lt;em&gt;should&lt;/em&gt; we look at?&lt;/p&gt;

&lt;p&gt;My answer: &lt;strong&gt;Context Design, Precision, Steering, Output, Breadth&lt;/strong&gt; — the quality of human-AI collaboration, not the volume.&lt;/p&gt;

&lt;p&gt;It's not a perfect tool. But if it makes you stop and think, "Huh, so &lt;em&gt;that's&lt;/em&gt; how I work with AI" — that's enough.&lt;/p&gt;

&lt;p&gt;If you use Claude Code, it takes about 2 minutes. Drop your type in the comments — I'll compile the dev.to distribution in a follow-up post!&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/suruseas" rel="noopener noreferrer"&gt;
        suruseas
      &lt;/a&gt; / &lt;a href="https://github.com/suruseas/ai-fluency" rel="noopener noreferrer"&gt;
        ai-fluency
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      AI Fluency - Score your AI collaboration style across 5 axes
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AI Fluency&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;A tool that generates an "AI fluency" scorecard you can embed in your GitHub profile.&lt;/p&gt;

&lt;p&gt;It scores your collaboration style with AI agents across 5 axes and outputs an SVG card and an ability sheet. The axes are designed to be agent-agnostic, but at the moment it supports session analysis data from &lt;strong&gt;Claude Code&lt;/strong&gt; (generated with &lt;code&gt;/insights&lt;/code&gt;).&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Example card&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
  &lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsuruseas%2Fai-fluency%2F.%2Foutput%2Fcard-light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsuruseas%2Fai-fluency%2F.%2Foutput%2Fcard-light.svg" alt="AI Fluency Card"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Prerequisites&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Currently supported agent: &lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="nofollow noopener noreferrer"&gt;Claude Code&lt;/a&gt; is installed&lt;/li&gt;
&lt;li&gt;You have Claude Code session history (needed as the data to analyze)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Support for other AI agents is planned.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Try it&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;1. Clone the repository and set it up&lt;/h3&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;git clone https://github.com/suruseas/ai-fluency.git
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; ai-fluency
npm install&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;2. Generate the facets data&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;/insights&lt;/code&gt; in Claude Code. Your session history is analyzed and facets data (JSON) is written to &lt;code&gt;~/.claude/usage-data/facets/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;claude&amp;gt; /insights
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The following steps cannot run without the facets data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;3. Generate your scorecard&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;Start Claude Code in the cloned directory and run &lt;code&gt;/ai-fluency&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; ai-fluency
claude&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;claude&amp;gt; /ai-fluency
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using the most recent 3 months of data, everything from score calculation to card and ability-sheet generation runs automatically.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Running the steps manually&lt;/h3&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Calculate the quantitative scores only (written to output/scores.json)&lt;/span&gt;
npm run score

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Generate the SVG cards (output/card-dark.svg, card-light.svg)&lt;/span&gt;
npm run card:ja     &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Japanese version&lt;/span&gt;
npm run card        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; English version (default)&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Generate the ability sheet (output/profile.md)&lt;/span&gt;
npm run profile:ja  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Japanese version&lt;/span&gt;
npm run profile     &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; English version (default)&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;All generated files are written to the &lt;code&gt;output/&lt;/code&gt; directory.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Caution:&lt;/strong&gt; By design the generated files contain no project-specific information, but please confirm that &lt;code&gt;output/profile.md&lt;/code&gt; contains no sensitive information before publishing it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;4. Embed it in your README&lt;/h3&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-text-md notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;picture&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;source&lt;/span&gt; &lt;span class="pl-e"&gt;media&lt;/span&gt;=&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;(prefers-color-scheme: dark)&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt; &lt;span class="pl-e"&gt;srcset&lt;/span&gt;=&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;https://raw.githubusercontent.com/suruseas/ai-fluency/main/output/card-dark.svg&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-e"&gt;src&lt;/span&gt;=&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;https://raw.githubusercontent.com/suruseas/ai-fluency/main/output/card-light.svg&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt; &lt;span class="pl-e"&gt;alt&lt;/span&gt;=&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;AI Fluency&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;picture&lt;/span&gt;&amp;gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;About the scores&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;Part of the score (the qualitative assessment) is judged by an AI reading your session content, so keep the following in mind.&lt;/p&gt;


&lt;ul&gt;

&lt;li&gt;

&lt;strong&gt;Scores are not reproducible:&lt;/strong&gt; even with the same data, scores vary slightly from run to run&lt;/li&gt;

&lt;li&gt;…&lt;/li&gt;

&lt;/ul&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/suruseas/ai-fluency" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I Deleted 66% of My AI Coding Guide — Here's What Survived</title>
      <dc:creator>yukihiro amadatsu</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:14:57 +0000</pubDate>
      <link>https://forem.com/suruseas/i-deleted-66-of-my-ai-coding-guide-heres-what-survived-55i6</link>
      <guid>https://forem.com/suruseas/i-deleted-66-of-my-ai-coding-guide-heres-what-survived-55i6</guid>
      <description>&lt;p&gt;I started with 252 lines of AI coding principles. After four rounds of review, 86 lines survived. But first — a question.&lt;/p&gt;

&lt;p&gt;Is your team measuring AI coding productivity by any of these?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lines of code generated&lt;/li&gt;
&lt;li&gt;Number of prompts per session&lt;/li&gt;
&lt;li&gt;Response speed&lt;/li&gt;
&lt;li&gt;Commit count&lt;/li&gt;
&lt;li&gt;Number of AI tools adopted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If so, you might be optimizing for the wrong things. Lines-of-code targets reward bloat. Prompt-count targets punish thinking. Speed targets skip verification. When these metrics become goals, quality pays the price.&lt;/p&gt;

&lt;p&gt;I wrote an 86-line document called &lt;strong&gt;The AI Coding Way&lt;/strong&gt; that explains why — and what to measure instead. It tries to capture what stays true about human-AI coding collaboration, regardless of which tool or model you use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Three principles&lt;/strong&gt;, in order of priority:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep things reversible&lt;/strong&gt; (prerequisite) — Linters, tests, CI, version control. Safety enables boldness. You can't ask AI to refactor a module if you can't undo it. (A minimal illustration follows this list.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make your intent explicit&lt;/strong&gt; (starting point) — Every context you give AI has three elements: purpose, constraints, and knowledge. Missing any one of them degrades output quality. Project-level intent (types, tests, naming conventions) compounds across every session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify the output&lt;/strong&gt; (non-negotiable) — The bottleneck has shifted from generation to verification. Code that was generated fast should be reviewed slow. The most expensive decision in AI coding: "it works, ship it."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
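&lt;p&gt;To ground the first principle in something concrete (the guide itself deliberately names no tools, so treat this as one illustration, not part of the document): a single gate script that CI or a local pre-commit hook runs before any AI-generated change is accepted. The &lt;code&gt;lint&lt;/code&gt;, &lt;code&gt;typecheck&lt;/code&gt;, and &lt;code&gt;test&lt;/code&gt; script names are assumptions; substitute whatever your project actually uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Minimal "prevention" gate: if AI-generated code doesn't meet the bar, it is rejected automatically.
# Assumes an npm project with lint / typecheck / test scripts; adapt to your own stack.
set -euo pipefail

npm run lint        # style and static analysis
npm run typecheck   # type checking
npm test            # test suite
echo "All gates passed."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;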

&lt;p&gt;&lt;strong&gt;One practice section&lt;/strong&gt; covering the collaboration cycle (instruct → generate → verify → improve) and two habits: ask questions instead of giving instructions, and turn repeated instructions into project-level conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One measurement section&lt;/strong&gt; listing six metrics that are tempting to track but dangerous if used as goals: lines generated, prompt count, response speed, commit count, session count, tool count. These aren't useless — but optimizing for them alone leads you away from what matters. Measure density instead: acceptance rate, rework frequency, final code quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's not in it
&lt;/h2&gt;

&lt;p&gt;No tool names. No model names. No programming languages. No prompt templates. No opinions on whether AI is your boss, your colleague, your subordinate, or your tool — that's your call. The three principles apply regardless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I wrote it
&lt;/h2&gt;

&lt;p&gt;There are plenty of AI coding guides out there. Anthropic, OpenAI, and countless blog posts tell you how to write better prompts. But most of them have a shelf life of about six months — the next model update makes half the advice obsolete.&lt;/p&gt;

&lt;p&gt;I wanted something different: a set of principles that hold true even as models get smarter, context windows get larger, and tools come and go.&lt;/p&gt;

&lt;p&gt;So I set a rule: &lt;strong&gt;don't write anything that a better model would invalidate.&lt;/strong&gt; "AI hallucinates" is a current fact, not a lasting principle. "AI output is probabilistic" is a lasting principle. "Context windows are small" will age poorly. "Humans are responsible for verifying output" won't.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I wrote it
&lt;/h2&gt;

&lt;p&gt;This document is itself a product of AI coding. AI agents debated the structure. AI generated the text. I made every decision on what to keep, what to cut, and how to frame it — exactly the cycle the document describes.&lt;/p&gt;

&lt;p&gt;It started at 252 lines. Four AI agents reviewed it. One of them said: &lt;em&gt;"A blog post would have been enough."&lt;/em&gt; That forced me to ask what actually survives if you strip everything away. For example, the standalone section "AI's Amplification Effect" was cut entirely — but its core insight ("AI amplifies both good and bad design") survived as two lines in the principles preamble. That's the kind of compression that happened across the board. The answer was 86 lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full document
&lt;/h2&gt;

&lt;p&gt;Here it is — the entire thing. 86 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI Coding Way
&lt;/h2&gt;

&lt;p&gt;Principles for AI Coding — v0.1, March 2026&lt;/p&gt;

&lt;p&gt;If you write code with AI and want to get better at it, these principles are for you. Being a good engineer is the best AI strategy.&lt;/p&gt;

&lt;p&gt;This document is meant to live in your project repository — not read once and forgotten, but referenced daily as shared understanding across your team. It will be revised based on real-world feedback.&lt;/p&gt;




&lt;h3&gt;
  
  
  Three Principles
&lt;/h3&gt;

&lt;p&gt;AI output is probabilistic. The same instruction can produce different code. AI has knowledge gaps and states incorrect things with confidence. And AI amplifies — good intent produces good code at scale; sloppy instructions produce plausible but fragile code at scale. Given these properties, humans bear three responsibilities: intent, context, and verification.&lt;/p&gt;

&lt;p&gt;These three principles are non-negotiable requirements for AI coding. The numbers indicate the order you should address them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Keep things reversible (prerequisite)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The foundation for everything else. Without this, nothing is safe to try.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prevention:&lt;/em&gt; Type checking, linters, test suites, CI. If AI-generated code doesn't meet the bar, it gets rejected automatically. The stronger your prevention, the bolder you can delegate.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Recovery:&lt;/em&gt; Version control, branching strategies, snapshots. The last line of defense when prevention fails.&lt;/p&gt;

&lt;p&gt;Safety mechanisms are not constraints. They are the foundation that enables bold delegation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Make your intent explicit (starting point)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generating and verifying without clear intent is running without a map.&lt;/p&gt;

&lt;p&gt;Context has three elements: &lt;strong&gt;purpose&lt;/strong&gt; (what you want to achieve), &lt;strong&gt;constraints&lt;/strong&gt; (what must not happen), and &lt;strong&gt;knowledge&lt;/strong&gt; (background information needed for decisions). Without purpose, AI wanders. Without constraints, you get unwanted output. Without knowledge, AI guesses.&lt;/p&gt;

&lt;p&gt;Intent operates at two levels. At the &lt;em&gt;task level&lt;/em&gt;, you communicate purpose, constraints, and knowledge in your instructions. At the &lt;em&gt;project level&lt;/em&gt;, type definitions, tests, naming conventions, and directory structure express intent — set these up once and they improve every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Verify the output (non-negotiable)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bottleneck has shifted from generation to verification. Generation takes seconds. Verification takes human time. That's why the quality of your verification determines the quality of your outcomes.&lt;/p&gt;

&lt;p&gt;The faster code was generated, the more carefully you should read it.&lt;/p&gt;

&lt;p&gt;A common failure: AI generates 200 lines that appear to work. Tests pass. But half is unnecessary abstraction from trying to be too clean, and the rest silently swallows errors and breaks on edge cases. "It works, merge it" is the most expensive decision in AI coding.&lt;/p&gt;




&lt;h3&gt;
  
  
  Practice
&lt;/h3&gt;

&lt;p&gt;AI coding follows this cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Communicate&lt;/strong&gt; — Pass purpose, constraints, and necessary knowledge to AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — AI produces code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; — Human judges the output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve&lt;/strong&gt; — Revise instructions and context based on results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skipping step 4 and jumping back to step 1 is the primary cause of spinning your wheels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask questions, not just instructions.&lt;/strong&gt; "Is there a problem with this design?" draws out more of AI's capability than "Write this function." When AI offers suggestions, dismissing them as "not what I asked for" is a missed opportunity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're writing the same instructions every time, turn them into conventions.&lt;/strong&gt; Put them in a rules file. Automate with hooks. What you systematize compounds across every future session.&lt;/p&gt;




&lt;h3&gt;
  
  
  What to Measure
&lt;/h3&gt;

&lt;p&gt;The only measure of progress is working software.&lt;/p&gt;

&lt;p&gt;The following metrics are tempting to track but dangerous when used as goals. They aren't useless as signals — but optimizing for them leads you away from what matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lines generated.&lt;/strong&gt; Measuring quantity incentivizes quantity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt count.&lt;/strong&gt; Many exchanges signal low instruction quality, not productivity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response speed.&lt;/strong&gt; Penalizes people who think before they instruct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit count.&lt;/strong&gt; Split commits and the number goes up. Measuring quantity invites inflation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session count.&lt;/strong&gt; More isn't better. Context loss from session breaks can reduce efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of tools used.&lt;/strong&gt; Using a tool and mastering a tool are different things.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common problem: measuring quantity and speed incentivizes quantity and speed at the expense of quality.&lt;/p&gt;

&lt;p&gt;Measure density instead. Acceptance rate. Rework frequency. Final code quality.&lt;/p&gt;




&lt;p&gt;This document is a product of AI coding. AI agents debated, AI generated text, and a human made the decisions, verified the output, and revised it.&lt;/p&gt;

&lt;p&gt;Whether you see AI as a subordinate, a collaborator, a supervisor, or a tool is up to you. The three principles apply regardless of what you expect from AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  v0.1
&lt;/h2&gt;

&lt;p&gt;This is v0.1, not a finished product. If you have feedback, I want to hear it. The only way this earns the right to be called a "guide" someday is through revision.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>codequality</category>
    </item>
    <item>
      <title>When AI Joins Your Team, Where Should You Focus Your Resources?</title>
      <dc:creator>yukihiro amadatsu</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:42:44 +0000</pubDate>
      <link>https://forem.com/suruseas/when-ai-joins-your-team-where-should-you-focus-your-resources-3p23</link>
      <guid>https://forem.com/suruseas/when-ai-joins-your-team-where-should-you-focus-your-resources-3p23</guid>
      <description>&lt;p&gt;I stopped thinking of AI as a tool.&lt;/p&gt;

&lt;p&gt;Copilot, Cursor, Claude Code — once I started treating these not as "convenient assistants" but as "talented teammates," the way I develop software changed entirely.&lt;/p&gt;

&lt;p&gt;And from that perspective, a simple question emerges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If an exceptional coder and an exceptional reviewer were already on your team, where should the team focus its resources?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Don't Treat AI as Special — Just as an Exceptionally Talented Human
&lt;/h2&gt;

&lt;p&gt;First, an important premise.&lt;/p&gt;

&lt;p&gt;AI is not special. Treat it the same as a highly skilled human engineer.&lt;/p&gt;

&lt;p&gt;AI can already write code, open pull requests, and leave review comments. The range of what it can do is expanding rapidly, and the justification for treating it differently "because it's AI" is fading. Think of it simply as having an exceptionally talented human engineer join the team.&lt;/p&gt;

&lt;p&gt;With that premise in place, the opening question takes on meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Answer Is: Focus on Review
&lt;/h2&gt;

&lt;p&gt;The conclusion first.&lt;/p&gt;

&lt;p&gt;The team should focus on the review side — the process of questioning answers.&lt;/p&gt;

&lt;p&gt;The reason is simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation (opening PRs) — let AI go all out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Writing code is the process of "producing an answer given a set of requirements." This is where AI excels, and there's little reason for humans to run alongside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review (questioning answers) — requires a perspective outside the generation context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the core insight. As long as you remain inside the same context in which the code was generated, you cannot question it from the outside. Conversely, even the same model can function as a reviewer if it operates in a fresh session, disconnected from the generation context. Being outside the context is precisely what makes it possible to question the answer. &lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;Anthropic's Code Review&lt;/a&gt;, discussed later, is built on this same principle. Whether AI or human — "being outside the generation context" is the condition for effective review.&lt;/p&gt;

&lt;p&gt;"Does this code work?" is not the question. "Was this the right implementation?" "Is this tradeoff really justified?" — raising these questions requires a perspective that differs from the one that produced the code. Focusing resources on review is what makes these questions possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Maybe Not a Coincidence — What Anthropic Is Doing
&lt;/h2&gt;

&lt;p&gt;On March 9, 2026, Anthropic released Code Review for Claude Code.&lt;/p&gt;

&lt;p&gt;The feature automatically dispatches multiple AI agents in parallel whenever a PR is opened, detects bugs, ranks them by severity, and feeds the results back into GitHub. Anthropic applies this system to nearly every PR internally, and the proportion of PRs receiving substantive review comments &lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;jumped from 16% to 54%&lt;/a&gt; as a result.&lt;/p&gt;

&lt;p&gt;What's worth noting is the design philosophy. It has been &lt;a href="https://techcrunch.com/2026/03/09/anthropic-launches-code-review-tool-to-check-flood-of-ai-generated-code/" rel="noopener noreferrer"&gt;reported&lt;/a&gt; that the tool &lt;strong&gt;focuses on logic errors&lt;/strong&gt;, not style or naming conventions. Could this reflect a deliberate choice to leave style correction to the generation side?&lt;/p&gt;

&lt;p&gt;This might be an embodiment of the idea: "rather than loading the generation side with detailed instructions, tighten the exit gate (review)."&lt;/p&gt;




&lt;h2&gt;
  
  
  Overloading CLAUDE.md Backfires
&lt;/h2&gt;

&lt;p&gt;Meanwhile, research is emerging that questions the approach of controlling generation through detailed instructions.&lt;/p&gt;

&lt;p&gt;A paper published on arXiv in February 2026, "Evaluating AGENTS.md" (ETH Zurich et al.), found that loading context files like CLAUDE.md or AGENTS.md with detailed instructions &lt;a href="https://arxiv.org/abs/2602.11988" rel="noopener noreferrer"&gt;&lt;strong&gt;reduces task success rates and increases inference costs by more than 20%&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AI tries to follow instructions, but in doing so generates unnecessary exploration and testing that gets in the way of the actual task. The paper's conclusion is simple: "Unnecessary requirements make tasks harder. Context files should contain only minimal requirements."&lt;/p&gt;

&lt;p&gt;Keeping instructions minimal — this direction may be empirically supported as well.&lt;/p&gt;




&lt;h2&gt;
  
  
  Skills Fall Into the Same Trap as CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Agent Skills has been gaining attention recently. It's an open format for giving agents procedural knowledge via SKILL.md files, enabling capability expansion.&lt;/p&gt;

&lt;p&gt;There are valid uses. Tasks like generating PPTX files or operating internal proprietary tools — adding capabilities that agents don't have out of the box. In that sense, it differs from CLAUDE.md.&lt;/p&gt;

&lt;p&gt;But misused, it falls into exactly the same trap.&lt;/p&gt;

&lt;p&gt;Consider writing a rule like "always include the time, not just the date, in Rails migration filenames" as a Skill. This is just injecting a corrective instruction as a workaround for unstable output — structurally no different from writing it in CLAUDE.md.&lt;/p&gt;

&lt;p&gt;A practical rule of thumb: &lt;strong&gt;if it can be replaced by a linter or automation tool, it shouldn't be a Skill.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill type&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cannot be replaced by linters etc. (capability expansion)&lt;/td&gt;
&lt;td&gt;Valid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be replaced by linters etc. (rule correction)&lt;/td&gt;
&lt;td&gt;Same trap as CLAUDE.md&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When you find yourself wanting to create a Skill to enforce a rule, first ask whether a linter or automation tool could handle it instead. If it can, it belongs at the exit gate — not injected into the generation side.&lt;/p&gt;
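&lt;p&gt;Taking the migration-filename example above: instead of a Skill, a few lines of shell in CI can enforce the rule at the exit gate. A sketch (Rails prefixes generated migrations with a full &lt;code&gt;YYYYMMDDHHMMSS&lt;/code&gt; timestamp by default, so the check simply rejects any file that lacks one):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Exit-gate check: every migration filename must start with a 14-digit timestamp
# (date *and* time), e.g. 20260312124244_add_index_to_users.rb
set -euo pipefail

bad=$(find db/migrate -name '*.rb' | grep -vE '/[0-9]{14}_[^/]+\.rb$' || true)
if [ -n "$bad" ]; then
  echo "Migrations missing a full timestamp prefix:"
  echo "$bad"
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;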




&lt;h2&gt;
  
  
  Scale Benefits Belong on the Review Side
&lt;/h2&gt;

&lt;p&gt;AI scales. That's true for generation too — but I believe scale truly pays off on the review side.&lt;/p&gt;

&lt;p&gt;On the generation side, there's a "lottery approach": keep asking AI to regenerate until you get a good result. More attempts eventually yield better output.&lt;/p&gt;

&lt;p&gt;That said, I personally accept about 90% of AI output as-is. I think it's because I use AI as a collaborative partner rather than giving it detailed instructions. If generation quality is already high enough, there's no need to rely on the lottery approach. If you're going to invest scale somewhere, review is the more rational choice.&lt;/p&gt;

&lt;p&gt;What happens when you invest scale in review? Multiple AIs independently question the same PR. Each raises questions from different angles, catching what others miss. Anyone who has worked in code review knows the feeling of "I want as many eyes on this as possible" — review works the same way. The more reviewers, the lower the probability of something slipping through.&lt;/p&gt;

&lt;p&gt;Anthropic itself has &lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;adopted a design for Code Review that runs multiple agents in parallel, then cross-checks to filter out false positives&lt;/a&gt;. Could this also be seen as an embodiment of the idea that scale belongs on the review side?&lt;/p&gt;

&lt;p&gt;The lottery approach in generation just accumulates cost. Scale in review has intrinsic value: it reduces what gets missed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human Eyes Still Have Unique Value
&lt;/h2&gt;

&lt;p&gt;Does this mean humans are unnecessary if AI handles review? I don't think so.&lt;/p&gt;

&lt;p&gt;Business judgment, implicit team context, "this is technically correct but is it acceptable for our organization?" — these are areas where humans currently have an edge over AI. That's why there's still good reason for humans to actively invest time in review.&lt;/p&gt;

&lt;p&gt;Scale AI in review, and have humans actively invest in review too. Increasing both kinds of "eyes" raises the quality of the team's answers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;th&gt;Who does it&lt;/th&gt;
&lt;th&gt;What scale means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation (opening PRs)&lt;/td&gt;
&lt;td&gt;Let AI go all out&lt;/td&gt;
&lt;td&gt;More output, but no gain in quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review (questioning answers)&lt;/td&gt;
&lt;td&gt;Invest both AI and humans&lt;/td&gt;
&lt;td&gt;Quality improves&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When AI joins the team, the team's job is not to "beat AI at generation." It's to question the answers AI produces — together.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Behind Anthropic's release of Code Review lies a real problem: AI is generating so much code that review has become a bottleneck. &lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;Code output per engineer at Anthropic has grown 200% over the past year&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This trend won't stop. Generation will keep increasing, and the importance of review will only grow.&lt;/p&gt;

&lt;p&gt;If you truly trust AI as a teammate, let them go all out producing answers. And as a team, concentrate your resources on questioning those answers.&lt;/p&gt;

&lt;p&gt;Concretely: adopting tools like Code Review, enforcing style conventions with linters and automation, and making sure humans have time for review — these three are a good place to start.&lt;/p&gt;

&lt;p&gt;That's my conclusion for now.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;Code Review for Claude Code (Anthropic, 2026/03/09)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2602.11988" rel="noopener noreferrer"&gt;Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (arXiv:2602.11988)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>📻 I Made Claude Code Instances Talk to Each Other in Real Time</title>
      <dc:creator>yukihiro amadatsu</dc:creator>
      <pubDate>Fri, 27 Feb 2026 17:58:16 +0000</pubDate>
      <link>https://forem.com/suruseas/i-made-claude-code-instances-talk-to-each-other-in-real-time-2kal</link>
      <guid>https://forem.com/suruseas/i-made-claude-code-instances-talk-to-each-other-in-real-time-2kal</guid>
      <description>&lt;p&gt;What if your AI coding assistants could collaborate — not through files or git, but by actually &lt;em&gt;talking&lt;/em&gt; to each other?&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;Walkie-Talkie&lt;/strong&gt;, a real-time messaging system that lets multiple Claude Code instances communicate with each other. And now it's available as a plugin you can install in seconds.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/ZQRrYtqT4kg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Why Would You Want This?
&lt;/h2&gt;

&lt;p&gt;Think of it as &lt;strong&gt;Slack for Claude Code instances&lt;/strong&gt;. Each terminal is a participant in a group chat. Anyone can lead, anyone can follow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agents collaborating on code&lt;/strong&gt; — you don't pre-assign roles. Just like messaging a coworker on Slack, you'd say "hey, can you review this?" in the conversation. Roles emerge naturally. And there's no limit on the number of participants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hands-off or hands-on — your choice.&lt;/strong&gt; Let agents work things out among themselves, or jump in anytime from the dashboard to steer the conversation, give new instructions, or correct course. You're not locked into either mode — you can switch between observer and director mid-conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Play a TRPG&lt;/strong&gt; — yes, seriously. Claude Code instances can play Call of Cthulhu with each other. One runs the scenario, the others roleplay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Claude Code from anywhere, anytime.&lt;/strong&gt; The Hub is just an HTTP server. That means a cron job, a CI pipeline, or any script can send a message to a connected agent — and the agent will execute it. Until now, scheduling Claude Code tasks (like nightly code reviews or periodic cleanups) required the API. With Walkie-Talkie, you can do it with a single &lt;code&gt;curl&lt;/code&gt; command, no API key needed, while the agent maintains its full context window (a sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
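&lt;p&gt;As a sketch of that last point: because the Hub speaks plain HTTP, anything that can run &lt;code&gt;curl&lt;/code&gt; can drop a message into the conversation. The URL, route, and payload below are illustrative assumptions only; the Hub's real API is documented in the repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only: the route and payload shape are assumptions, not the Hub's actual API.
# Point HUB_URL at wherever your local Hub is listening.
HUB_URL="http://localhost:3000"

curl -s -X POST "$HUB_URL/messages" \
  -H "Content-Type: application/json" \
  -d '{"from":"operator","to":"alice","text":"Run the nightly cleanup and report back"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;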

&lt;p&gt;The possibilities are endless. Each terminal maintains its own context window, so conversations can go deep. And because this runs entirely through Claude Code's built-in infrastructure — &lt;strong&gt;no separate API calls&lt;/strong&gt; — it works within your existing Pro or Max plan. No extra cost.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. It works today.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 How It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code A ──stdio──&amp;gt; MCP Server ──HTTP──&amp;gt; Hub ──HTTP──&amp;gt; MCP Server ──stdio──&amp;gt; Claude Code B
                                               │
                                          Dashboard
                                        (ON-AIR screen)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system has three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hub&lt;/strong&gt; — A central server that routes messages between agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server&lt;/strong&gt; — Connects each Claude Code instance to the Hub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; — A browser-based control panel where you can watch conversations, send instructions, and manage agents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each Claude Code instance joins with a name (like a callsign), then enters an autonomous conversation loop — listening for messages, responding, and listening again. Just like a real walkie-talkie.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎥 See It In Action
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two Agents Chatting
&lt;/h3&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/OFUtrS0qsRU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The agents don't just reply once — they keep the conversation going autonomously. No human intervention needed. They listen, they talk back, they keep listening.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dashboard
&lt;/h3&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/nq08mf2hk_s"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The ON-AIR dashboard gives you a bird's-eye view of everything happening. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watch all messages in real time&lt;/li&gt;
&lt;li&gt;Send instructions to any agent as the &lt;strong&gt;operator&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Kick individual agents or stop everyone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you send an instruction as operator, the agent actually &lt;em&gt;executes&lt;/em&gt; it — runs commands, reads files, writes code — then reports back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operator Mode: Distributing Tasks
&lt;/h3&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/-wfJGPiOMFw"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Tell one agent to write FizzBuzz in Ruby, then ask another to review it. They'll discuss, refactor, and improve the code — just like real teammates on Slack. You kicked it off, but they take it from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Getting Started
&lt;/h2&gt;

&lt;p&gt;Walkie-Talkie is a Claude Code plugin. No manual MCP configuration needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and build the Hub&lt;/span&gt;
git clone https://github.com/suruseas/walkie-talkie.git
&lt;span class="nb"&gt;cd &lt;/span&gt;walkie-talkie
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run build

&lt;span class="c"&gt;# 2. Set the Join token (add this to your ~/.zshrc)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WALKIE_TALKIE_JOIN_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-secret-value-here
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc

&lt;span class="c"&gt;# 3. Start the Hub&lt;/span&gt;
npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 4. In Claude Code, install the plugin
/plugin marketplace add suruseas/walkie-talkie
/plugin install walkie-talkie@suruseas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Claude Code, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/walkie-talkie alice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open another Claude Code session, join as a different name, and they'll start chatting. See the &lt;a href="https://github.com/suruseas/walkie-talkie" rel="noopener noreferrer"&gt;README&lt;/a&gt; for full setup details.&lt;/p&gt;

&lt;p&gt;To stop agents, click "Stop All" on the ON-AIR dashboard, or press &lt;code&gt;Escape&lt;/code&gt; (or &lt;code&gt;Ctrl+C&lt;/code&gt;) in the individual terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 The Technical Bits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Autonomous Conversation Loop
&lt;/h3&gt;

&lt;p&gt;The magic is in the SKILL.md file that drives agent behavior. It instructs Claude Code to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Join the hub&lt;/li&gt;
&lt;li&gt;Wait for messages (long poll)&lt;/li&gt;
&lt;li&gt;Reply immediately&lt;/li&gt;
&lt;li&gt;Go back to waiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never ask the user what to do&lt;/strong&gt; — just keep the loop going&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates truly autonomous agents that maintain conversations without human intervention.&lt;/p&gt;

&lt;p&gt;There's another crucial behavior in the SKILL.md: &lt;strong&gt;when an agent receives a message from &lt;code&gt;operator&lt;/code&gt;, it treats it as a task and executes it using Claude Code's full toolset&lt;/strong&gt; — Bash, file read/write, everything. This is what makes the dashboard so powerful, but also what makes this system dangerous — it can run commands on your computer. Any message from the operator is treated as an instruction to execute, not just a chat message.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long Polling for Real-Time Feel
&lt;/h3&gt;

&lt;p&gt;Instead of WebSockets, I used HTTP long polling. The MCP server holds a connection open for up to an hour, waiting for messages. This gives a real-time feel while staying compatible with Claude Code's MCP stdio transport.&lt;/p&gt;
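&lt;p&gt;A long poll is just an ordinary HTTP request that the server declines to answer until it has something to say (or the timeout elapses), after which the client immediately asks again. A minimal client-side sketch in shell; the route is an assumption, not the MCP server's real endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative long-poll loop: each request blocks until a message arrives or the
# hour-long timeout expires, then we immediately poll again. The route is an assumption.
HUB_URL="http://localhost:3000"

while true; do
  curl -s --max-time 3600 "$HUB_URL/poll?name=alice" || true
  echo    # newline between responses
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;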

&lt;h2&gt;
  
  
  ⚠️ Important Safety Warning
&lt;/h2&gt;

&lt;p&gt;I need to be very clear about this: &lt;strong&gt;Walkie-Talkie is powerful, and power demands caution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NEVER expose the Hub server to the internet.&lt;/strong&gt; The Hub should only run on &lt;code&gt;localhost&lt;/code&gt;. If you make it accessible from outside your machine, anyone who discovers it could connect agents to your Hub and potentially execute commands on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not a theoretical risk.&lt;/strong&gt; The SKILL.md explicitly instructs agents to execute operator messages as tasks using Claude Code's full toolset — Bash commands, file operations, anything. If a malicious actor gains access to your Hub, they can run arbitrary commands on every connected agent's machine. This is by design for local use, but catastrophic if exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are fully responsible for how you use this tool.&lt;/strong&gt; I built this as an experiment and share it as-is. I cannot and do not take responsibility for any damage, data loss, or security incidents that may result from using Walkie-Talkie. By using it, you accept this risk.&lt;/p&gt;

&lt;p&gt;Some ground rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run the Hub on localhost only&lt;/strong&gt; — never bind it to &lt;code&gt;0.0.0.0&lt;/code&gt; or expose it through a reverse proxy (a quick check is shown after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep your Join token secret&lt;/strong&gt; — anyone with the token can connect agents to your Hub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't leave agents running unattended&lt;/strong&gt; — autonomous agents with tool access can do unexpected things&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review the code yourself&lt;/strong&gt; — this is open source for a reason. Understand what you're running.&lt;/li&gt;
&lt;/ul&gt;
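&lt;p&gt;A quick way to verify the first rule (assuming the Hub runs under Node, since it starts with &lt;code&gt;npm start&lt;/code&gt;): list listening TCP sockets and confirm the Hub's port is bound to the loopback address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The Hub's line should show 127.0.0.1:&lt;port&gt; or [::1]:&lt;port&gt;,
# never 0.0.0.0:&lt;port&gt; or *:&lt;port&gt;.
lsof -nP -iTCP -sTCP:LISTEN | grep -i node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;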

&lt;p&gt;With that said — if you use it responsibly on your local machine, it's an incredibly fun and useful tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Try It
&lt;/h2&gt;

&lt;p&gt;The project is open source: &lt;a href="https://github.com/suruseas/walkie-talkie" rel="noopener noreferrer"&gt;github.com/suruseas/walkie-talkie&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install the plugin and let your Claude Code instances start talking. I'd love to hear what workflows you come up with.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions or ideas? Reach out on GitHub or DEV — but I'm running a full marathon on March 1st (JST), so responses may be slow around that time. If I go silent for too long... something may have happened out there on the course. 🏃&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>plugin</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
