<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: 김이더</title>
    <description>The latest articles on Forem by 김이더 (@_53fb7c03dd741a6124e4e).</description>
    <link>https://forem.com/_53fb7c03dd741a6124e4e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3805054%2F1dd90bc3-aa4f-4db2-901c-4b52816158cb.png</url>
      <title>Forem: 김이더</title>
      <link>https://forem.com/_53fb7c03dd741a6124e4e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/_53fb7c03dd741a6124e4e"/>
    <language>en</language>
    <item>
      <title>GPT-5.5 Is Out — What the Numbers Actually Say</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:54:00 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/gpt-55-is-out-what-the-numbers-actually-say-42l</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/gpt-55-is-out-what-the-numbers-actually-say-42l</guid>
      <description>&lt;p&gt;More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Yesterday (April 23, 2026) OpenAI released GPT-5.5. Codename "Spud."&lt;/p&gt;

&lt;p&gt;The surprising part isn't the model itself. It's that GPT-5.4 shipped only six weeks ago.&lt;/p&gt;

&lt;p&gt;OpenAI's Chief Scientist Jakub Pachocki said during the briefing that the last two years have actually been slow. That one line is the real context for this release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six Weeks, and "Spud"
&lt;/h2&gt;

&lt;p&gt;GPT-5.4 came out six weeks ago. The release before that was in December. Before that, November.&lt;/p&gt;

&lt;p&gt;The era when model releases were quarterly events is over. They're weekly-to-monthly events now.&lt;/p&gt;

&lt;p&gt;The reason this pace is possible is simple. AI is accelerating AI development. According to OpenAI, Codex has 4 million weekly active users and ChatGPT has 9 million paying business users. Real usage feedback at that scale flows straight back into the next training cycle.&lt;/p&gt;

&lt;p&gt;Look at Pachocki's statement again.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The last two years have been surprisingly slow."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's not saying the present is slow. He's declaring that the future will be faster. GPT-5.5 arrived in six weeks and even that, he's saying, was slow.&lt;/p&gt;

&lt;p&gt;Greg Brockman described it in the same briefing as "a new class of intelligence" and "a big step toward agentic and intuitive computing." Strip the marketing and one thing remains: the model refresh cycle is now shorter than most product planning cycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmarks, As Published
&lt;/h2&gt;

&lt;p&gt;Here are the numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0&lt;/strong&gt; — complex command-line workflows requiring planning, tool use, and iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5         82.7%
GPT-5.4         75.1%
Claude Opus 4.7 69.4%
Gemini 3.1 Pro  68.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OSWorld-Verified&lt;/strong&gt; — how well the model operates a computer autonomously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5         78.7%
Claude Opus 4.7 78.0%
GPT-5.4         75.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; — resolving real GitHub issues in a single pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5   58.6%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Terminal-Bench, GPT-5.5 leads Opus 4.7 by 13.3 points. That's a big jump. But on OSWorld, the gap is 0.7 points. Dominant on one axis, barely ahead on another.&lt;/p&gt;

&lt;p&gt;Not "crushing it" — just leading. And the era of ranking models by a single benchmark is already behind us. Computer use has been an area Anthropic invested in heavily, and the more honest reading is that OpenAI just about caught up rather than blew past.&lt;/p&gt;

&lt;p&gt;Also: benchmarks are marketing material. OpenAI picked the numbers favorable to them. Real-world feel is something each team has to verify on their own workloads.&lt;/p&gt;
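&lt;p&gt;"Verify on your own workloads" can be made concrete with a very small harness. Everything below is a hedged sketch: the task set, the checkers, and &lt;code&gt;stub_model&lt;/code&gt; are hypothetical placeholders, not a real benchmark or vendor API.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical task set: (prompt, checker) pairs drawn from your own workload.
# These two tasks and their checkers are illustrative, not a real benchmark.
TASKS = [
    ("Write a function that reverses a string.", lambda out: "def" in out),
    ("Which HTTP status code means Not Found?", lambda out: "404" in out),
]

def pass_rate(model_call: Callable[[str], str]) -> float:
    """Run every task through the model and score it with its checker."""
    passed = sum(1 for prompt, check in TASKS if check(model_call(prompt)))
    return passed / len(TASKS)

# Stub standing in for a real model call, so the harness can be wired and tested.
def stub_model(prompt: str) -> str:
    return "def reverse(s): return s[::-1]  # not found: 404"

print(f"pass rate: {pass_rate(stub_model):.0%}")  # prints "pass rate: 100%"
```

&lt;p&gt;Swap &lt;code&gt;stub_model&lt;/code&gt; for a real client call and the same loop gives you a number you actually trust more than a vendor slide.&lt;/p&gt;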

&lt;h2&gt;
  
  
  1M Context and Weird Token Economics
&lt;/h2&gt;

&lt;p&gt;The pricing is interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5        $5 / $30   per 1M tokens (input / output)
GPT-5.5 Pro    $30 / $180 per 1M tokens (input / output)
Context window 1M
Batch / Flex   half the standard rate
Priority       2.5x the standard rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's more expensive than GPT-5.4. But OpenAI claims it does the same work with fewer tokens. Their own post states that GPT-5.5 matches GPT-5.4 per-token latency in production serving.&lt;/p&gt;

&lt;p&gt;Translation: the unit price went up, but token consumption goes down enough that the final bill could be similar or lower. What actually hits your wallet depends on your workload. Long-running agent tasks with lots of reasoning might come out ahead. Apps with tons of short one-shot calls might just get more expensive.&lt;/p&gt;
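&lt;p&gt;A quick way to sanity-check the "higher unit price, lower bill" claim is to run both workloads through the same cost formula. The GPT-5.5 prices ($5/$30) are from the table above; the GPT-5.4 prices and all token counts below are made-up assumptions for illustration only.&lt;/p&gt;

```python
def bill(inp_tokens: int, out_tokens: int, inp_price: float, out_price: float) -> float:
    """Dollar cost of one workload at per-1M-token prices."""
    return inp_tokens / 1e6 * inp_price + out_tokens / 1e6 * out_price

# GPT-5.5 prices ($5/$30) are from the article's table. The GPT-5.4
# prices ($3/$15) and every token count here are illustrative assumptions.
old_bill = bill(2_000_000, 600_000, 3, 15)   # older model, more tokens burned
new_bill = bill(1_000_000, 250_000, 5, 30)   # pricier model, fewer tokens
print(old_bill, new_bill)  # 15.0 12.5: the pricier model comes out cheaper
```

&lt;p&gt;Flip the token counts toward short one-shot calls, where consumption can't shrink much, and the same formula shows the bill simply going up.&lt;/p&gt;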

&lt;p&gt;And 1M context. OpenAI caught up to territory Anthropic entered earlier. Long document analysis, full-repo understanding, long-running agent sessions — there are real workloads where 1M matters.&lt;/p&gt;

&lt;p&gt;Worth noting is the GPT-5.5 Pro pricing. $30 input, $180 output. That's not priced for hobby developers. It's squarely an enterprise workload tier — agents running all day, complex research pipelines, nothing else makes sense at those rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mythos, Code Red, and the Shape of Competition
&lt;/h2&gt;

&lt;p&gt;The most telling line in the Axios report is this.&lt;/p&gt;

&lt;p&gt;Internally at OpenAI, Anthropic's rise was reportedly treated as a "code red" moment, and that moment drove a pivot toward enterprise customers.&lt;/p&gt;

&lt;p&gt;In the GPT-5.5 briefing OpenAI explicitly referenced Anthropic's Mythos. Mythos is Anthropic's latest frontier model, announced earlier this month with a deliberately limited rollout due to its cybersecurity capabilities. OpenAI's reason for bringing it up is clear: the signal they want to send is "we have Mythos-class cyber capability too."&lt;/p&gt;

&lt;p&gt;The frontier model race right now isn't tech versus tech. It's enterprise budget versus enterprise budget. Fortune's piece quotes the CIO of Bank of New York, where they're running Anthropic and OpenAI side by side across 220+ AI use cases. Customers like that are the ones actually moving the market.&lt;/p&gt;

&lt;p&gt;This is also the real reason models ship every six weeks. It's not technical necessity. It's that your competitor can ship every six weeks. The moment you slow down, enterprise contracts start sliding over.&lt;/p&gt;

&lt;p&gt;The interesting part is that this competitive dynamic is a win for users. A better model every six weeks, with pricing pressure arriving alongside it. Just having multiple frontier labs active keeps the whole field healthier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Left Behind the Numbers
&lt;/h2&gt;

&lt;p&gt;So what do you actually do with this?&lt;/p&gt;

&lt;p&gt;Building your stack around a single model is an increasingly bad bet. There's a very high probability that a better model ships in six weeks. Might be OpenAI. Might be Anthropic. Might be Google. You can't predict which one.&lt;/p&gt;

&lt;p&gt;The investment goes one layer up. Harness design, multi-agent orchestration, tool chains, evaluation pipelines, context engineering. These layers survive model swaps. Better yet, they get better as models get better.&lt;/p&gt;
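&lt;p&gt;A minimal sketch of what "one layer up" can look like: workflow code talks only to a structural interface, and each provider gets a thin adapter behind it. The &lt;code&gt;ChatModel&lt;/code&gt; protocol and &lt;code&gt;EchoModel&lt;/code&gt; here are hypothetical names; a real adapter would wrap a vendor SDK.&lt;/p&gt;

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface workflow code is allowed to touch."""
    def complete(self, prompt: str) -> str: ...

class EchoModel:
    """Stand-in provider for wiring tests; a real adapter would wrap an SDK."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(model: ChatModel, text: str) -> str:
    # Workflow logic depends only on the protocol, so swapping providers
    # is a change at the call site, not a rewrite of the pipeline.
    return model.complete(f"Summarize: {text}")

print(summarize(EchoModel(), "GPT-5.5 release notes"))
```

&lt;p&gt;When the next six-week release lands, only the adapter changes; the orchestration, tooling, and eval layers above it keep compounding.&lt;/p&gt;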

&lt;p&gt;Releases like GPT-5.5 are no longer news — they're environment. Infrastructure that updates on a schedule. Building your workflow on that assumption is the realistic stance for 2026.&lt;/p&gt;

&lt;p&gt;The people who don't get emotionally tossed around by a 1-2 point benchmark swing are the ones who go the distance. If Terminal-Bench 82.7% becomes 85% in a few months, your workflow design mostly still applies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Models get replaced. Workflows compound."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>gpt</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>GPT-5.5가 공개됐다, 숫자로 뜯어보면</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:53:28 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/gpt-55ga-gonggaedwaessda-susjaro-ddeudeobomyeon-7o7</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/gpt-55ga-gonggaedwaessda-susjaro-ddeudeobomyeon-7o7</guid>
      <description>&lt;p&gt;더 많은 글은 &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;에서.&lt;/p&gt;




&lt;p&gt;어제(2026/4/23) OpenAI가 GPT-5.5를 공개했다. 코드네임 "Spud".&lt;/p&gt;

&lt;p&gt;놀라운 건 모델 자체가 아니다. GPT-5.4가 나온 게 6주 전이다.&lt;/p&gt;

&lt;p&gt;OpenAI 수석과학자 Jakub Pachocki는 브리핑에서 "지난 2년이 오히려 느렸다"고 말했다. 이 한 문장이 이 릴리스의 진짜 맥락이다.&lt;/p&gt;

&lt;h2&gt;
  
  
  6주, 그리고 "Spud"
&lt;/h2&gt;

&lt;p&gt;GPT-5.4는 6주 전에 나왔다. 그 전 릴리스는 12월, 그 전은 11월.&lt;/p&gt;

&lt;p&gt;모델이 분기 단위 이벤트였던 시대는 지나갔다. 지금은 주 단위, 길어야 한 달 단위 이벤트다.&lt;/p&gt;

&lt;p&gt;이 속도가 가능한 이유는 단순하다. AI가 AI 개발을 가속하고 있다. OpenAI 발표에 따르면 Codex 주간 사용자가 4백만, ChatGPT 유료 업무 사용자가 9백만이다. 이 규모의 실사용 피드백이 바로 다음 학습 사이클로 돌아간다.&lt;/p&gt;

&lt;p&gt;Pachocki의 발언을 다시 보자.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"지난 2년이 오히려 느렸다."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;이건 지금이 느리다는 말이 아니다. 앞으로는 더 빨라질 거라는 선언이다. GPT-5.5도 6주 만에 나왔는데, 이것조차 느렸다는 말이다.&lt;/p&gt;

&lt;p&gt;Greg Brockman은 같은 브리핑에서 "새로운 종류의 지능이고, 에이전틱하고 직관적인 컴퓨팅으로 가는 큰 한 걸음"이라고 표현했다. 마케팅 수사를 걷어내면 남는 건 하나다. 모델 교체 주기가 제품 기획 주기보다 짧아지고 있다.&lt;/p&gt;

&lt;h2&gt;
  
  
  벤치마크 숫자, 있는 그대로
&lt;/h2&gt;

&lt;p&gt;수치부터 정리해보자.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0&lt;/strong&gt; — 복잡한 커맨드라인 워크플로우(계획 → 도구 사용 → 반복) 평가:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5         82.7%
GPT-5.4         75.1%
Claude Opus 4.7 69.4%
Gemini 3.1 Pro  68.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OSWorld-Verified&lt;/strong&gt; — 모델이 컴퓨터를 독립적으로 조작하는 능력 평가:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5         78.7%
Claude Opus 4.7 78.0%
GPT-5.4         75.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; — 실제 GitHub 이슈를 단일 시도로 해결:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5   58.6%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terminal-Bench에서 Opus 4.7 대비 +13.3%p 차이. 큰 점프다. 그런데 OSWorld에서는 Opus 4.7과 0.7%p 차이. 어떤 축에서는 크게 앞서고, 어떤 축에서는 턱걸이다.&lt;/p&gt;

&lt;p&gt;"압도"가 아니라 "리드"다. 그리고 벤치마크 하나로 모델을 줄 세우는 시대는 이미 지났다. 컴퓨터 조작 능력은 Anthropic이 꾸준히 투자해온 영역이고, 그 격차를 OpenAI가 이번에 이번에 거의 따라붙었다 — 정도의 해석이 오히려 더 정확하다.&lt;/p&gt;

&lt;p&gt;그리고 벤치마크는 마케팅 자료다. OpenAI가 자기에게 유리한 지표를 골라서 내놓는다. 실제 워크플로우에서 체감은 각자 검증해야 한다.&lt;/p&gt;

&lt;h2&gt;
  
  
  1M 컨텍스트와 이상한 토큰 경제
&lt;/h2&gt;

&lt;p&gt;API 가격표가 재밌다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-5.5        $5 / $30   per 1M tokens (input / output)
GPT-5.5 Pro    $30 / $180 per 1M tokens (input / output)
Context window 1M
Batch / Flex   정가의 절반
Priority       정가의 2.5배
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT-5.4보다 비싸다. 그런데 OpenAI는 "같은 일을 더 적은 토큰으로 끝낸다"고 주장한다. 실제 자사 블로그에는 "GPT-5.5가 실제 서빙에서 GPT-5.4와 같은 토큰당 지연시간을 유지한다"는 문장이 있다.&lt;/p&gt;

&lt;p&gt;무슨 말이냐. 단가는 올랐지만 토큰 소비량이 줄어서 결과적으로 청구서가 비슷하거나 더 낮을 수 있다는 주장이다. 실제 지갑에 뭐가 찍힐지는 워크로드마다 다르다. 추론이 긴 에이전트 태스크에서는 유리할 수 있고, 짧은 단답형 콜이 많은 앱에서는 그냥 비싸질 수도 있다.&lt;/p&gt;

&lt;p&gt;그리고 1M 컨텍스트. Anthropic이 먼저 간 구간을 OpenAI도 따라왔다. 긴 문서 분석, 큰 레포지토리 이해, 롱런 에이전트 세션 — 1M이 의미 있는 워크로드는 분명히 존재한다.&lt;/p&gt;

&lt;p&gt;주목할 건 GPT-5.5 Pro 가격이다. 입력 $30, 출력 $180. 이건 일반 개발자용이 아니다. 명백히 엔터프라이즈 워크로드를 위한 가격이다. 에이전트가 하루 종일 돌아가는 케이스, 복잡한 연구 워크플로우 — 이런 데만 의미 있는 티어다.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mythos, code red, 그리고 경쟁의 얼굴
&lt;/h2&gt;

&lt;p&gt;Axios 리포트에서 가장 시사적인 한 문장은 이거다.&lt;/p&gt;

&lt;p&gt;OpenAI 내부에서 Anthropic의 부상이 "code red" 수준으로 인식됐고, 이게 엔터프라이즈 고객 전략을 선회시킨 계기라는 보도.&lt;/p&gt;

&lt;p&gt;GPT-5.5 브리핑에서 OpenAI는 Anthropic의 Mythos를 명시적으로 언급했다. Mythos는 Anthropic이 이달 초 발표한 최신 모델인데, 사이버보안 역량 때문에 출시 범위가 제한된 상태다. OpenAI가 이걸 언급하는 이유는 분명하다. "우리도 Mythos급 사이버 역량이 있다"는 신호를 보내는 거다.&lt;/p&gt;

&lt;p&gt;지금 프론티어 모델 경쟁은 기술 대 기술이 아니다. 엔터프라이즈 예산 대 엔터프라이즈 예산이다. Fortune에 실린 Bank of New York CIO 코멘트를 보면 감이 온다. 그 은행은 Anthropic과 OpenAI를 병행 테스트하고 있고, 220+ AI 유스케이스를 돌리고 있다. 이런 고객이 실제 판을 흔든다.&lt;/p&gt;

&lt;p&gt;6주마다 모델이 나오는 진짜 이유도 여기에 있다. 기술적으로 필요해서가 아니라, 상대가 6주마다 낼 수 있으니까. 한쪽이 멈추는 순간 엔터프라이즈 계약이 이동한다.&lt;/p&gt;

&lt;p&gt;재밌는 건 이 경쟁 구도 자체가 사용자에게는 호재라는 점이다. 6주마다 더 좋은 모델이 나오고, 가격 압력도 같이 들어온다. 프론티어 랩이 여러 개 있다는 사실만으로도 판이 건강해진다.&lt;/p&gt;

&lt;h2&gt;
  
  
  숫자 뒤에 남는 질문
&lt;/h2&gt;

&lt;p&gt;그래서 뭘 해야 하나.&lt;/p&gt;

&lt;p&gt;모델 하나에 스택을 맞추는 건 점점 손해 보는 선택이다. 6주 뒤에 더 좋은 모델이 나올 확률이 매우 높기 때문이다. OpenAI가 낼 수도, Anthropic이 낼 수도, Google이 낼 수도 있다. 누가 낼지 미리 알 수 없다.&lt;/p&gt;

&lt;p&gt;투자 포인트는 그 위 계층이다. 하네스, 멀티 에이전트 오케스트레이션, 툴 체인, 평가 파이프라인, 컨텍스트 엔지니어링. 이 계층은 모델이 바뀌어도 유지된다. 오히려 모델이 좋아질수록 이 계층이 더 잘 돌아간다.&lt;/p&gt;

&lt;p&gt;GPT-5.5 같은 릴리스는 이제 뉴스라기보다 환경이다. 주기적으로 업데이트되는 인프라. 그걸 전제로 워크플로우를 짜는 게 2026년의 현실적인 접근이다.&lt;/p&gt;

&lt;p&gt;벤치마크 1~2%p에 감정적으로 흔들리지 않는 쪽이 길게 간다. Terminal-Bench 82.7%가 몇 달 뒤에 85%로 바뀌어도, 워크플로우 설계는 대부분 그대로 쓸 수 있다.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"모델은 교체된다. 워크플로우는 축적된다."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>gpt</category>
    </item>
    <item>
      <title>The 12 Hours Claude Code Disappeared from Pro</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Thu, 23 Apr 2026 10:14:22 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/the-12-hours-claude-code-disappeared-from-pro-51jf</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/the-12-hours-claude-code-disappeared-from-pro-51jf</guid>
      <description>&lt;p&gt;The original issue is on &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;GitHub #42796&lt;/a&gt;, and coverage is at &lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;On the afternoon of April 21, 2026, Anthropic quietly removed Claude Code from the Pro plan.&lt;/p&gt;

&lt;p&gt;The "includes Claude Code" line disappeared from the pricing page. The support docs that read "Using Claude Code with your Pro or Max plan" became "Using Claude Code with your Max plan." A few hours later, everything was rolled back. But this wasn't a one-off mistake.&lt;/p&gt;

&lt;p&gt;The adaptive thinking rollout in February. The lowered default effort level in March. The AMD director's public analysis on April 2. The Pro removal experiment on April 21. These four points look like separate incidents, but they connect into one line.&lt;/p&gt;

&lt;p&gt;Anthropic can't absorb the economics of the long-running agent era, and it's trying several things at once to cope.&lt;/p&gt;

&lt;p&gt;Here's how those attempts connect, and how to read this signal if you're someone introducing AI coding tools into a team.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Exactly Happened on April 21
&lt;/h2&gt;

&lt;p&gt;Ed Zitron caught it first. On Anthropic's pricing page, the Pro tier's "includes Claude Code" checkmark had turned into a red X. The support docs were updated too. The Claude Code product page still mentioned Pro, but the core billing page had clearly pivoted to "Max only."&lt;/p&gt;

&lt;p&gt;When The Register started reporting, Anthropic's Head of Growth Amol Avasare posted an explanation on X. "We're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected."&lt;/p&gt;

&lt;p&gt;The part that came after is more interesting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When we launched Max a year ago, it didn't include Claude Code, Cowork didn't exist, and agents that run for hours weren't a thing. Max was designed for heavy chat usage, that's it."&lt;/p&gt;

&lt;p&gt;"Since then, we bundled Claude Code into Max and it took off after Opus 4. Cowork landed. Long-running async agents are now everyday workflows. The way people actually use a Claude subscription has changed fundamentally."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Put "it's a 2% test" next to this and it stops sounding like a small experiment. It sounds closer to an admission that the current plan structure can't carry current usage patterns.&lt;/p&gt;

&lt;p&gt;And two weeks before this, another event had already been building the same case.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two Weeks Earlier: An AMD Director Showed Up With Telemetry
&lt;/h2&gt;

&lt;p&gt;On April 2, Stella Laurenzo, director of AMD's AI group, filed issue #42796 against the Claude Code repo. The title: "Claude Code is unusable for complex engineering tasks with the Feb updates."&lt;/p&gt;

&lt;p&gt;Laurenzo didn't write a vibes post about Claude feeling dumber. She brought a quantitative analysis of her team's past three months: 6,852 Claude Code sessions, 234,760 tool calls, 17,871 thinking blocks. The former Google OpenXLA lead and current AI head at a $200B+ semiconductor company doesn't file a public GitHub issue on a hunch.&lt;/p&gt;

&lt;p&gt;Three numbers matter. Thinking depth dropped 67% on average starting late February. The reads-per-edit ratio — how many files the model reads before editing one — fell from 6.6 to 2.0, a 70% reduction. And a stop-hook script built to catch "dodging responsibility, premature stopping, and permission-seeking" behavior never fired before March 8, then fired 173 times in the 17 days after.&lt;/p&gt;

&lt;p&gt;Laurenzo's own words sharpen the picture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When thinking is shallow, the model defaults to the cheapest action available: edit without reading, stop without finishing, dodge responsibility for failures, take the simplest fix rather than the correct one. These are exactly the symptoms observed."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From a game programmer's angle, this data hurts in a specific way. A reads-per-edit of 6.6 is the signature of a workflow that goes "read headers, trace dependencies, grep for usages, read tests, then modify." On a complex codebase — imagine UE5 C++ with its web of headers, cpp files, USTRUCTs, and TMaps — having that number drop to 2.0 effectively means "patch and pray."&lt;/p&gt;

&lt;p&gt;Laurenzo's team eventually moved to another provider. The line she left behind is the one that matters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Six months ago, Claude was unique in its reasoning quality and execution capabilities. Now, other competitors need to be very seriously considered and evaluated."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Anthropic's Rebuttal, and a Half-Admission
&lt;/h2&gt;

&lt;p&gt;Boris Cherny from the Claude Code team showed up in the issue. His response mixes pushback and concession, and both halves are worth separating.&lt;/p&gt;

&lt;p&gt;The pushback: the redact-thinking header shipped in March is a UI-only change. The actual reasoning still happens under the hood. It doesn't affect the thinking budget or the underlying reasoning mechanism. What Laurenzo measured is the length of redacted thinking signatures, so what she's seeing could be a loss of external observability rather than a real drop in reasoning.&lt;/p&gt;

&lt;p&gt;The concession: two substantive changes did ship. On February 9, Opus 4.6 launched alongside adaptive thinking — instead of a fixed budget, the model now decides how much to think per turn. On March 3, the default effort level dropped from High to Medium (85 out of 100). Boris framed this as "a sweet spot on the intelligence-latency curve."&lt;/p&gt;

&lt;p&gt;When users started sharing actual session transcripts, Boris moved further. He acknowledged that adaptive thinking appears to under-allocate reasoning on specific turns. The fixes he offered: &lt;code&gt;/effort high&lt;/code&gt; or &lt;code&gt;/effort max&lt;/code&gt; in the session, &lt;code&gt;CLAUDE_CODE_EFFORT_LEVEL=max&lt;/code&gt; as an environment variable, and &lt;code&gt;CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1&lt;/code&gt; to force a fixed budget.&lt;/p&gt;

&lt;p&gt;Here's the compressed version. The model itself didn't get dumber, but &lt;strong&gt;the defaults quietly got lower, and users have to turn them back up manually.&lt;/strong&gt; It's structurally the same as a car company lowering your engine's output and telling you to press the gas harder.&lt;/p&gt;

&lt;p&gt;This framing is what connects directly to the April 21 Pro removal.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Line: Quiet Degradation → Explicit Removal Experiment
&lt;/h2&gt;

&lt;p&gt;View the Laurenzo issue and the Pro removal separately and they each look like annoying individual mishaps. View them together and the pattern appears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feb 9      Opus 4.6 + adaptive thinking ship
Mar 3      Default effort: High → Medium
Early Mar  Thinking content redaction fully rolled out
Apr 2      Laurenzo files issue #42796
Apr 6      Boris's official response (UI-only claim + admits default drop)
Apr 21     Claude Code removed from Pro (~12 hours before rollback)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this timeline shows is clear. Anthropic is trying to push down costs through two different mechanisms. One is &lt;strong&gt;quiet degradation&lt;/strong&gt; — making the same price do less thinking. The other is &lt;strong&gt;explicit removal&lt;/strong&gt; — pushing the same feature up into a more expensive tier.&lt;/p&gt;

&lt;p&gt;Quiet degradation works until it gets caught. When it does, you can explain it as "that's a UI change, just flip your effort setting." But when someone like Laurenzo shows up with telemetry, that line stops working.&lt;/p&gt;

&lt;p&gt;Explicit removal is the stronger card. A structural change cuts off future usage at the source. The downside is that it's visible the moment it lands. The instant a red X shows up on the Pro page, X and Hacker News and Reddit light up simultaneously. That's exactly what happened on April 21, and Anthropic backed out within half a day.&lt;/p&gt;

&lt;p&gt;Running both cards at once isn't unusual — it's pretty standard price experimentation. The question is the &lt;strong&gt;order&lt;/strong&gt;. The quiet card went first and didn't fully take, so the explicit card came next. Read Avasare's line one more time: "our current plans weren't built for this." That's not a 2% test sentence. That's a structural overhaul sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Economics: Why the Netflix Model Breaks
&lt;/h2&gt;

&lt;p&gt;To understand why this is happening, you have to go one layer deeper. An AI subscription isn't Netflix.&lt;/p&gt;

&lt;p&gt;Traditional SaaS like Netflix has near-zero marginal cost per user. One more signup rewatching House of Cards is a small bandwidth cost, not a new content production cost. In that model, power users are assets. They drive word-of-mouth, they lower churn, they validate the bundle.&lt;/p&gt;

&lt;p&gt;Agent services are the opposite. Every time a user runs an agent, GPU time actually burns. A Sonnet response is a few cents, Opus is more, and a long-running agent making 200 tool calls can burn through several dollars a day per user. That math runs hot.&lt;/p&gt;

&lt;p&gt;In this structure, power users aren't assets — &lt;strong&gt;they're liabilities.&lt;/strong&gt; The more they use, the more the company loses on them. A user running Claude Code 8 hours a day on a $20 Pro plan consumes far more compute than their subscription pays for. Normally that gets subsidized by lighter users' subscriptions. But once long-running agent workflows become routine, the ratio of "light users" shrinks. The subsidy breaks.&lt;/p&gt;
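&lt;p&gt;The subsidy math can be sketched in a few lines. Every figure below is an illustrative assumption, not an Anthropic number; the point is only that a flat plan's margin flips sign once heavy agent users stop being rare.&lt;/p&gt;

```python
# All figures are illustrative assumptions, not vendor numbers.
SUBSCRIPTION = 20.0   # $/month flat plan (Pro-style)
COST_LIGHT = 3.0      # assumed monthly compute cost of a light user
COST_HEAVY = 90.0     # assumed monthly compute cost of an all-day agent user

def margin(n_light: int, n_heavy: int) -> float:
    """Monthly profit (negative means loss) for a mixed subscriber base."""
    revenue = (n_light + n_heavy) * SUBSCRIPTION
    cost = n_light * COST_LIGHT + n_heavy * COST_HEAVY
    return revenue - cost

print(margin(90, 10))  # 830.0: at 10% heavy users, the cross-subsidy holds
print(margin(60, 40))  # -1780.0: at 40% heavy users, the flat plan loses money
```

&lt;p&gt;Nothing in the model changes between the two lines except the user mix, which is exactly what long-running agents are shifting.&lt;/p&gt;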

&lt;p&gt;This isn't Anthropic's problem alone. Sam Altman said last year that even the $200 ChatGPT Pro plan runs at a loss because of usage. OpenAI, Cursor, Replit — they're all hitting the same wall. Cursor moved to credits. Replit moved to effort-based billing. Google Gemini introduced hard caps. The whole industry is migrating to usage-based pricing at roughly the same time.&lt;/p&gt;

&lt;p&gt;Anthropic's options, in broad strokes, are three. Split plans (a Pro Plus tier at $40–$50 between Pro and Max). Shrink what existing plans include (the Pro removal experiment). Or lower the model's defaults (the effort drop). Right now they're running the latter two in parallel while watching community reaction to see if they can move on the first.&lt;/p&gt;

&lt;h2&gt;
  
  
  From a Game Programmer's Angle: How to Read This Signal
&lt;/h2&gt;

&lt;p&gt;If you're pushing AI coding tool adoption inside a company, this episode changes a few operational decisions.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;bringing measurement back in-house.&lt;/strong&gt; The real lesson Laurenzo left isn't the 67% number itself — it's that she had the infrastructure to produce it. 6,852 session logs were sitting under &lt;code&gt;~/.claude/projects/&lt;/code&gt;, and she could parse the JSONL, correlate the length of thinking block signatures with actual content length (Pearson r=0.971), and run the analysis. Without that, the whole thing would have ended at "Claude feels off lately."&lt;/p&gt;

&lt;p&gt;If your team is on Claude Code, it's worth collecting session logs somewhere and tracking at least the read-to-edit ratio. Anthropic has not officially committed to exposing thinking token counts in API responses. When vendors don't give you the metric, you have to build it.&lt;/p&gt;
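&lt;p&gt;A minimal sketch of building that metric yourself, assuming (hypothetically) that each session-log line is a JSON object with a &lt;code&gt;tool&lt;/code&gt; field. Real Claude Code logs are JSONL too, but their actual field names will differ, so the extraction step needs pointing at your real schema.&lt;/p&gt;

```python
import json

# Assumed (hypothetical) event shape: {"tool": "Read"} etc., one JSON
# object per line. Real session logs will use different field names;
# point the extraction at your actual schema.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "Write"}

def reads_per_edit(lines):
    """Ratio of read-type tool calls to edit-type tool calls in one log."""
    reads = edits = 0
    for line in lines:
        try:
            tool = json.loads(line).get("tool")
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if tool in READ_TOOLS:
            reads += 1
        elif tool in EDIT_TOOLS:
            edits += 1
    return reads / edits if edits else float("nan")

sample = [json.dumps({"tool": t}) for t in
          ["Read", "Grep", "Read", "Edit", "Read", "Write"]]
print(reads_per_edit(sample))  # 2.0: four reads per two edits
```

&lt;p&gt;Run something like this over the logs nightly and a drop like 6.6 → 2.0 shows up on your own dashboard, not three weeks later in someone else's GitHub issue.&lt;/p&gt;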

&lt;p&gt;The second is &lt;strong&gt;avoiding single-provider lock-in.&lt;/strong&gt; This isn't "I hate Claude" — it's risk hedging. A team using only Claude Code right now has its dev process exposed to whatever pricing experiment Anthropic runs next. Codex is catching up fast. Local model options — DeepSeek, Qwen Coder — have meaningfully closed the gap for coding workloads. Keep Claude as primary, but keep a backup provider your team can actually run.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;pinning effort and budget settings explicitly.&lt;/strong&gt; Now that adaptive thinking is the default, anything resembling complex engineering work should have &lt;code&gt;/effort high&lt;/code&gt; or &lt;code&gt;CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1&lt;/code&gt; baked into the team standard. Defaults can drop again, and the drop may not come with a loud announcement.&lt;/p&gt;

&lt;p&gt;The fourth is &lt;strong&gt;distrusting "unlimited" marketing.&lt;/strong&gt; The real lesson here: contracts with explicit numeric limits are safer than ones without. "Unlimited on the Pro plan" isn't a promise, it's copy. It gets redefined the moment usage patterns shift. A Max 20x hard cap of "N Opus calls per day" is, long-term, the more defensible contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Gained and What We Lost
&lt;/h2&gt;

&lt;p&gt;One gain. Anthropic now knows &lt;strong&gt;where the community's line is.&lt;/strong&gt; Rolling back in 12 hours means the reaction overshot their model. The next experiment will be built with this data, and it'll be smoother — probably as a Pro Plus tier.&lt;/p&gt;

&lt;p&gt;One loss. A layer of trust. Read Laurenzo's line one more time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Six months ago, Claude was unique in its reasoning quality and execution capabilities. Now, other competitors need to be very seriously considered and evaluated."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a competitive take and a contract take at the same time. The moment you depend on a tool for your dev process, that vendor's pricing experiments become your team's productivity risk. Whether that's an acceptable risk has to be re-calculated every time an event like this hits.&lt;/p&gt;

&lt;p&gt;April 21's 12 hours were Anthropic testing the range of what it can move. They were also us re-measuring how much we can depend on it.&lt;/p&gt;

&lt;p&gt;Both numbers are worth keeping in mind.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Our current plans weren't built for this." — That's not the end of the pricing experiment. That's the start.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>claudecode</category>
      <category>aitooling</category>
    </item>
    <item>
      <title>Claude Code가 Pro에서 사라진 12시간 - Anthropic의 조용한 하향과 명시적 제거 사이</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Thu, 23 Apr 2026 10:14:21 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/claude-codega-proeseo-sarajin-12sigan-anthropicyi-joyonghan-hahyanggwa-myeongsijeog-jegeo-sai-3mfi</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/claude-codega-proeseo-sarajin-12sigan-anthropicyi-joyonghan-hahyanggwa-myeongsijeog-jegeo-sai-3mfi</guid>
      <description>&lt;p&gt;원문 이슈는 &lt;a href="https://github.com/anthropics/claude-code/issues/42796" rel="noopener noreferrer"&gt;GitHub #42796&lt;/a&gt;에, 관련 보도는 &lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt;에서 볼 수 있다.&lt;br&gt;
더 많은 글은 &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;에서.&lt;/p&gt;



&lt;p&gt;2026년 4월 21일 오후, Anthropic이 Claude Pro 요금제 페이지에서 Claude Code를 조용히 지웠다.&lt;/p&gt;

&lt;p&gt;가격 페이지의 "Claude Code 포함" 문구가 사라졌고, 지원 문서의 "Using Claude Code with your Pro or Max plan"이 "Using Claude Code with your Max plan"으로 바뀌었다. 몇 시간 뒤 전부 원복됐지만, 이건 단발성 실수가 아니다.&lt;/p&gt;

&lt;p&gt;2월의 adaptive thinking 도입, 3월의 effort 기본값 하향, 4월 2일 AMD 디렉터의 공개 분석 보고서, 그리고 4월 21일의 Pro 제거 실험까지 — 서로 떨어진 사건처럼 보이는 이 네 개의 점은 하나의 선으로 이어진다.&lt;/p&gt;

&lt;p&gt;Anthropic은 long-running agent 시대의 경제학을 감당하지 못하고 있고, 감당하기 위해 여러 가지를 동시에 시도하고 있다.&lt;/p&gt;

&lt;p&gt;그 시도들이 어떻게 연결되는지, 그리고 회사에 AI 코딩 도구를 도입하는 입장에서 이 시그널을 어떻게 읽어야 하는지를 정리해본다.&lt;/p&gt;


&lt;h2&gt;
  
  
  4월 21일 오후에 정확히 무슨 일이 있었나
&lt;/h2&gt;

&lt;p&gt;Ed Zitron caught it first. On Anthropic's pricing page, the "Includes Claude Code" checkmark on the Pro plan had turned into a red X. The support docs had been edited too. The Claude Code product page still mentioned Pro, but the core billing page had clearly flipped to "Max only."&lt;/p&gt;

&lt;p&gt;When The Register started asking questions, Anthropic's Head of Growth, Amol Avasare, posted an explanation on X: "It's a small test on roughly 2% of new prosumer signups, and existing Pro/Max subscribers are unaffected."&lt;/p&gt;

&lt;p&gt;The back half of that explanation is the more interesting part.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When we launched Max a year ago, there was no Claude Code, no Cowork, no agents running for hours at a time. Max was a plan designed for heavy chat use, and that was all."&lt;/p&gt;

&lt;p&gt;"Since then we bundled Claude Code into Max and it grew explosively after Opus 4. Cowork arrived. Long-running async agents became an everyday workflow. How people actually use their Claude subscription has fundamentally changed."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Put the "2% test" explanation next to those remarks, and this is no "small experiment." It is closer to an admission that the current plan structure itself cannot handle current usage patterns.&lt;/p&gt;

&lt;p&gt;And another event, two weeks earlier, had already backed that admission up.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two weeks earlier: an AMD AI director showed up with telemetry
&lt;/h2&gt;

&lt;p&gt;On April 2, Stella Laurenzo, a director in AMD's AI group, opened issue #42796 on the Claude Code GitHub repo. The title: "Claude Code is unusable for complex engineering tasks with the Feb updates."&lt;/p&gt;

&lt;p&gt;Laurenzo did not post vibes like "it feels dumber lately." She brought a quantitative analysis of three months of her team's usage: 6,852 Claude Code session logs, 234,760 tool calls, and 17,871 thinking blocks. When a former Google OpenXLA lead who now heads the AI group at a ~$200B semiconductor company opens a GitHub issue, it is not done on a hunch.&lt;/p&gt;

&lt;p&gt;Three numbers form the core. First, thinking depth dropped by an average of 67% starting in late February. Second, the ratio of code reads before a file edit (reads-per-edit) fell from 6.6 to 2.0, a 70% drop. Third, a stop-hook script that flags inappropriate behaviors such as deflecting blame, quitting mid-task, and asking for permission had never fired before March 8, then fired 173 times over the following 17 days.&lt;/p&gt;

&lt;p&gt;Laurenzo's own wording makes the picture sharper.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When thinking gets shallow, the model defaults to the cheapest behavior. It edits without reading, stops without finishing, deflects responsibility for failures, and chooses the simplest fix over the correct one. This matches the observed symptoms exactly."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a game developer, this data stings in a particular way. A read-to-edit ratio of 6.6 is the signature of a workflow that reads the header, traces dependencies, greps for call sites, reads the tests, and only then edits. In a complex codebase, say UE5 C++ with its tangle of headers, cpp files, USTRUCTs, and TMaps, that ratio falling to 2.0 effectively means "patch it carelessly."&lt;/p&gt;

&lt;p&gt;Laurenzo wrote that her team ultimately moved to another provider, and her parting line is the key one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"6개월 전, Claude는 추론 품질과 실행 능력에서 독보적이었다. 이제는 다른 경쟁자들을 매우 진지하게 고려하고 평가해야 한다."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Anthropic's rebuttal, and a half admission
&lt;/h2&gt;

&lt;p&gt;Boris Cherny of the Claude Code team showed up in the issue. His explanation mixes rebuttal with admission, so it is worth splitting the two apart.&lt;/p&gt;

&lt;p&gt;The rebuttal: the redact-thinking header introduced in March is a UI-only change, and the actual reasoning still happens. It affects neither the model's thinking budget nor the reasoning execution mechanism. The metric Laurenzo analyzed estimates the length of redacted thinking blocks, so it may reflect the loss of an externally measurable signal rather than a drop in actual reasoning.&lt;/p&gt;

&lt;p&gt;The admission: two substantive changes did land. First, adaptive thinking arrived with the Opus 4.6 release on February 9; instead of a fixed thinking budget, the model now decides on every turn how much to think. Second, on March 3 the default effort level was lowered from High to Medium (85 on a 100-point scale). Boris framed this as "the sweet spot on the intelligence-latency curve."&lt;/p&gt;

&lt;p&gt;When users shared actual conversation logs, Boris stepped back one notch further and conceded that adaptive thinking appears to under-allocate reasoning on certain turns. The remedies he offered were the &lt;code&gt;/effort high&lt;/code&gt; and &lt;code&gt;/effort max&lt;/code&gt; commands, the environment variable &lt;code&gt;CLAUDE_CODE_EFFORT_LEVEL=max&lt;/code&gt;, and &lt;code&gt;CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In short: the model itself did not get dumber, but &lt;strong&gt;the defaults were quietly lowered, and users have to turn them back up manually.&lt;/strong&gt; Structurally, it is a car company detuning the engine and telling drivers to press the accelerator harder.&lt;/p&gt;

&lt;p&gt;And that frame connects, as one continuous line, to the April 21 Pro-removal experiment.&lt;/p&gt;
&lt;h2&gt;
  
  
  One line: quiet downgrade → explicit removal experiment
&lt;/h2&gt;

&lt;p&gt;Viewed separately, the Laurenzo issue and the Pro removal each look like an annoying accident. Put them together and a pattern appears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2월 9일   Opus 4.6 + adaptive thinking 도입
3월 3일   기본 effort level: High → Medium
3월 초    thinking content redaction 전면 적용
4월 2일   Laurenzo 이슈 #42796 공개
4월 6일   Boris의 공식 답변 (UI-only 주장 + 기본값 하향 인정)
4월 21일  Pro 요금제에서 Claude Code 제거 (~12시간 후 롤백)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this timeline says is clear: Anthropic is trying to push costs down in two different ways. One is the &lt;strong&gt;quiet downgrade&lt;/strong&gt;, making the model think less at the same price. The other is &lt;strong&gt;explicit removal&lt;/strong&gt;, pushing the same feature up into a more expensive tier.&lt;/p&gt;

&lt;p&gt;The quiet downgrade works until it is discovered. Once discovered, it can be explained away: "this is a UI change, and you can adjust the effort setting." But when someone like Laurenzo shows up with telemetry, the rebuttal gets hard.&lt;/p&gt;

&lt;p&gt;Explicit removal is the stronger card. Change the structure and you cut off future usage at the source. The downside is that it is immediately visible: the moment a red X appears on the Pro page, X, Hacker News, and Reddit blow up simultaneously. That is exactly what happened on April 21, and Anthropic let go within half a day.&lt;/p&gt;

&lt;p&gt;Playing both cards is not strange in itself; it is also ordinary pricing experimentation. The problem is the order: &lt;strong&gt;the quiet card was tried first, and the explicit card came out only after the quiet one stopped working.&lt;/strong&gt; Read Avasare's remark once more: "Our plans weren't designed for this kind of usage." That is not a sentence at "2% test" scale. It is a sentence at restructuring scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent economics: why the Netflix model breaks
&lt;/h2&gt;

&lt;p&gt;To understand why this is happening, you have to go one level deeper. An AI subscription service is not Netflix.&lt;/p&gt;

&lt;p&gt;Traditional SaaS like Netflix has near-zero marginal cost as users grow. One more subscriber rewatching House of Cards costs a bit more server bandwidth, not new content production. So power users are an asset: they spread word of mouth, lower churn, and prove the value of the bundle.&lt;/p&gt;

&lt;p&gt;An AI agent service is the exact opposite. Every agent run consumes real GPU time. A single Sonnet response costs a few cents, Opus costs more, and a long-running agent making 200 tool calls can drain several dollars from one user in a single day.&lt;/p&gt;

&lt;p&gt;In this structure, a power user is &lt;strong&gt;a liability, not an asset.&lt;/strong&gt; The more they use, the more the company loses. A user running Claude Code eight hours a day on a $20 Pro plan consumes far more compute than the subscription fee covers. The structure subsidizes them with the fees of lighter users, but once long-running agent workflows become routine, the share of "light users" shrinks and the subsidy stops working.&lt;/p&gt;
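&lt;p&gt;To make the mismatch concrete, here is a back-of-envelope sketch of one power user's daily compute cost. The per-token prices and per-call token counts below are illustrative assumptions chosen for the arithmetic, not Anthropic's actual figures:&lt;/p&gt;

```python
# Back-of-envelope agent economics. All constants are illustrative
# assumptions, not Anthropic's real prices or any real user's usage.
PRICE_PER_MTOK_IN = 5.0    # assumed $/M input tokens, Opus-class
PRICE_PER_MTOK_OUT = 25.0  # assumed $/M output tokens

def daily_agent_cost(tool_calls: int, in_tok_per_call: int, out_tok_per_call: int) -> float:
    """Rough compute cost of one user's agent runs for a single day."""
    cost_in = tool_calls * in_tok_per_call / 1e6 * PRICE_PER_MTOK_IN
    cost_out = tool_calls * out_tok_per_call / 1e6 * PRICE_PER_MTOK_OUT
    return cost_in + cost_out

# 200 tool calls a day at ~8k input / ~1k output tokens per call
# comes to $13/day, roughly $286 over 22 working days, against a
# $20/month subscription.
print(daily_agent_cost(200, 8_000, 1_000))
```

&lt;p&gt;Under these assumptions, a single heavy user burns an order of magnitude more compute than the plan price, which is exactly the subsidy problem described above.&lt;/p&gt;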

&lt;p&gt;This is not Anthropic's problem alone. Sam Altman said publicly last year that the $200 ChatGPT Pro plan loses money on usage. OpenAI, Cursor, and Replit are all hitting the same wall. Cursor has moved to credits, Replit to effort-based pricing, Google Gemini to hard caps: the entire industry is shifting to usage-based billing at once.&lt;/p&gt;

&lt;p&gt;Broadly, Anthropic has three options: split the plans (a middle tier like a $40-50 Pro Plus), shrink what existing plans include (the Pro-removal experiment), or lower the model's own defaults (the effort downgrade). So far it has been trying the latter two at the same time, and it is reasonable to read it as feeling out the first while watching community reaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  From a game developer's seat: how to read this signal
&lt;/h2&gt;

&lt;p&gt;As someone driving the adoption of AI coding tools at a company, I find this episode forces a fresh look at a few operational decisions.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;getting measurement back&lt;/strong&gt;. The real lesson Laurenzo leaves is not the 67% figure itself but the fact that she had the infrastructure to produce it. 6,852 session logs had accumulated under &lt;code&gt;~/.claude/projects/&lt;/code&gt;, and she could parse the JSONL and compute the correlation between the length of a thinking block's signature field and the actual content length (Pearson r=0.971). Without that, it would have ended at "something feels off lately."&lt;/p&gt;

&lt;p&gt;If your team has Claude Code installed, it is worth at least collecting session logs somewhere and building enough structure to track the read-to-edit ratio. Anthropic has made no official commitment to expose thinking-token metrics in API responses. If they will not give you the metric, you build it yourself.&lt;/p&gt;
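&lt;p&gt;As a starting point, here is a minimal sketch of computing a reads-per-edit ratio from local session logs. The JSONL layout assumed here (messages whose content list contains &lt;code&gt;tool_use&lt;/code&gt; blocks with a &lt;code&gt;name&lt;/code&gt; field such as Read, Edit, or Write) is an assumption about the current log format; verify it against your own files under &lt;code&gt;~/.claude/projects/&lt;/code&gt; before trusting the numbers:&lt;/p&gt;

```python
import json
from collections import Counter
from pathlib import Path

def reads_per_edit(log_dir: Path) -> float:
    """Count tool_use blocks across all JSONL session logs and return
    the Read-to-(Edit+Write) ratio. Returns inf if no edits were logged."""
    counts: Counter = Counter()
    for log_file in log_dir.rglob("*.jsonl"):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial or corrupt lines
            message = record.get("message")
            if not isinstance(message, dict):
                continue
            content = message.get("content")
            if not isinstance(content, list):
                continue
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    counts[block.get("name")] += 1
    edits = counts["Edit"] + counts["Write"]
    return counts["Read"] / edits if edits else float("inf")
```

&lt;p&gt;Tracking this single number over time is enough to notice a shift like 6.6 to 2.0 without waiting for a vendor dashboard.&lt;/p&gt;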

&lt;p&gt;The second is &lt;strong&gt;avoiding single-provider dependence&lt;/strong&gt;. This is not "I dislike Claude"; it is risk hedging. A team that uses only Claude Code right now has its development process shaken by a single Anthropic pricing experiment. Codex is catching up fast, and local-model options like DeepSeek and Qwen Coder have closed the gap meaningfully on coding performance. Keep Claude as the main tool, but keep one backup provider in a state the team can actually run.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;pinning effort/budget settings explicitly&lt;/strong&gt;. Now that adaptive thinking is the default, bake &lt;code&gt;/effort high&lt;/code&gt; or &lt;code&gt;CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1&lt;/code&gt; into the team standard for complex work. The default could drop again at any time, and there may be no immediate announcement when it does.&lt;/p&gt;
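&lt;p&gt;One way to make the pinning hard to forget is a tiny launcher script the whole team shares. The environment variable names come from the issue thread above; that a &lt;code&gt;claude&lt;/code&gt; binary sits on PATH is an assumption about your local install:&lt;/p&gt;

```python
import os

# Effort settings pinned as team standard, per the issue thread.
PINNED = {
    "CLAUDE_CODE_EFFORT_LEVEL": "max",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
}

def pinned_env() -> dict:
    """Copy of the current environment with the effort settings forced on."""
    env = dict(os.environ)
    env.update(PINNED)
    return env

# Launch Claude Code with the pinned environment, e.g.:
#   subprocess.call(["claude"], env=pinned_env())
```

&lt;p&gt;The point is not the script itself but that the defaults live in version control instead of in each developer's shell history.&lt;/p&gt;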

&lt;p&gt;The fourth is &lt;strong&gt;distrusting "unlimited" marketing&lt;/strong&gt;. The real lesson of this episode is this: a contract with explicit numeric limits is actually the safer one. "Unlimited on the Pro plan" is a marketing line, not a promise, and it gets redefined whenever usage patterns shift. An explicit cap like Max 20x's "N Opus calls per day" is the more defensible term in the long run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we gained and what we lost
&lt;/h2&gt;

&lt;p&gt;We gained one thing from this episode: Anthropic now knows &lt;strong&gt;where the community's limit line sits&lt;/strong&gt;. Rolling back within 12 hours is evidence that the community reaction exceeded their modeling. The next experiment will arrive with that lesson learned, and it will be softer. Probably in the shape of a Pro Plus tier.&lt;/p&gt;

&lt;p&gt;We also lost something: one layer of trust. Read Laurenzo's sentence again.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Six months ago, Claude was unmatched in reasoning quality and execution ability. Now the other competitors have to be considered and evaluated very seriously."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a judgment about the competitive landscape and, at the same time, a judgment about the contract. The moment a development process depends on one tool, how that tool's provider chooses to experiment with pricing governs your team's productivity. Whether that is an acceptable risk has to be recalculated every time an incident like this one erupts.&lt;/p&gt;

&lt;p&gt;The 12 hours of April 21 were Anthropic testing how far it could move, and at the same time our chance to measure how far we can afford to depend.&lt;/p&gt;

&lt;p&gt;Both of those numbers are worth remembering.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"우리 플랜은 이런 사용량을 위해 설계된 게 아니다." — 이건 가격 실험의 끝이 아니라 시작이다.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Claude Opus 4.7 Is Out: Real-World Changes Through a Game Programmer's Eyes</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Sun, 19 Apr 2026 08:29:28 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/claude-opus-47i-nawassda-geim-peurogeuraemeoga-bon-siljeon-byeonhwa-35nc</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/claude-opus-47i-nawassda-geim-peurogeuraemeoga-bon-siljeon-byeonhwa-35nc</guid>
<description>&lt;p&gt;More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;On April 16, 2026, Anthropic made Claude Opus 4.7 generally available.&lt;/p&gt;

&lt;p&gt;Opus 4.6 came out in February, so that is exactly two months. The two-month cycle seems to be hardening.&lt;/p&gt;

&lt;p&gt;Why does this release matter from a game programmer's perspective? If you use Claude Code for UE5 C++ work, every model change means re-tuning prompts, redesigning harnesses, and re-setting token budgets. Based on the official announcement and the migration guide, here is what changes for game development work.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-line summary: "a model you can hand long tasks to"
&lt;/h2&gt;

&lt;p&gt;The first sentence that jumps out of the announcement is this one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Users report being able to hand off their hardest coding work—the kind that previously needed close supervision—to Opus 4.7 with confidence."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is, you can hand work off without supervision. Whether that is hype or reality you only learn by using it. But the benchmark numbers back the direction.&lt;/p&gt;

&lt;p&gt;On Cursor's CursorBench, Opus 4.6 scored 58% while Opus 4.7 crossed 70%. On Rakuten-SWE-Bench, the production-task solve rate tripled. Cognition, the company behind Devin, commented that it "maintains code coherence for hours."&lt;/p&gt;

&lt;p&gt;Translated into day-to-day feel: up through Opus 4.6, the safe pattern was to hand over one chunk of work and come back 30-40 minutes later to check that nothing looked off. With 4.7, a two-to-three-hour task is worth a try.&lt;/p&gt;

&lt;p&gt;What that means for UE5 work is clear. Subsystem refactors, reimplementing a USTRUCT hash, deleting an enum across multiple assets: handing off that kind of task and heading into a meeting becomes a realistic workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The xhigh effort level: the setting to change right now
&lt;/h2&gt;

&lt;p&gt;The most important API change in Opus 4.7 is &lt;code&gt;xhigh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Until now there were four effort levels: &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, and &lt;code&gt;max&lt;/code&gt;. &lt;code&gt;xhigh&lt;/code&gt;, short for "extra high," now slots in between &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And the default effort in Claude Code rose to &lt;code&gt;xhigh&lt;/code&gt; on every plan. The official guide recommends the same.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why does this matter? Until now, when a Claude Code answer felt shallow, people worked around it with patterns like &lt;code&gt;/ultrathink&lt;/code&gt;. Now the default itself has shifted toward thinking more deeply.&lt;/p&gt;

&lt;p&gt;There is a trade-off, though. Hex's CTO put it this way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, you can drop effort a level and keep the same quality. Budget-constrained teams can exploit exactly that point: use xhigh for the genuinely hard tasks and drop to low for ordinary edits. Using the levels boldly, at both extremes, is what changes how 4.7 feels.&lt;/p&gt;

&lt;p&gt;Here is how that plays out in practice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 코딩 태스크 - 기본값은 xhigh
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;output_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xhigh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 간단한 리팩토링이나 포맷팅 - low로 충분
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;output_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a general guideline: keep the default at xhigh and drop simple work to low. There is hardly a situation left that calls for medium.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tokenizer changed: the same input can expand up to 1.35×
&lt;/h2&gt;

&lt;p&gt;This is the change that surprises you on the end-of-month bill if you adopt it quietly.&lt;/p&gt;

&lt;p&gt;From the official guide:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tokenizer has been replaced. It is smarter, but depending on the content type, the same input can map to up to 1.35× as many tokens.&lt;/p&gt;
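&lt;p&gt;The planning arithmetic is simple enough to sketch. The 1.35 factor is the upper bound quoted in the guide; the 200k context window below is an illustrative assumption, so substitute whatever limit applies to your plan:&lt;/p&gt;

```python
# Worst-case context re-budgeting for the 4.7 tokenizer change.
INFLATION_MAX = 1.35  # upper bound from the migration guide

def fits_after_migration(current_tokens: int, context_limit: int) -> bool:
    """True if a 4.6-sized context still fits under worst-case inflation."""
    return context_limit >= current_tokens * INFLATION_MAX

# A 150k-token context against an assumed 200k window no longer fits
# in the worst case (150_000 * 1.35 = 202_500), while 140k still does.
print(fits_after_migration(150_000, 200_000), fits_after_migration(140_000, 200_000))
```

&lt;p&gt;Treat the result as a planning bound, not a measurement; whether your content actually hits 1.35× depends on its type.&lt;/p&gt;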

&lt;p&gt;Why is this painful for game developers? You often put a UE5 C++ codebase into context: &lt;code&gt;.h&lt;/code&gt;/&lt;code&gt;.cpp&lt;/code&gt; pairs, piles of Blueprints, USTRUCT definitions, Slate widgets. These chunks of code are exactly the kind of content where token expansion tends to hit.&lt;/p&gt;

&lt;p&gt;And there is a second change on top.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In agentic settings, especially on later turns, it thinks more. Which means output tokens grow too.&lt;/p&gt;

&lt;p&gt;The response splits into two moves.&lt;/p&gt;

&lt;p&gt;First, when handing off a big task, put "be concise" in the prompt or use the task budgets beta. Task budgets shipped alongside this release; it lets developers guide how an agent allocates tokens over a long run. It is in public beta, so you can try it right now.&lt;/p&gt;

&lt;p&gt;Second, re-audit the context that always goes in, such as &lt;code&gt;CLAUDE.md&lt;/code&gt;. A size that was fine on 4.6 can get tight on 4.7. Cut the unused sections and trim the example code shorter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision gets stronger: 2,576px, 3.75 megapixels
&lt;/h2&gt;

&lt;p&gt;For anyone building computer-use agents, this is a genuinely big change.&lt;/p&gt;

&lt;p&gt;Opus 4.7 accepts images up to 2,576 pixels on the long axis, roughly 3.75 megapixels. That is more than three times the resolution of previous Claude models.&lt;/p&gt;

&lt;p&gt;A practical example of why it matters: hand Claude a UE5 editor screenshot and ask it to "reproduce the state of this Details panel," and up through 4.6, downsampling smeared the checkboxes and numbers so badly the agent could not read them. On 4.7 it sees them at pixel level.&lt;/p&gt;

&lt;p&gt;The case of XBOW, an automated penetration-testing company, is the extreme one. On their visual-acuity benchmark, Opus 4.6 scored 54.5%; 4.7 hit 98.5%. Nearly double.&lt;/p&gt;

&lt;p&gt;Several game-side uses follow. Handing over a UMG/Slate design as a screenshot and getting code back. Reading Unreal Insights or profiler result images to find hot paths. Reviewing commit-diff screenshots. Visual debugging of sprite and animation output. All of these were cases that fell over on resolution through 4.6.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory: better use of file-system-based memory
&lt;/h2&gt;

&lt;p&gt;The announcement carries this sentence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The point is not planting a forgetting curve inside the model; it is strengthening the loop where the agent leaves notes on the file system and reads them back.&lt;/p&gt;

&lt;p&gt;Claude Code users will find the pattern familiar. When a session runs long, you create something like &lt;code&gt;NOTES.md&lt;/code&gt; or &lt;code&gt;CONTEXT.md&lt;/code&gt; partway through, and a new session reads it back to restore state. 4.7 runs this routine more fluently.&lt;/p&gt;

&lt;p&gt;For a game development project this means quite a lot. A UE5 project's context is vast: networking conventions, USTRUCT layouts, Blueprint calling rules, rendering-pipeline constraints. It cannot all fit in one session.&lt;/p&gt;

&lt;p&gt;The practical pattern is already well known: split memory files by topic under &lt;code&gt;docs/architecture/&lt;/code&gt;. Start a session with "today is combat-system work" and the agent picks out and reads only the relevant memory files. 4.7 does this with less strain.&lt;/p&gt;
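&lt;p&gt;The selection step is trivial to automate. This is a sketch of the topic-scoped pattern; the directory layout and file names are hypothetical:&lt;/p&gt;

```python
from pathlib import Path

def load_memory(root: Path, topic: str) -> str:
    """Concatenate only the memory files whose file name mentions the topic,
    e.g. load_memory(Path("docs/architecture"), "combat") picks up
    combat-damage.md but skips rendering-pipeline.md."""
    parts = []
    for md in sorted(root.glob("*.md")):
        if topic.lower() in md.stem.lower():
            parts.append(f"## {md.name}\n{md.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

&lt;p&gt;Whether you wire this into a session-start hook or just paste the output, the win is the same: the agent starts from the relevant notes instead of the whole project.&lt;/p&gt;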

&lt;p&gt;The higher model versions climb, the more "memory file architecture" matters as a skill. The era of dumping everything into a single CLAUDE.md is over.&lt;/p&gt;

&lt;h2&gt;
  
  
  /ultrareview and auto mode: Claude Code itself changes too
&lt;/h2&gt;

&lt;p&gt;Two features landed in Claude Code alongside the Opus 4.7 release.&lt;/p&gt;

&lt;p&gt;First, the &lt;code&gt;/ultrareview&lt;/code&gt; slash command. It opens a dedicated review session, reads the changes, and flags the bugs and design issues a meticulous reviewer would catch. Pro/Max users reportedly get three free runs.&lt;/p&gt;

&lt;p&gt;The more complex a project's team conventions, the better a dedicated mode like &lt;code&gt;/ultrareview&lt;/code&gt; tends to work. Things an ordinary chat skims past get a deeper look in review mode. It is likely to be especially useful in a structure like UE5, where headers, sources, Build.cs, and Config are all entangled.&lt;/p&gt;

&lt;p&gt;Next, auto mode expanded to Max users. Auto mode has Claude judge for itself whether a given action should ask for permission. It is safer than allowing everything and faster than asking every time.&lt;/p&gt;

&lt;p&gt;Why is this useful in game development? A UE5 project has hundreds of files. A single task often touches &lt;code&gt;.h&lt;/code&gt;, &lt;code&gt;.cpp&lt;/code&gt;, &lt;code&gt;.Build.cs&lt;/code&gt;, and &lt;code&gt;Config/*.ini&lt;/code&gt;. Asking for permission every time breaks the flow. Allowing everything risks wrecking a critical file like &lt;code&gt;.uproject&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Auto mode stakes out the middle ground: ask about the dangerous files, proceed on the rest. On paper, that structure looks like a good fit for a game-project workflow.&lt;/p&gt;
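&lt;p&gt;To see the shape of that middle ground, here is a toy policy for a UE5 tree. This is not how Claude Code's auto mode is implemented internally; it only illustrates the ask-versus-proceed decision, and the suffix list is an assumption about what counts as dangerous in a given project:&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Files where a mistake is expensive: project descriptor, plugins, config.
ASK_SUFFIXES = {".uproject", ".uplugin", ".ini"}

def decision(path: str) -> str:
    """Return "ask" for critical project files, "allow" for routine edits."""
    if PurePosixPath(path).suffix in ASK_SUFFIXES:
        return "ask"
    return "allow"

print(decision("Source/Game/MyActor.cpp"))   # allow
print(decision("MyGame.uproject"))           # ask
print(decision("Config/DefaultEngine.ini"))  # ask
```

&lt;p&gt;The interesting part of the real feature is that the model makes this call contextually instead of from a fixed list; the list here is just the simplest way to draw the boundary.&lt;/p&gt;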

&lt;h2&gt;
  
  
  The Mythos Preview story: why Opus 4.7 is an "intermediate step"
&lt;/h2&gt;

&lt;p&gt;You cannot understand this release without the Mythos Preview story.&lt;/p&gt;

&lt;p&gt;Last week Anthropic unveiled Claude Mythos Preview and, alongside it, an initiative called Project Glasswing. Mythos is Anthropic's most capable model, and its cybersecurity abilities in particular are so overwhelming that it is not being released to the public. Instead it went out, in limited form, only to core-infrastructure partners such as AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, JPMorgan, the Linux Foundation, NVIDIA, and Palo Alto Networks.&lt;/p&gt;

&lt;p&gt;Why go this far? Mythos has already found thousands of zero-days across every major OS and browser. It reportedly found a 27-year-old bug in OpenBSD. Nicholas Carlini went as far as saying he "found more bugs in the past few weeks than in my entire life." The limited-distribution track exists to keep that capability out of attackers' hands.&lt;/p&gt;

&lt;p&gt;The pricing is also entirely different from Opus 4.7. Mythos Preview runs $25/$125 per million tokens (input/output); Opus 4.7 is $5/$25. A 5× gap.&lt;/p&gt;

&lt;p&gt;Here Opus 4.7's position comes into focus. Anthropic had previously signaled it would "ship new cyber guardrails on the next Opus model for testing," and that model is 4.7. To eventually release a Mythos-class model broadly, the safeguards first have to work in production. 4.7 is the sample that validates those guardrails, with its cyber capabilities deliberately dialed down.&lt;/p&gt;

&lt;p&gt;What does this mean, practically, for game developers?&lt;/p&gt;

&lt;p&gt;Most game development work is unaffected. Network protocol design, cheat defense, server security: these sit on the "defense" side and generally pass. But when you debug in environments where security software gets in the way, questions can arise that risk being read as "how to bypass security software." In those cases you may need to explain the context with more care.&lt;/p&gt;

&lt;p&gt;For legitimate security researchers, a separate track called the Cyber Verification Program has opened. If you work in red teaming or vulnerability research, it is worth applying.&lt;/p&gt;

&lt;p&gt;And here is the point that really matters: this release exposes Anthropic's roadmap. Mythos has no public release planned for now, but the stated goal is "to one day deploy Mythos-class models safely at scale." Opus 4.7 is the first station on that road. The next model, whether 4.8 or 5.0, will come closer to Mythos-level capability while carrying more sophisticated guardrails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-upgrade checklist
&lt;/h2&gt;

&lt;p&gt;Here is what to do once you pick this release up.&lt;/p&gt;

&lt;p&gt;Re-audit &lt;code&gt;CLAUDE.md&lt;/code&gt; and any other files that always go into context. With the tokenizer change, yesterday's size can now be excessive. Clear out unused sections and shorten the example code.&lt;/p&gt;

&lt;p&gt;In Claude Code settings, raise the default effort to &lt;code&gt;xhigh&lt;/code&gt; and create a separate &lt;code&gt;low&lt;/code&gt; preset for simple work. Working both extremes is the core of operating 4.7.&lt;/p&gt;

&lt;p&gt;Apply &lt;code&gt;/ultrareview&lt;/code&gt; to one PR. Measure what it catches beyond your existing review, and how long it takes.&lt;/p&gt;

&lt;p&gt;Test auto mode. Does it actually stop on the dangerous files, or just sail through?&lt;/p&gt;

&lt;p&gt;Finally, and this is the most important step, re-examine your existing prompts. Anthropic warned about this explicitly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It takes instructions literally. A prompt written for 4.6 on the assumption that the model would "figure it out" can cause problems on 4.7. Look especially hard at the directives inside &lt;code&gt;CLAUDE.md&lt;/code&gt;: if they are full of vague phrases like "appropriately," "if needed," and "as much as possible," 4.7 will try to honor them literally and produce odd results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the model changes, the prompts have to change. A harness is not something you stack on top of a model; it is something you rewrite together with the model.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related links
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 official announcement: &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;anthropic.com/news/claude-opus-4-7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Migration guide: &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide" rel="noopener noreferrer"&gt;platform.claude.com/docs/en/about-claude/models/migration-guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project Glasswing: &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;anthropic.com/project/glasswing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mythos Preview technical blog (Red Team): &lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;red.anthropic.com/2026/mythos-preview&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>ue5</category>
    </item>
    <item>
      <title>I Built My Own Year-End Review for AI Coding — Memradar Code Report</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:54:21 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/i-built-my-own-year-end-review-for-ai-coding-memradar-code-report-1le3</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/i-built-my-own-year-end-review-for-ai-coding-memradar-code-report-1le3</guid>
      <description>&lt;p&gt;Live app at &lt;a href="https://memradar.vercel.app" rel="noopener noreferrer"&gt;memradar.vercel.app&lt;/a&gt;. Code on &lt;a href="https://github.com/on1659/memradar" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;Every December Cursor pops up with something like "you coded X hours this year, across these languages." GitHub drops Year in Review. Discord puts yearly stats on your profile.&lt;/p&gt;

&lt;p&gt;The numbers are all mine, but flipping through them, I catch myself smiling — "oh, that's how I was." That's what a year-end review does. It's not information delivery — it's a &lt;strong&gt;ritual that pulls out one number at a time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted that for myself. Claude Code and Codex drop JSONL logs into my home directory every day (&lt;code&gt;~/.claude/projects/&lt;/code&gt;, &lt;code&gt;~/.codex/sessions/&lt;/code&gt;), and I had never looked inside. My conversations are on my disk, and I couldn't read them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memradar&lt;/strong&gt; is a local tool that turns those JSONLs into a retrospective. One line (&lt;code&gt;npx memradar&lt;/code&gt;), a browser opens, and along with the dashboard there's a full-screen slide retrospective — &lt;strong&gt;Code Report&lt;/strong&gt; — built in.&lt;/p&gt;

&lt;p&gt;This post is about the details I got obsessive about while building it. The idea is simple, but making people actually &lt;em&gt;feel&lt;/em&gt; something while looking at their own data took a lot of small calls.&lt;/p&gt;
&lt;h2&gt;
  
  
  It had to be Code Report, not Wrapped
&lt;/h2&gt;

&lt;p&gt;When I first built the full-screen retrospective feature, I called it "Wrapped" internally. The folder is still &lt;code&gt;src/components/wrapped/&lt;/code&gt;. Spotify Wrapped was the reference, so the name stuck.&lt;/p&gt;

&lt;p&gt;But shipping a product literally named "Wrapped" has problems. You're borrowing someone else's brand. The word also pins the feature to "year-end summary" as a category.&lt;/p&gt;

&lt;p&gt;While renaming, I wrote this principle into &lt;code&gt;docs/UI-UX-PRINCIPLES.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;9. The dashboard and the Code Report are two moods
   of the same product
- The dashboard is for exploration; the Code Report is
  for emotional retrospective and sharing.
- The Code Report uses its own palette, its own typography,
  and full-screen narrative.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same data, but &lt;strong&gt;the dashboard should feel like an "analysis tool" — calm — while the Code Report should lean into emotion, like a "retrospective experience."&lt;/strong&gt; I separated the palettes and typography so the two screens don't bleed into each other.&lt;/p&gt;

&lt;p&gt;So the name became Code Report. A report &lt;em&gt;on&lt;/em&gt; the code (my AI coding logs). Not tied to a fixed calendar moment — it's &lt;strong&gt;a screen that's a retrospective whenever you open it&lt;/strong&gt;. Wrapped happens once a year; Code Report opens whenever I need it.&lt;/p&gt;

&lt;p&gt;I kept the folder name (&lt;code&gt;wrapped/&lt;/code&gt;) as-is. Renaming internal variables carries too much refactor risk for no real benefit — only the product-facing name got locked to Code Report. That split itself is a small lesson: the inside and the outside of the same feature can have different names.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the dashboard alone wasn't enough
&lt;/h2&gt;

&lt;p&gt;The dashboard came first, naturally. Heatmap, hourly chart, word cloud, session browser. Every metric on one screen.&lt;/p&gt;

&lt;p&gt;I loaded my own logs. The numbers were all there. And I felt &lt;strong&gt;nothing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's the dashboard ceiling. Lots of information, zero emotion. You nod and close the tab. The reason Cursor's year-end review makes people laugh with the same kind of data is that &lt;strong&gt;each screen shows one number.&lt;/strong&gt; The space around that number is empty on purpose.&lt;/p&gt;

&lt;p&gt;I locked in a composition principle for Code Report — "one scene, one message." From &lt;code&gt;docs/UI-UX-PRINCIPLES.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Here, "one scene, one message" matters more
  than the dashboard's rules.
- Full-screen, heavy whitespace, big type,
  a dedicated dark story palette.
- End with a shareable image —
  the end of the emotional arc is also the call to action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are eight slides. Intro with the first session date. Total prompts. Favorite model. Coding hours. Top tools ranking. Personality type. Usage. Shareable card at the end.&lt;/p&gt;

&lt;p&gt;Each slide holds one number or one message. "I talked to Claude 3,200 times" as a small figure in a dashboard corner is one thing. "3,200" counting up on an empty slide is completely another. Same data.&lt;/p&gt;

&lt;p&gt;That was the first obsession. &lt;strong&gt;How do you make a number be felt, not just read?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I scrapped 4 personality types for 8
&lt;/h2&gt;

&lt;p&gt;The climax of Code Report is the personality slide. I started with four: Architect, Speed Runner, Explorer, Night Sage.&lt;/p&gt;

&lt;p&gt;After a couple of runs on different logs, everyone said some version of "...I don't think I'm any of those." MBTI works because its four axes are computed independently: you are I, and N, and F, and P, all at the same time, and that is how INFP appears. A single axis (pick one of four buckets) is always going to be coarse.&lt;/p&gt;

&lt;p&gt;So I rewrote it as three axes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;style × scope × rhythm
 ↓       ↓        ↓
care    depth    pace
  8 combinations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The combinations yield names like "Deep-Sea Diver," "Lightning Fixer," "Chaos Creator." Because each axis is computed separately, the result feels more personal.&lt;/p&gt;

&lt;p&gt;Half a day burned on this. The four types weren't wrong — &lt;strong&gt;the feeling of reading the result was weak&lt;/strong&gt;. That's what matters most in Code Report. Before "correct," the viewer has to chuckle and think "yeah, that's me."&lt;/p&gt;

&lt;p&gt;I kept tuning the personality logic against real data. Did my logs produce a result that felt right for me? Did a friend's logs produce something that fit them? Move one threshold and everything shifts. I calibrated that dozens of times.&lt;/p&gt;
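&lt;p&gt;The axis computation itself is only a few lines. This sketch shows its shape; the thresholds, the input metrics, and several of the names are made up for illustration (only Deep-Sea Diver, Lightning Fixer, and Chaos Creator are quoted from the real set):&lt;/p&gt;

```python
# Three independently computed axes, so 2^3 = 8 personality types.
# Thresholds and input metrics are illustrative, not Memradar's tuned values.
NAMES = {
    ("careful", "deep", "steady"): "Deep-Sea Diver",
    ("careful", "deep", "bursty"): "Night Surgeon",
    ("careful", "broad", "steady"): "Cartographer",
    ("careful", "broad", "bursty"): "Scout",
    ("fast", "deep", "steady"): "Tunnel Borer",
    ("fast", "deep", "bursty"): "Lightning Fixer",
    ("fast", "broad", "steady"): "Juggler",
    ("fast", "broad", "bursty"): "Chaos Creator",
}

def personality(reads_per_edit: float, avg_session_min: float, daily_stddev: float) -> str:
    style = "careful" if reads_per_edit >= 3.0 else "fast"   # style axis: care
    scope = "deep" if avg_session_min >= 45 else "broad"     # scope axis: depth
    rhythm = "bursty" if daily_stddev >= 20 else "steady"    # rhythm axis: pace
    return NAMES[(style, scope, rhythm)]
```

&lt;p&gt;Because each axis moves independently, nudging one threshold only flips one bit of the result, which is what made the dozens of calibration passes tractable.&lt;/p&gt;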

&lt;h2&gt;
  
  
  That 2.5-second delay on the last slide
&lt;/h2&gt;

&lt;p&gt;Commit title: &lt;code&gt;Fix last slide dashboard prompt timing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The problem: on the share-card slide, tapping the screen opens a "Go to dashboard?" modal. But people flip through Code Report with a rhythm — tap, tap, tap — and that same rhythm triggers the modal the moment they land on the last slide. They never even see the share card.&lt;/p&gt;

&lt;p&gt;Most people would shrug at this. But once it happened to me, I couldn't unsee it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;setDashboardPromptReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slideIndex&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;lastSlideIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setDashboardPromptReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;lastSlideIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;slideIndex&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you enter the last slide, &lt;code&gt;dashboardPromptReady&lt;/code&gt; stays false for 2.5 seconds. Tap all you want — the modal won't open. I even changed the cursor to &lt;code&gt;default&lt;/code&gt; during that window. The mouse cursor itself signals "not yet."&lt;/p&gt;

&lt;p&gt;Writing it, I asked myself: is this really needed? The user won't even notice.&lt;/p&gt;

&lt;p&gt;And that's &lt;strong&gt;exactly the point.&lt;/strong&gt; Same idea as locking input for a few frames right after a cutscene ends in games. You prevent the player's "skip button rhythm" from triggering something they didn't mean to. When it bites, it feels awful; when it's handled, nobody notices. Nobody noticing is the win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why 20 themes?
&lt;/h2&gt;

&lt;p&gt;Memradar ships with 20 themes. Four backgrounds (Dark, AMOLED, Light, Warm) × five accents (Indigo, Violet, Teal, Rose, Amber).&lt;/p&gt;

&lt;p&gt;Why so many? Same reason as everything else — it's about feel.&lt;/p&gt;

&lt;p&gt;Flipping through Code Report is &lt;strong&gt;personal.&lt;/strong&gt; It's me looking at my own coding log. I don't want someone else's dark theme; I want to pick my own mood. Cursor's year-end review locks you into one fixed palette, but if I'm the one building the tool, I should get to choose.&lt;/p&gt;

&lt;p&gt;The four backgrounds are calculated, too. Dark as the default. AMOLED for a true black on OLED screens. Light for a bright café. Warm for evening. The right pick changes with the situation.&lt;/p&gt;
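&lt;p&gt;Mechanically, a 4 × 5 grid like that stays cheap to maintain if the presets are generated from the two axes rather than hand-written. A sketch under my own naming — none of these identifiers are from Memradar's actual code:&lt;/p&gt;

```typescript
// Sketch only: background/accent names come from the article,
// everything else (types, ids, function names) is assumed.
type Background = "dark" | "amoled" | "light" | "warm";
type Accent = "indigo" | "violet" | "teal" | "rose" | "amber";

interface ThemePreset {
  id: string;
  background: Background;
  accent: Accent;
}

const BACKGROUNDS: Background[] = ["dark", "amoled", "light", "warm"];
const ACCENTS: Accent[] = ["indigo", "violet", "teal", "rose", "amber"];

// Cross product: 4 backgrounds x 5 accents = 20 presets.
function buildThemePresets(): ThemePreset[] {
  const presets: ThemePreset[] = [];
  for (const background of BACKGROUNDS) {
    for (const accent of ACCENTS) {
      presets.push({ id: background + "-" + accent, background, accent });
    }
  }
  return presets;
}
```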

&lt;p&gt;Fonts are pinned to Noto Sans KR + Noto Serif KR. The UI shows a lot of Hangul, so the Korean has to look right first. I didn't pick an English-first font. I code mostly in Korean, my prompts are mostly Korean, and the Code Report that reads those has to look good in Korean.&lt;/p&gt;

&lt;p&gt;localStorage preserves the choice across visits. Obvious, but forgetting it is the kind of thing that screams amateur.&lt;/p&gt;
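&lt;p&gt;The persistence itself is a few lines. A minimal sketch, with the storage object injected so the logic isn't welded to the browser — the key name is made up, not Memradar's:&lt;/p&gt;

```typescript
// Sketch: persist the chosen theme id. "memradar.theme" is an assumed key.
const THEME_KEY = "memradar.theme";

// Anything with getItem/setItem works: window.localStorage in the
// browser, a plain stub elsewhere.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function saveTheme(storage: StorageLike, themeId: string): void {
  storage.setItem(THEME_KEY, themeId);
}

function loadTheme(storage: StorageLike, fallback: string): string {
  return storage.getItem(THEME_KEY) ?? fallback;
}
```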

&lt;h2&gt;
  
  
  The decisions packed into one heatmap
&lt;/h2&gt;

&lt;p&gt;A GitHub-style daily activity heatmap. Here are the calls inside that single widget.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;responsive cell size.&lt;/strong&gt; Cells auto-grow and shrink with container width. A &lt;code&gt;ResizeObserver&lt;/code&gt; watches the container; on change, cell size is recomputed and the grid rerenders. I started with fixed sizes, then watched the heatmap get clipped on a smaller laptop and rebuilt it.&lt;/p&gt;
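&lt;p&gt;The pure part of that computation might look like this — the constants are illustrative, not Memradar's actual numbers; in the app it would run inside the &lt;code&gt;ResizeObserver&lt;/code&gt; callback:&lt;/p&gt;

```typescript
// Sketch: derive a clamped per-cell size from container width.
// 53 columns covers a full year of ISO weeks; min/max are illustrative.
const WEEK_COLUMNS = 53;
const MIN_CELL_PX = 8;
const MAX_CELL_PX = 16;

function computeCellSize(containerWidth: number, gapPx = 2): number {
  const raw = Math.floor(containerWidth / WEEK_COLUMNS) - gapPx;
  return Math.min(MAX_CELL_PX, Math.max(MIN_CELL_PX, raw));
}

// Wiring sketch (browser only; setCellSize and containerEl are assumed):
// const observer = new ResizeObserver(entries => {
//   setCellSize(computeCellSize(entries[0].contentRect.width));
// });
// observer.observe(containerEl);
```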

&lt;p&gt;Second, &lt;strong&gt;click-to-pin a date.&lt;/strong&gt; Clicking a cell shows that day's session summary in a side panel. Hover-only worked on desktop but died on mobile, and "pinning" a day on desktop lets you actually dig in.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;streak counter.&lt;/strong&gt; "How many days in a row." I debated where to place it, and landed on the heatmap's side panel. Not trying to gamify like Duolingo — just a passive "huh, 15 straight days this month" when your eyes already land on the heatmap.&lt;/p&gt;

&lt;p&gt;Fourth, &lt;strong&gt;day-of-week pattern.&lt;/strong&gt; Which weekday I code the most, tucked small on the side. I thought about cutting it. Then I noticed I work weekends way more than I thought. That's the Code Report flavor — surfacing a pattern I didn't see.&lt;/p&gt;

&lt;p&gt;Four decisions inside one widget. Same story for the word cloud, the hourly chart, the token chart.&lt;/p&gt;
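&lt;p&gt;Of the four, the streak counter is the one that's pure date math. A sketch under my own assumptions — ISO date strings in, longest run of consecutive days out:&lt;/p&gt;

```typescript
// Sketch: longest run of consecutive active days.
// Input: "YYYY-MM-DD" strings for each day with at least one session.
function longestStreak(isoDates: string[]): number {
  // Midnight UTC timestamps divide evenly into whole epoch days.
  const days = Array.from(new Set(isoDates))
    .map(d => Date.parse(d + "T00:00:00Z") / 86_400_000)
    .sort((a, b) => a - b);

  let best = 0;
  let run = 0;
  let prev = Number.NaN;
  for (const day of days) {
    run = day - prev === 1 ? run + 1 : 1; // the NaN comparison resets the run
    best = Math.max(best, run);
    prev = day;
  }
  return best;
}
```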

&lt;h2&gt;
  
  
  How many times I rewrote the DropZone wording
&lt;/h2&gt;

&lt;p&gt;The most obsessive stretch was the landing page DropZone. The commit log shows four passes over that single component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Replace copy-paste wording with Ctrl+C/V shortcut
- Use Ctrl+C/V/Enter wording consistently in DropZone
- Allow .claude/.codex root folder drop and add install guide link
- Fix DropZone wording: shorten Ctrl+C/V label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One at a time. First, the original wording was "copy and paste," which runs long in both languages. Two shortcuts side by side make the action click in half a second.&lt;/p&gt;

&lt;p&gt;Second was consistency. "Ctrl+C/V" in some places, "copy/paste" left in others. Mixed vocabulary annoys people.&lt;/p&gt;

&lt;p&gt;Third, accept the &lt;code&gt;.claude&lt;/code&gt; folder itself. Originally you had to drop the inner &lt;code&gt;.claude/projects/&lt;/code&gt;. But users naturally drag &lt;code&gt;.claude&lt;/code&gt; from their home. So the drop handler digs one level deeper when it sees &lt;code&gt;.claude&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;droppedFolder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;projectsDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;findChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;droppedFolder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;projects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectsDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;scanDir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectsDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fourth, wording again. The "Ctrl+C/V" label had a redundant prefix stuck in front, so I shortened it.&lt;/p&gt;

&lt;p&gt;Four passes on one component's copy. Over the top? Maybe. But the DropZone is the first screen a user sees. If they can't figure out &lt;strong&gt;what to do in five seconds&lt;/strong&gt;, the rest doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bilingual support turned out to be a big call
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;i18n (ko/en), theme presets, and hash routing&lt;/code&gt;. Three things landed in one commit.&lt;/p&gt;

&lt;p&gt;I added bilingual support partly because I publish this blog in both Korean and English. But there's a bigger reason. &lt;strong&gt;The copy in Code Report is emotional.&lt;/strong&gt; "You're a Night Owl." "You started 3,200 conversations this month." If those only exist in English, the joy gets cut in half for Korean readers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/i18n.tsx&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;translations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;ko&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.nightSage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;새벽의 현자&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.speedRunner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;번개 해결사&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.nightSage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Night Sage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.speedRunner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Lightning Fixer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Translating Night Sage as "새벽의 현자" (sage of the dawn, not just night) and Lightning Fixer as "번개 해결사" wasn't literal — I picked each Korean name so &lt;strong&gt;it lands at the same temperature&lt;/strong&gt; the English version does. Writing the eight personality names in both languages took hours.&lt;/p&gt;
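&lt;p&gt;On top of a table like that, the lookup helper only needs a fallback chain. A sketch — the real &lt;code&gt;src/i18n.tsx&lt;/code&gt; surely differs:&lt;/p&gt;

```typescript
// Sketch: key lookup with an English fallback, then the key itself,
// so a missing translation never renders as a blank.
type Locale = "ko" | "en";

interface Strings {
  [key: string]: string;
}

const translations: { ko: Strings; en: Strings } = {
  ko: { "personality.nightSage": "새벽의 현자" },
  en: {
    "personality.nightSage": "Night Sage",
    "personality.speedRunner": "Lightning Fixer",
  },
};

function t(locale: Locale, key: string): string {
  return translations[locale][key] ?? translations.en[key] ?? key;
}
```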

&lt;p&gt;Hash routing shipped alongside for a concrete reason: sharing Code Report links means state has to live in the URL. &lt;code&gt;#wrapped/5&lt;/code&gt; puts the slide index right there. Refreshing keeps you on the same slide.&lt;/p&gt;
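&lt;p&gt;Parsing that hash back into state is one regex. A sketch — the route shape is assumed from the &lt;code&gt;#wrapped/5&lt;/code&gt; example, not read from Memradar's source:&lt;/p&gt;

```typescript
// Sketch: "#wrapped/5" parses to { view: "wrapped", slide: 5 };
// anything else parses to null.
interface WrappedRoute {
  view: "wrapped";
  slide: number;
}

function parseHash(hash: string): WrappedRoute | null {
  const match = /^#wrapped\/(\d+)$/.exec(hash);
  return match ? { view: "wrapped", slide: Number(match[1]) } : null;
}

function buildHash(slide: number): string {
  return "#wrapped/" + slide;
}
```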

&lt;h2&gt;
  
  
  The moment I threw single-HTML away
&lt;/h2&gt;

&lt;p&gt;The biggest technical call was flipping the CLI from single-HTML to a local server.&lt;/p&gt;

&lt;p&gt;Originally the CLI parsed every JSONL, serialized the result to JSON, inlined it into a &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag, and produced a single giant HTML file. No server, works offline, one file to open. Clean.&lt;/p&gt;

&lt;p&gt;Then my &lt;code&gt;.claude/projects/&lt;/code&gt; grew. The HTML passed 18MB, and the browser froze for 4-5 seconds just running &lt;code&gt;JSON.parse&lt;/code&gt;. Code Report was dead before it even started.&lt;/p&gt;

&lt;p&gt;So I flipped it. A local server on port 3939 (&lt;code&gt;http.createServer&lt;/code&gt;) plus &lt;code&gt;/api/sessions&lt;/code&gt; streaming. The browser pulls a 0.6MB app bundle first. Sessions arrive in batches of ten from the server. Session bodies load only on click.&lt;/p&gt;
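&lt;p&gt;The contract behind that endpoint is plain offset pagination. A sketch of the pure part — the route shape and field names are my guesses:&lt;/p&gt;

```typescript
// Sketch: the slicing behind a "GET /api/sessions?offset=N" style
// endpoint that serves sessions in batches of ten.
interface SessionSummary {
  id: string;
  title: string;
}

const BATCH_SIZE = 10;

function pageSessions(all: SessionSummary[], offset: number) {
  const items = all.slice(offset, offset + BATCH_SIZE);
  return { total: all.length, offset, items };
}
```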

&lt;p&gt;It's the same shift as whole-world loading vs level streaming in UE5. You used to load the entire map into RAM. Modern open worlds stream only chunks near the player. Disk holds everything; RAM holds only what's needed. Exactly that move.&lt;/p&gt;

&lt;p&gt;I kept single-HTML around behind a &lt;code&gt;--static&lt;/code&gt; flag. For a handful of sessions, single-HTML is still the simplest thing that works. Pick based on the situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The name Memradar — and the one line
&lt;/h2&gt;

&lt;p&gt;The first name was Promptale. Prompt plus tale. Emotional but doesn't tell you what the tool does.&lt;/p&gt;

&lt;p&gt;Memradar made the intent concrete. Mem (memory) plus Radar. A radar sweeping over my memory. It also ties into my blog (radarlog.kr), so the brand carries across. And inside lives Code Report as its own screen. The product name and the feature name each do their own job.&lt;/p&gt;

&lt;p&gt;To try it locally, it's one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx memradar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It auto-scans &lt;code&gt;~/.claude/projects/&lt;/code&gt; and &lt;code&gt;~/.codex/sessions/&lt;/code&gt;, then opens the dashboard in your browser. From there, opening Code Report kicks off the full-screen retrospective. Everything stays local. Nothing gets sent anywhere.&lt;/p&gt;

&lt;p&gt;There's a web version too. At &lt;a href="https://memradar.vercel.app" rel="noopener noreferrer"&gt;memradar.vercel.app&lt;/a&gt; you can drag the &lt;code&gt;.claude&lt;/code&gt; folder straight in.&lt;/p&gt;

&lt;p&gt;The first time I looked at my own logs through this, what hit me was &lt;strong&gt;how much more I'd been talking to Claude than I realized.&lt;/strong&gt; And how focused I was late at night. Seeing it as numbers, it finally clicked: "so this is how I've been living."&lt;/p&gt;

&lt;p&gt;That's the point of Code Report. Take the conversations I left behind — and let me look back on them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My logs were on my disk the whole time. I just wasn't looking."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>sideprojects</category>
      <category>devlog</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>내 AI 코딩 연말결산을 직접 만들었다 — Memradar Code Report</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:54:21 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/nae-ai-koding-yeonmalgyeolsaneul-jigjeob-mandeuleossda-memradar-code-report-2nb1</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/nae-ai-koding-yeonmalgyeolsaneul-jigjeob-mandeuleossda-memradar-code-report-2nb1</guid>
      <description>&lt;p&gt;앱은 &lt;a href="https://memradar.vercel.app" rel="noopener noreferrer"&gt;memradar.vercel.app&lt;/a&gt;에서, 코드는 &lt;a href="https://github.com/on1659/memradar" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;에 있다.&lt;br&gt;
더 많은 글은 &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;에서.&lt;/p&gt;



&lt;p&gt;매년 연말이 되면 Cursor가 "올해 너는 몇 시간 코딩했고, 어떤 언어를 얼마나 썼고" 하는 회고를 띄운다. GitHub도 Year in Review를 낸다. Discord도 프로필에 연간 통계가 뜬다.&lt;/p&gt;

&lt;p&gt;숫자는 내 것인데, 페이지를 넘기다 보면 "아 나 이랬구나" 하고 웃게 된다. 이게 연말결산의 힘이다. 정보 전달이 아니라 &lt;strong&gt;한 장면에 한 숫자씩 꺼내놓는 의식&lt;/strong&gt;이다.&lt;/p&gt;

&lt;p&gt;나도 그게 필요했다. Claude Code와 Codex가 내 홈 디렉터리(&lt;code&gt;~/.claude/projects/&lt;/code&gt;, &lt;code&gt;~/.codex/sessions/&lt;/code&gt;)에 매일 JSONL 로그를 쌓는데, 정작 그 안을 들여다본 적이 없다. 내 대화가 내 디스크에 있는데, 내가 못 읽는다.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memradar&lt;/strong&gt;는 그 JSONL을 회고로 바꿔주는 로컬 툴이다. &lt;code&gt;npx memradar&lt;/code&gt; 한 줄이면 브라우저가 뜨고, 대시보드와 함께 풀스크린 슬라이드 회고 — &lt;strong&gt;Code Report&lt;/strong&gt;가 같이 들어있다.&lt;/p&gt;

&lt;p&gt;이 글은 Memradar를 만들면서 집요하게 매달린 디테일에 대한 기록이다. 아이디어는 단순하지만, 사람이 "재밌다"고 느끼게 만드는 데는 결정 하나하나가 다 필요했다.&lt;/p&gt;
&lt;h2&gt;
  
  
  Wrapped가 아니라 Code Report여야 했다
&lt;/h2&gt;

&lt;p&gt;처음 풀스크린 회고 기능을 만들면서 내부적으로 "Wrapped"라고 불렀다. 코드 폴더 이름도 &lt;code&gt;src/components/wrapped/&lt;/code&gt;로 시작했다. Spotify Wrapped가 레퍼런스였으니까 자연스럽게 그렇게 됐다.&lt;/p&gt;

&lt;p&gt;근데 제품 이름으로 Wrapped를 그대로 쓰면 문제가 있다. 남의 브랜드 용어를 빌려 쓰는 셈이다. 또 Wrapped라는 말은 "연말결산"이라는 범주로만 읽힌다.&lt;/p&gt;

&lt;p&gt;이름을 다시 잡으면서 &lt;code&gt;docs/UI-UX-PRINCIPLES.md&lt;/code&gt;에 이런 원칙을 못 박았다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;9. 대시보드와 Code Report는 같은 제품의 다른 무드다
- 대시보드는 정보 탐색, Code Report는 감정적 회고와 공유를 담당한다.
- Code Report는 독립 팔레트와 전용 타이포,
  풀스크린 내러티브를 사용한다.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;같은 데이터를 다루지만, &lt;strong&gt;대시보드는 "분석 툴"처럼 침착해야 하고 Code Report는 "회고 경험"처럼 감정선을 강화해도 된다&lt;/strong&gt;고 적었다. 두 화면이 무드에서 섞이지 않도록 팔레트와 타이포그래피를 분리했다.&lt;/p&gt;

&lt;p&gt;그래서 이름이 Code Report가 됐다. 코드(내 AI 코딩 로그)에 대한 리포트. 연말결산이라는 고정된 시점이 아니라 &lt;strong&gt;언제 열어도 회고가 되는 화면&lt;/strong&gt;이라는 걸 이름에 담고 싶었다. Wrapped는 1년에 한 번이지만, Code Report는 내가 필요할 때마다 열 수 있다.&lt;/p&gt;

&lt;p&gt;코드상의 폴더명(&lt;code&gt;wrapped/&lt;/code&gt;)은 그대로 뒀다. 내부 변수명까지 갈아엎는 건 리팩터링 리스크가 크고, 외부에 노출되는 제품 이름만 Code Report로 고정했다. 이 구분 자체가 "같은 기능의 안과 밖을 다르게 부를 수 있다"는 작은 교훈이다.&lt;/p&gt;

&lt;h2&gt;
  
  
  왜 대시보드만으로는 부족했나
&lt;/h2&gt;

&lt;p&gt;당연히 대시보드부터 만들었다. 히트맵, 시간대별 차트, 워드클라우드, 세션 브라우저. 모든 지표가 한 화면에 다 있다.&lt;/p&gt;

&lt;p&gt;완성하고 내 로그를 띄워봤다. 숫자가 다 있었다. 근데 &lt;strong&gt;아무 감정도 안 들었다.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;이게 대시보드의 한계다. 정보는 많은데, 보는 사람은 "음 그렇구나" 하고 창을 닫는다. 같은 데이터로 Cursor 연말결산이 사람들을 웃게 만드는 건, &lt;strong&gt;한 화면에 한 숫자만&lt;/strong&gt; 올라와서다. 그 숫자를 둘러싼 공간이 다 비어있기 때문이다.&lt;/p&gt;

&lt;p&gt;Code Report의 구성 원칙을 "한 장면 한 메시지"로 잡았다. &lt;code&gt;docs/UI-UX-PRINCIPLES.md&lt;/code&gt;에 이렇게 적어뒀다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- 여기서는 "대시보드 규칙"보다 "한 장면 한 메시지"가 더 중요하다.
- 풀스크린, 강한 여백, 큰 타이포,
  전용 다크 스토리 팔레트를 사용한다.
- 마지막은 공유 가능 이미지로 닫는다.
  즉, 감정선의 끝이 곧 행동 유도여야 한다.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;슬라이드는 여덟 장이다. 인트로에 첫 세션 날짜, 다음은 총 프롬프트 수, 자주 쓴 모델, 코딩 시간대, 자주 부른 툴 순위, 성격 유형, 사용량, 마지막에 공유 카드.&lt;/p&gt;

&lt;p&gt;각 슬라이드엔 숫자 하나 또는 메시지 하나만 올라간다. "내가 Claude를 3,200번 불렀다"는 걸 대시보드 구석에서 숫자로 보는 것과, 빈 슬라이드에 3,200까지 카운트업으로 올라오는 걸 보는 건 완전히 다르다. 같은 데이터인데.&lt;/p&gt;

&lt;p&gt;이게 첫 번째 집착이었다. &lt;strong&gt;숫자를 어떻게 "느끼게" 만들까.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  성격 유형을 4개에서 8개로 갈아엎은 이유
&lt;/h2&gt;

&lt;p&gt;Code Report의 하이라이트는 성격 유형 슬라이드다. Architect, Speed Runner, Explorer, Night Sage. 네 개로 시작했다.&lt;/p&gt;

&lt;p&gt;근데 이틀 돌려보니 전부 &lt;strong&gt;"어 나 저거 아닌데"&lt;/strong&gt; 소리가 나왔다. MBTI가 재밌는 건 네 축이 각각 독립적으로 계산되기 때문이다. "I이면서 N이면서 F이면서 P"라서 INFP가 나온다. 축이 하나(4타입 중 하나)면 거칠 수밖에 없다.&lt;/p&gt;

&lt;p&gt;그래서 3축 시스템으로 다시 짰다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;style × scope × rhythm
 ↓       ↓        ↓
꼼꼼함   깊이     속도
  8 조합
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;이렇게 조합하면 "심해 잠수부", "번개 해결사", "카오스 크리에이터" 같은 이름이 나온다. 각 축이 따로 계산되니까 결과가 더 개인적으로 느껴진다.&lt;/p&gt;

&lt;p&gt;이거 짜는 데 반나절 날렸다. 4타입이 틀린 게 아니라, &lt;strong&gt;결과를 읽는 순간의 느낌이 약했다.&lt;/strong&gt; 이게 Code Report에서 제일 중요한 지점이다. 숫자가 맞고 틀리고 전에, 읽는 사람이 "오 이거 나네" 하고 웃어야 한다.&lt;/p&gt;

&lt;p&gt;성격 계산 로직도 실제 데이터로 계속 튜닝했다. 내 로그에 돌려봤을 때 나한테 맞는 결과가 나오는지, 친구 로그에 돌렸을 때 친구한테 맞는 결과가 나오는지. 임계값 하나를 바꾸면 결과가 흔들린다. 이걸 수십 번 맞췄다.&lt;/p&gt;
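&lt;p&gt;세 축의 조합을 유형 키로 바꾸는 뼈대는 대략 이런 모양일 수 있다. 아래 식별자와 이름 배치는 글을 바탕으로 한 가정이고, 실제 Memradar 구현과는 다를 수 있다.&lt;/p&gt;

```typescript
// 가정에 기반한 스케치: 세 이진 축(style/scope/rhythm)을 각각 따로
// 계산하고, 그 조합(2 x 2 x 2 = 8)을 유형 키로 쓴다.
interface Axes {
  meticulous: boolean; // style: 꼼꼼함
  deep: boolean;       // scope: 깊이
  fast: boolean;       // rhythm: 속도
}

function personalityKey(a: Axes): string {
  return [
    a.meticulous ? "meticulous" : "loose",
    a.deep ? "deep" : "broad",
    a.fast ? "fast" : "slow",
  ].join("-");
}

// 키-이름 매핑은 예시로 두 개만 적는다. 나머지 여섯 조합도 같은 방식이다.
const NAMES: { [key: string]: string } = {
  "loose-deep-slow": "심해 잠수부",
  "meticulous-broad-fast": "번개 해결사",
};
```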

&lt;h2&gt;
  
  
  마지막 슬라이드에 2.5초를 박은 이야기
&lt;/h2&gt;

&lt;p&gt;커밋 제목: &lt;code&gt;Fix last slide dashboard prompt timing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;무슨 문제였냐면, 공유 카드 슬라이드에서 화면을 탭하면 "대시보드로 넘어가시겠습니까?" 모달이 뜨게 해뒀다. 근데 Code Report를 쭉 넘기던 손가락 리듬으로 마지막 슬라이드에서도 무심코 탭해서, 공유 카드를 보기도 전에 모달이 떠버린다.&lt;/p&gt;

&lt;p&gt;대부분은 "괜찮지 뭐" 하고 넘어갈 디테일이다. 근데 한 번 걸리니까 계속 신경 쓰였다. 그래서 고쳤다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;setDashboardPromptReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slideIndex&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;lastSlideIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setDashboardPromptReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;lastSlideIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;slideIndex&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;마지막 슬라이드에 들어가면 2.5초 동안 &lt;code&gt;dashboardPromptReady&lt;/code&gt;가 false다. 그동안 탭해도 모달은 안 뜬다. 심지어 커서도 &lt;code&gt;default&lt;/code&gt;로 바꿨다. "아직 누를 수 없다"는 걸 마우스 커서로도 알려준다.&lt;/p&gt;

&lt;p&gt;이거 쓰면서 스스로도 생각했다. 이게 필요한 최적화일까. 사용자는 눈치 못 챌 것 같은데.&lt;/p&gt;

&lt;p&gt;근데 &lt;strong&gt;필요한 건 맞다.&lt;/strong&gt; 게임에서 컷씬이 끝난 직후 입력을 몇 프레임 막는 거랑 똑같다. 플레이어가 "스킵 버튼 누르는 리듬"으로 다음 장면에서 뭔가를 잘못 누르는 걸 막는다. 이런 건 걸리면 기분이 나쁘고, 막아두면 아무도 모른다. 근데 모르는 게 맞다.&lt;/p&gt;

&lt;h2&gt;
  
  
  테마가 20개인 이유
&lt;/h2&gt;

&lt;p&gt;Memradar에는 테마가 20개 있다. 배경 4종(Dark, AMOLED, Light, Warm), 액센트 색 5종(Indigo, Violet, Teal, Rose, Amber). 4 × 5 = 20.&lt;/p&gt;

&lt;p&gt;왜 이렇게 많냐. 이것도 "느낌"의 문제다.&lt;/p&gt;

&lt;p&gt;Code Report를 넘기는 경험은 &lt;strong&gt;사적이다.&lt;/strong&gt; 내 코딩 로그를 내가 보는 거다. 남이 만든 다크 테마를 그냥 쓰는 게 아니라, 내 분위기를 내가 고르고 싶다. Cursor 연말결산이 보라색 하나로 고정되어 있는 것과 다르게, 내가 만드는 건 내가 고를 수 있어야 한다.&lt;/p&gt;

&lt;p&gt;배경 네 개로 나눈 것도 계산된 거다. Dark는 기본, AMOLED는 OLED 화면용 진짜 검정, Light는 밝은 카페용, Warm은 저녁 무드. 상황이 다르면 고르는 게 다르다.&lt;/p&gt;

&lt;p&gt;폰트도 Noto Sans KR + Noto Serif KR로 고정했다. 한글이 엄청 많이 들어가기 때문에, 한글이 예쁘게 나오는 게 첫째다. 영어 폰트 안 썼다. 내가 주로 한국어로 코딩해서 프롬프트도 한국어 비중이 높고, 그걸 읽는 Code Report도 한국어가 예뻐야 한다.&lt;/p&gt;

&lt;p&gt;localStorage에 저장해서 다음번에도 고른 테마가 유지된다. 이건 당연한데, 당연한 걸 빼먹으면 바로 티난다.&lt;/p&gt;

&lt;h2&gt;
  
  
  히트맵 한 개에 들어간 결정들
&lt;/h2&gt;

&lt;p&gt;GitHub 스타일 일별 활동 히트맵. 이 위젯 하나에 들어간 결정이 이 정도다.&lt;/p&gt;

&lt;p&gt;첫째, &lt;strong&gt;반응형 셀 크기.&lt;/strong&gt; 창 너비에 따라 셀이 자동으로 커지고 작아진다. &lt;code&gt;ResizeObserver&lt;/code&gt;로 컨테이너 크기를 추적하다가, 셀 크기를 다시 계산해서 리렌더한다. 처음엔 고정 크기로 뒀는데, 화면 작은 노트북에서 히트맵이 잘려서 다시 짰다.&lt;/p&gt;

&lt;p&gt;둘째, &lt;strong&gt;클릭해서 날짜 선택.&lt;/strong&gt; 셀을 클릭하면 그날 세션 요약이 사이드에 뜬다. 처음엔 hover만 있었는데, hover는 모바일에서 안 되고, 클릭해서 "고정"할 방법이 있으면 더 자세히 볼 수 있다.&lt;/p&gt;

&lt;p&gt;셋째, &lt;strong&gt;streak 카운터.&lt;/strong&gt; "연속 며칠 코딩했는지." 이걸 어디에 넣을까 고민하다가 히트맵 옆 사이드바에 박았다. Duolingo처럼 게임화하려는 게 아니라, 히트맵 보면서 자연스럽게 "아 나 이번 달 15일 연속이네" 하고 알게 되는 정도.&lt;/p&gt;

&lt;p&gt;넷째, &lt;strong&gt;요일 패턴.&lt;/strong&gt; 월화수목금토일 중 어느 요일에 가장 많이 코딩했는지 사이드에 작게. 이건 넣을까 말까 고민했는데, 내가 주말에 생각보다 많이 코딩한다는 걸 이걸 보고 알았다. 이런 게 Code Report의 맛이다. 몰랐던 내 패턴이 보이는 거.&lt;/p&gt;

&lt;p&gt;위젯 하나에 결정이 네 개. 히트맵뿐 아니라 워드클라우드, 시간대별 차트, 토큰 차트 다 이렇게 만들어졌다.&lt;/p&gt;

&lt;h2&gt;
  
  
  DropZone wording을 몇 번 바꿨는지
&lt;/h2&gt;

&lt;p&gt;가장 집요했던 건 랜딩 페이지 DropZone이다. 커밋 로그에 DropZone wording 수정만 네 번 있다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Replace copy-paste wording with Ctrl+C/V shortcut
- Use Ctrl+C/V/Enter wording consistently in DropZone
- Allow .claude/.codex root folder drop and add install guide link
- Fix DropZone wording: shorten Ctrl+C/V label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;하나씩 뜯어보면 이렇다. 처음엔 "복사해서 붙여넣기"였다. 그러다가 Ctrl+C / Ctrl+V로 바꿨다. 왜냐면 한국어로 "복사해서 붙여넣기"는 길고, 영어로 "copy and paste"도 장황하다. 숏컷 두 개 나란히 보여주면 0.5초 만에 "아 저거"가 꽂힌다.&lt;/p&gt;

&lt;p&gt;두 번째 변경은 "일관성"이었다. 어떤 곳엔 "Ctrl+C/V" 있는데 다른 곳엔 "복사/붙여넣기"가 남아있었다. 용어가 섞이면 불편하다.&lt;/p&gt;

&lt;p&gt;세 번째, &lt;code&gt;.claude&lt;/code&gt; 폴더 자체 드롭 허용. 원래는 &lt;code&gt;.claude/projects/&lt;/code&gt; 안쪽을 드롭해야 파싱됐다. 근데 사용자 입장에선 &lt;code&gt;.claude&lt;/code&gt;를 홈에서 드래그하는 게 자연스럽다. 그래서 드롭 핸들러가 &lt;code&gt;.claude&lt;/code&gt; 안의 &lt;code&gt;projects&lt;/code&gt; 자식을 자동으로 찾아 들어가도록 고쳤다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;droppedFolder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;projectsDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;findChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;droppedFolder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;projects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectsDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;scanDir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectsDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;네 번째는 다시 wording. "Ctrl+C/V" 레이블이 너무 길어서 앞에 붙어있던 중복 prefix를 뺐다.&lt;/p&gt;

&lt;p&gt;같은 컴포넌트에서 표현 하나 가지고 네 번 고쳤다. 과하다 싶지만, DropZone은 사용자가 제일 먼저 보는 화면이다. 여기서 &lt;strong&gt;5초 안에 뭘 해야 하는지&lt;/strong&gt;가 안 꽂히면, 나머지가 아무리 예뻐도 소용없다.&lt;/p&gt;

&lt;h2&gt;
  
  
  한영 동시 지원이 생각보다 큰 결정이었다
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;i18n (ko/en), theme presets, and hash routing&lt;/code&gt;. 한 커밋에 세 가지가 같이 들어갔다.&lt;/p&gt;

&lt;p&gt;한영 동시 지원을 넣은 건 블로그를 한영 동시 발행하는 나 자신을 위해서다. 근데 더 큰 이유가 있다. &lt;strong&gt;Code Report에 들어가는 문구는 감정적이다.&lt;/strong&gt; "너는 Night Owl이야", "이번 달에만 3,200번 대화했어" 같은 문장들. 이게 영어로만 있으면, 한국어 유저는 "재미"가 반감된다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/i18n.tsx&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;translations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;ko&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.nightSage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;새벽의 현자&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.speedRunner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;번개 해결사&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.nightSage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Night Sage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;personality.speedRunner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Lightning Fixer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Night Sage를 "밤의 현자"가 아니라 "새벽의 현자"로 번역한 것도, "Lightning Fixer"를 "번개 해결사"로 맞춘 것도, 그냥 직역이 아니라 &lt;strong&gt;한국어로 읽었을 때 같은 온도&lt;/strong&gt;가 되도록 골랐다. 성격 유형 여덟 개 이름을 한영 각각 다시 쓰는 데 몇 시간 걸렸다.&lt;/p&gt;

&lt;p&gt;해시 라우팅을 같이 넣은 이유는, Code Report 링크를 공유할 수 있게 만들려면 URL에 상태가 담겨야 하기 때문이다. &lt;code&gt;#wrapped/5&lt;/code&gt; 같은 식으로 슬라이드 번호가 URL에 들어간다. 새로고침해도 돌아올 수 있다.&lt;/p&gt;

&lt;h2&gt;
  
  
  단일 HTML을 버린 순간
&lt;/h2&gt;

&lt;p&gt;기술적인 결정 중에 제일 컸던 건 CLI 구조를 단일 HTML에서 로컬 서버로 바꾼 거다.&lt;/p&gt;

&lt;p&gt;원래는 CLI가 JSONL을 다 파싱해서, JSON으로 직렬화하고, &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; 태그에 박아서 거대한 단일 HTML 파일을 만들었다. 오프라인에서도 돌고 서버도 필요 없고, 깔끔했다.&lt;/p&gt;

&lt;p&gt;근데 내 &lt;code&gt;.claude/projects/&lt;/code&gt;가 커지면서 그 HTML이 18MB를 넘겼다. 브라우저가 &lt;code&gt;JSON.parse&lt;/code&gt; 하는 동안 4~5초 멈춘다. Code Report 시작하기도 전에 끊긴다.&lt;/p&gt;

&lt;p&gt;그래서 로컬 서버(&lt;code&gt;http.createServer&lt;/code&gt;로 3939 포트) + &lt;code&gt;/api/sessions&lt;/code&gt; 스트리밍으로 뒤집었다. 브라우저는 0.6MB짜리 앱 번들만 먼저 받고, 세션은 10개씩 서버에서 당겨온다. 세션 본문은 클릭했을 때만.&lt;/p&gt;

&lt;p&gt;이건 UE5에서 맵 전체 로딩 vs 레벨 스트리밍이랑 똑같다. 예전엔 월드 전체를 RAM에 올렸다. 지금 오픈월드는 플레이어 주변만 스트리밍으로 로드한다. 디스크엔 다 있는데, 메모리엔 필요한 만큼만. 정확히 그 이동이다.&lt;/p&gt;

&lt;p&gt;단일 HTML 모드는 &lt;code&gt;--static&lt;/code&gt; 플래그로 살려뒀다. 세션이 몇 개 없을 땐 여전히 단일 HTML이 제일 간단하다. 상황에 따라 고를 수 있게.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memradar라는 이름, 그리고 한 줄
&lt;/h2&gt;

&lt;p&gt;처음 이름은 Promptale이었다. 프롬프트 + 이야기. 감성적이긴 한데 뭐 하는 툴인지 안 읽힌다.&lt;/p&gt;

&lt;p&gt;Memradar로 바꾸고 나니 의미가 명확해졌다. Mem(memory) + Radar. 내 기억(로그)을 레이더로 훑는다. 내 블로그(radarlog.kr)와도 엮여서 브랜드가 이어진다. 그리고 그 안에 Code Report라는 화면이 들어있다. 제품명과 기능명이 각자 자기 역할을 한다.&lt;/p&gt;

&lt;p&gt;로컬에서 돌려보고 싶으면 한 줄이면 된다.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npx memradar
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;자동으로 &lt;code&gt;~/.claude/projects/&lt;/code&gt;랑 &lt;code&gt;~/.codex/sessions/&lt;/code&gt;를 스캔하고, 브라우저에서 대시보드를 띄워준다. 거기서 Code Report를 열면 풀스크린 회고가 시작된다. 데이터는 전부 로컬에서만 처리된다. 서버에 뭐 안 보낸다.&lt;/p&gt;

&lt;p&gt;웹 버전도 있다. &lt;a href="https://memradar.vercel.app" rel="noopener noreferrer"&gt;memradar.vercel.app&lt;/a&gt;에서 &lt;code&gt;.claude&lt;/code&gt; 폴더를 직접 드래그해도 된다.&lt;/p&gt;

&lt;p&gt;내 로그를 이걸로 처음 봤을 때 들었던 생각은, &lt;strong&gt;Claude한테 생각보다 훨씬 많이 말했다&lt;/strong&gt;는 거였다. 그리고 새벽에 훨씬 집중했다는 것도. 숫자로 보고 나서야 "아 내가 이렇게 살고 있구나"가 와닿았다.&lt;/p&gt;

&lt;p&gt;이게 Code Report의 목적이다. 내가 남긴 대화를, 내가 회고할 수 있게.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"내 로그는 내 디스크에 있었다. 내가 안 봤을 뿐이다."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>ux</category>
    </item>
    <item>
      <title>Happy — The Open-Source App That Puts Claude Code in Your Pocket</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Wed, 15 Apr 2026 11:02:45 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/happy-the-open-source-app-that-puts-claude-code-in-your-pocket-7i2</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/happy-the-open-source-app-that-puts-claude-code-in-your-pocket-7i2</guid>
      <description>&lt;p&gt;Code on &lt;a href="https://github.com/slopus/happy" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Live app at &lt;a href="https://happy.engineering" rel="noopener noreferrer"&gt;happy.engineering&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;I kicked off a refactoring job with Claude Code and went to grab lunch. Came back 20 minutes later. The terminal was sitting on a permission prompt. It had been doing nothing the entire time, just waiting for me to hit "yes."&lt;/p&gt;

&lt;p&gt;If you use Claude Code, you've been there.&lt;/p&gt;

&lt;p&gt;Happy solves this head-on. It's an open-source CLI wrapper that lets you control Claude Code from your phone. Over 17,000 GitHub stars, MIT licensed. But the first question everyone asks is the same one I had.&lt;/p&gt;

&lt;p&gt;"Isn't this just Claude Cowork Dispatch?"&lt;/p&gt;

&lt;p&gt;Short answer: completely different tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Dispatch vs Dispatch vs Happy
&lt;/h2&gt;

&lt;p&gt;The word "Dispatch" shows up in at least three different contexts, and it's confusing. Let me untangle it.&lt;/p&gt;

&lt;p&gt;First, there's &lt;strong&gt;Claude Cowork Dispatch&lt;/strong&gt; — Anthropic's official feature announced in March 2026. You scan a QR code on your desktop, and your phone becomes a remote control for an AI agent. It connects with 38+ apps and is built for non-coding tasks: documents, emails, productivity tools. If you don't write code, this is your tool.&lt;/p&gt;

&lt;p&gt;Then there's &lt;strong&gt;bassimeledath/dispatch&lt;/strong&gt; — an open-source Claude Code skill. Type &lt;code&gt;/dispatch&lt;/code&gt; and your main session becomes an orchestrator. Actual work fans out to background workers, each with a fresh context window. It's about making context windows 10x more efficient. Nothing to do with mobile.&lt;/p&gt;

&lt;p&gt;And then there's &lt;strong&gt;Happy&lt;/strong&gt;. It wraps Claude Code's CLI and lets you remote-control the terminal session itself from a mobile app or web browser.&lt;/p&gt;

&lt;p&gt;Think of it this way. Cowork Dispatch is hiring an assistant. bassimeledath/dispatch is a team lead distributing tasks. Happy is putting your terminal in your pocket.&lt;/p&gt;

&lt;p&gt;Zero overlap.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happy Actually Does
&lt;/h2&gt;

&lt;p&gt;Installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; happy-coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now type &lt;code&gt;happy&lt;/code&gt; instead of &lt;code&gt;claude&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
claude

&lt;span class="c"&gt;# After&lt;/span&gt;
happy

&lt;span class="c"&gt;# Works with Codex too&lt;/span&gt;
happy codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Happy wraps the Claude Code process. Claude Code still runs locally — nothing changes there. But Happy encrypts the session's I/O and relays it through a server to your phone.&lt;/p&gt;

&lt;p&gt;The architecture is three pieces. A CLI program on your machine starts and monitors Claude Code. A relay server passes encrypted blobs between devices. A mobile app decrypts and displays everything. The server only moves encrypted data around — it literally cannot read your code.&lt;/p&gt;

&lt;p&gt;The encryption is solid. X25519 keypairs, ECDH key exchange, AES-256-GCM message encryption, ephemeral session keys for forward secrecy. If you're working with company code, this matters. Your code never exists in plaintext on anyone else's infrastructure.&lt;/p&gt;
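
&lt;p&gt;For intuition, the handshake described above (ephemeral X25519 ECDH feeding an AES-256-GCM session key) can be sketched in a few lines with Python's &lt;code&gt;cryptography&lt;/code&gt; package. This is an illustrative sketch of the general pattern, not Happy's actual implementation; the HKDF info string and variable names are made up.&lt;/p&gt;

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Each side generates an ephemeral X25519 keypair (fresh per session, so
# compromising one session's key reveals nothing about past sessions)
cli_priv = X25519PrivateKey.generate()
app_priv = X25519PrivateKey.generate()

# ECDH: both sides derive the same shared secret from their own private key
# plus the peer's public key
shared_cli = cli_priv.exchange(app_priv.public_key())
shared_app = app_priv.exchange(cli_priv.public_key())
assert shared_cli == shared_app

# Stretch the raw ECDH output into a 256-bit AES-GCM session key
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"session").derive(shared_cli)

# The relay server only ever sees this opaque ciphertext blob
nonce = os.urandom(12)
blob = AESGCM(session_key).encrypt(nonce, b"terminal output: build passed", None)

# Only the phone, holding the same derived key, can decrypt it
plaintext = AESGCM(session_key).decrypt(nonce, blob, None)
print(plaintext.decode())
```

&lt;p&gt;The property that matters is visible in the sketch: the relay handles only &lt;code&gt;blob&lt;/code&gt;, and the key it would need never leaves the two endpoints.&lt;/p&gt;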

&lt;h2&gt;
  
  
  Voice Coding — Skeptical Until I Wasn't
&lt;/h2&gt;

&lt;p&gt;"Code with your voice" sounds like a gimmick. I thought so too. Dictating &lt;code&gt;for (int i = 0; i &amp;lt; n; i++)&lt;/code&gt; into a microphone is obviously terrible.&lt;/p&gt;

&lt;p&gt;But that's not what Happy's voice agent does.&lt;/p&gt;

&lt;p&gt;It does exactly one thing: translates your rambling into structured requests that Claude Code can understand. You're not dictating code. You're saying things like "refactor the auth module to separate concerns" while walking your dog.&lt;/p&gt;

&lt;p&gt;The docs make a good point. Walking around at 50% efficiency for three hours beats sitting at your desk at 100% efficiency for zero hours. 50% of three otherwise-wasted hours is infinitely more than 100% of none.&lt;/p&gt;

&lt;p&gt;The voice agent prompts are customizable inside the app. No forking, no rebuilding. You can tune how it interprets your speech directly. That's practical.&lt;/p&gt;

&lt;p&gt;For someone like me who runs UE5 builds that take ages, this means I can voice-control a Claude Code agent on a side project while waiting for shaders to compile. Build wait time just disappears.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Sessions and Why They Matter
&lt;/h2&gt;

&lt;p&gt;Happy can run multiple Claude Code instances simultaneously. Frontend refactoring in one session, API test generation in another, infra work in a third. Switch between them on your phone.&lt;/p&gt;

&lt;p&gt;Why this matters: Claude Code Max gives you unlimited tokens, but context contamination is real. Running five different tasks in one session means by task three, Claude is losing track of task one. Separate sessions keep each task in a clean context.&lt;/p&gt;

&lt;p&gt;The bassimeledath/dispatch skill focuses on using context windows efficiently within a single session. Happy's multi-session approach physically isolates sessions so they don't interfere. Different strategy, complementary goals.&lt;/p&gt;

&lt;p&gt;In practice, it looks like this. Session A runs a component refactoring. Session B generates missing tests. Both show up on your phone. A permission request comes in on Session A — you approve it while sipping coffee. Session B finishes — you review the output. Neither session knows the other exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Happy Won Over the Alternatives
&lt;/h2&gt;

&lt;p&gt;Happy wasn't the first attempt at mobile Claude Code. CodeRemote, YoloCode, Omnara, Claudia, Conductor, Tonkotsu — they all tried. One user on X posted they'd tried every single one and Happy beat them all.&lt;/p&gt;

&lt;p&gt;Three reasons, I think.&lt;/p&gt;

&lt;p&gt;Open source means transparency. You can audit every line. No telemetry, no tracking. When you're piping company code through a tool, this is decisive. A closed-source app asking for terminal access to your codebase is a non-starter for most teams.&lt;/p&gt;

&lt;p&gt;Self-hosting is straightforward. One Docker Compose file gives you PostgreSQL + Redis + Happy Server in about three minutes. Run it behind your company firewall and the relay server never touches the public internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;happy-server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;happy-server:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgresql://postgres:postgres@postgres:5432/happy-server&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SEED=your-secure-seed&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it's free. This alone beats everything else. When a free open-source tool is better than paid alternatives, the market corrects fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Still Missing
&lt;/h2&gt;

&lt;p&gt;Being honest: Happy isn't perfect.&lt;/p&gt;

&lt;p&gt;Cowork Dispatch integrates natively with 38+ apps. Happy is purely CLI terminal remote control. Sending Slack messages or editing Google Docs isn't what Happy does. That's Cowork Dispatch's territory.&lt;/p&gt;

&lt;p&gt;The last official release (v1.4.0) was September 2025. The npm package rename and monorepo consolidation happened after that, but without formal release tags. Development continues on main, which is great for velocity but can be shaky for stability.&lt;/p&gt;

&lt;p&gt;And there are edge cases. GitHub issues mention stdin-related crashes in background mode. The project is active — 1,500+ commits, 44 contributors — but it's community-driven, not backed by a company.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Should You Actually Use
&lt;/h2&gt;

&lt;p&gt;Here's the decision tree.&lt;/p&gt;

&lt;p&gt;You don't write code and want AI to automate documents, emails, and productivity tools → &lt;strong&gt;Claude Cowork Dispatch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You want to maximize context efficiency within a single Claude Code session with parallel background workers → &lt;strong&gt;bassimeledath/dispatch skill&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You want to control your Claude Code terminal from anywhere with your phone → &lt;strong&gt;Happy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These three aren't competing. They're different layers. You could even combine them: use Happy to access Claude Code from your phone, then run /dispatch inside that session to fan out work to background agents.&lt;/p&gt;

&lt;p&gt;The real bottleneck in the AI coding agent era isn't model performance. It's whether you can control an already-running agent when you're not at your desk. Happy solves that with one CLI wrapper.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We didn't need a better model. We needed a better remote."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>opensource</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>Happy, the Open-Source Tool That Lets You Carry Claude Code in Your Pocket</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Wed, 15 Apr 2026 11:02:44 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/claude-codereul-pone-deulgo-danige-haejuneun-opeunsoseu-happy-2i5n</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/claude-codereul-pone-deulgo-danige-haejuneun-opeunsoseu-happy-2i5n</guid>
      <description>&lt;p&gt;코드는 &lt;a href="https://github.com/slopus/happy" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;에, 앱은 &lt;a href="https://happy.engineering" rel="noopener noreferrer"&gt;happy.engineering&lt;/a&gt;에서 볼 수 있다.&lt;br&gt;
더 많은 글은 &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;에서.&lt;/p&gt;



&lt;p&gt;I kicked off a refactoring job with Claude Code and went out for lunch. Twenty minutes later I checked my phone: no notifications. Back at my desk, the terminal was stuck on a permission prompt. It had spent the entire 20 minutes doing nothing, just waiting.&lt;/p&gt;

&lt;p&gt;If you use Claude Code, you've been there.&lt;/p&gt;

&lt;p&gt;Happy tackles this problem head-on. It's an open-source CLI wrapper that lets you control Claude Code from your phone. Over 17,000 GitHub stars, MIT licensed. But the first question it raises is the obvious one.&lt;/p&gt;

&lt;p&gt;"Isn't this just Claude Cowork Dispatch?"&lt;/p&gt;

&lt;p&gt;Short answer: completely different tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Is Dispatch, and What Is Happy?
&lt;/h2&gt;

&lt;p&gt;"Dispatch"라는 이름이 붙은 게 최소 세 가지다. 먼저 정리하고 가자.&lt;/p&gt;

&lt;p&gt;첫 번째는 &lt;strong&gt;Claude Cowork Dispatch&lt;/strong&gt;. Anthropic이 2026년 3월에 공개한 공식 기능이다. 데스크톱에서 QR 코드를 스캔하면 폰에서 AI 에이전트를 제어할 수 있다. 38개 이상의 앱과 연동되고, 파일 검색이나 요약 같은 비코딩 작업에 최적화되어 있다. 코드를 쓰는 게 아니라 문서, 이메일, 생산성 도구를 다루는 사람들을 위한 물건이다.&lt;/p&gt;

&lt;p&gt;두 번째는 &lt;strong&gt;bassimeledath/dispatch&lt;/strong&gt;. Claude Code의 스킬로 동작하는 오픈소스 프로젝트다. &lt;code&gt;/dispatch&lt;/code&gt;를 치면 메인 세션이 오케스트레이터가 되고, 실제 작업은 백그라운드 워커들이 각자 독립된 컨텍스트 윈도우에서 병렬로 실행한다. 컨텍스트 윈도우를 10배 효율적으로 쓰는 게 핵심이다. 모바일 제어와는 관계 없다.&lt;/p&gt;

&lt;p&gt;세 번째가 &lt;strong&gt;Happy&lt;/strong&gt;. Claude Code의 CLI를 래핑해서 모바일/웹에서 터미널 세션 자체를 원격 제어하는 도구다.&lt;/p&gt;

&lt;p&gt;비유하자면 이렇다. Cowork Dispatch는 비서한테 일을 시키는 거다. bassimeledath/dispatch는 팀장이 팀원들한테 업무를 나누는 거다. Happy는 내가 쓰던 터미널을 주머니에 넣고 돌아다니는 거다.&lt;/p&gt;

&lt;p&gt;겹치는 영역이 전혀 없다.&lt;/p&gt;
&lt;h2&gt;
  
  
  So What Does Happy Actually Do?
&lt;/h2&gt;

&lt;p&gt;Start with installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; happy-coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. From now on, type &lt;code&gt;happy&lt;/code&gt; instead of &lt;code&gt;claude&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
claude

&lt;span class="c"&gt;# Happy&lt;/span&gt;
happy

&lt;span class="c"&gt;# Codex도 된다&lt;/span&gt;
happy codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Happy is a wrapper around the Claude Code process. Claude Code still runs locally, exactly as before. The difference is that the session's I/O gets encrypted, relayed through a server, and displayed in the phone app.&lt;/p&gt;

&lt;p&gt;The architecture is three pieces. A CLI program starts and monitors Claude Code locally. A relay server passes encrypted blobs in both directions. A mobile app decrypts and displays them. Because the server only relays end-to-end-encrypted data, it cannot read your code.&lt;/p&gt;

&lt;p&gt;X25519 keypairs handle the ECDH key exchange, and AES-256-GCM encrypts the messages. Ephemeral per-session keys add forward secrecy. If you work with company code, this part matters: your code is never stored in plaintext anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Voice Coding Actually Usable?
&lt;/h2&gt;

&lt;p&gt;When I first heard "coding by voice," I was honestly skeptical. No matter how good speech recognition gets, writing code out loud is inefficient.&lt;/p&gt;

&lt;p&gt;But that's not what Happy's voice coding is.&lt;/p&gt;

&lt;p&gt;The voice agent does exactly one thing: it turns my rambling into structured requests Claude Code can understand. I'm not dictating code. I'm saying "work out a refactoring plan for the auth module" while out on a walk.&lt;/p&gt;

&lt;p&gt;The official docs raise an interesting point. Compare sitting at the keyboard at 100% efficiency with walking around at 50% efficiency: "3 hours × 50%" beats "0 hours × 100%", because I wasn't going to do anything with that downtime anyway.&lt;/p&gt;

&lt;p&gt;The prompts are also customizable inside the app. No forking, no rebuilding. You can tune the voice agent's behavior directly. That's practical.&lt;/p&gt;

&lt;p&gt;In my case, that means voice-controlling a side project's Claude Code agent while a UE5 build grinds away. Build wait time just disappears.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Parallel Multi-Sessions Matter
&lt;/h2&gt;

&lt;p&gt;Happy can run multiple Claude Code instances at once. Spin up frontend, backend, and infra work as separate sessions and switch between them on your phone.&lt;/p&gt;

&lt;p&gt;Why this matters: a Claude Code Max subscription gives you unlimited tokens, but running several tasks back-to-back in one session contaminates the context. Separate the sessions and each task runs independently in a clean context.&lt;/p&gt;

&lt;p&gt;Where the bassimeledath/dispatch skill mentioned above focuses on using the context window efficiently, Happy's multi-session approach focuses on physically separating sessions so they can't interfere with each other. Different directions.&lt;/p&gt;

&lt;p&gt;In practice it looks like this. Session A is refactoring frontend components. Session B is generating API endpoint tests. I watch both from my phone and approve permission requests as they arrive. While drinking coffee.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Came Before, and Why Happy Won
&lt;/h2&gt;

&lt;p&gt;Happy wasn't the only attempt at mobile Claude Code. There were projects like CodeRemote, YoloCode, Omnara, Claudia, Conductor, and Tonkotsu. One user on X posted that they'd tried them all and Happy won.&lt;/p&gt;

&lt;p&gt;Why did it win? My guess is three reasons.&lt;/p&gt;

&lt;p&gt;It's open source, so it's transparent. You can verify for yourself where your code goes. No telemetry, no tracking. For developers handling sensitive company code, that's decisive. Piping company code through a closed-source app is hard to justify, however convenient it is.&lt;/p&gt;

&lt;p&gt;It's self-hostable. A single Docker Compose file brings up PostgreSQL + Redis + Happy Server in about three minutes. Run it behind your company firewall and even the relay server is never exposed to the outside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;happy-server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;happy-server:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgresql://postgres:postgres@postgres:5432/happy-server&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SEED=your-secure-seed&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it's free. Honestly, this beats everything else. When a paid app and a free open-source tool do the same job and the open-source one is built better, the outcome is obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  But There Are Still Gaps
&lt;/h2&gt;

&lt;p&gt;Honesty is due here. Happy isn't a cure-all.&lt;/p&gt;

&lt;p&gt;Compared to Claude Cowork Dispatch's native integration with 38+ apps, Happy does pure CLI terminal remote control and nothing else. Sending Slack messages or editing Google Docs isn't Happy's territory. That's Cowork Dispatch's.&lt;/p&gt;

&lt;p&gt;Looking at the release notes, the last official release is v1.4.0, from September 2025. The npm package rename from &lt;code&gt;happy-coder&lt;/code&gt; to &lt;code&gt;happy&lt;/code&gt; and the monorepo consolidation came after that, with development continuing on the main branch without formal release tags. That can feel shaky from a stability standpoint.&lt;/p&gt;

&lt;p&gt;External testing also put Cowork Dispatch's success rate on complex tasks at around 50%, and Happy likely has similar edge cases. GitHub issues report crashes in background mode related to stdin.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Should You Use?
&lt;/h2&gt;

&lt;p&gt;To sum up.&lt;/p&gt;

&lt;p&gt;You don't write code and want AI to automate documents, email, and productivity tools → &lt;strong&gt;Claude Cowork Dispatch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You want parallel work with efficient context use inside a single Claude Code session → the &lt;strong&gt;bassimeledath/dispatch&lt;/strong&gt; skill.&lt;/p&gt;

&lt;p&gt;You want to control the Claude Code terminal itself from anywhere → &lt;strong&gt;Happy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These three aren't competitors; they're different layers. You can even combine them: use Happy to connect to a Claude Code session from your phone, then run the /dispatch skill inside it.&lt;/p&gt;

&lt;p&gt;In the AI coding agent era, the real bottleneck isn't model performance. It's whether you can control an already-running agent when you're away from your desk. Happy cracked that bottleneck with a single CLI wrapper.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"더 좋은 모델이 아니라 더 나은 리모컨이 필요했다."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>To Teach AI How to Remember, First Teach It How to Forget 2/2</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:05:32 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/to-teach-ai-how-to-remember-first-teach-it-how-to-forget-22-1ejf</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/to-teach-ai-how-to-remember-first-teach-it-how-to-forget-22-1ejf</guid>
      <description>&lt;p&gt;Code on &lt;a href="https://github.com/zhongwanjun/MemoryBank-SiliconFriend" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Paper on &lt;a href="https://arxiv.org/abs/2305.10250" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;Part 1 covered MemoryBank's architecture and how to use it. This part cracks open the engine. One formula. &lt;code&gt;R = e^(−t/S)&lt;/code&gt;. How it works, what happens when S changes, where to set the threshold — all in numbers.&lt;/p&gt;
&lt;h2&gt;
  
  
  Breaking Down the Formula
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;R = e^(−t/S)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three variables. R is memory retention (between 0 and 1), t is time elapsed since learning, S is memory strength. e is Euler's number, approximately 2.71828.&lt;/p&gt;

&lt;p&gt;This is an exponential decay model. Same structure as the formula for radioactive decay in physics. Values drop dramatically at first, then taper off gradually.&lt;/p&gt;

&lt;p&gt;If you're a game developer, this pattern is everywhere. Alpha decay on particle effects, sound fade-outs, damage-over-time tick reduction. All exponential decay. The forgetting curve is the same math. Just applied to memories instead of damage.&lt;/p&gt;
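
&lt;p&gt;Those game-dev examples and the forgetting curve really are one function; only the names change. A tiny sketch (illustrative constants, not values from the paper):&lt;/p&gt;

```python
import math

def exp_decay(initial, t, time_constant):
    """Generic exponential decay: value(t) = initial * e^(-t / time_constant)."""
    return initial * math.exp(-t / time_constant)

# Particle alpha fading with a 0.5s time constant, sampled at 0.25s
alpha = exp_decay(1.0, t=0.25, time_constant=0.5)

# Damage-over-time tick shrinking with a 3-tick constant, at tick 2
dot = exp_decay(40, t=2, time_constant=3)

# Memory retention after 2 days with strength S=3: the same formula
retention = exp_decay(1.0, t=2, time_constant=3)

print(round(alpha, 3), round(dot, 1), round(retention, 3))
# → 0.607 20.5 0.513
```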

&lt;p&gt;Here's the key insight. What really matters in this formula is the &lt;strong&gt;t/S ratio&lt;/strong&gt;. Not the individual values of t and S, but how they relate. When t/S equals 1, R is about 0.368 (36.8%). At t/S = 2, R drops to about 0.135 (13.5%). At t/S = 3, it's roughly 0.050 (5.0%). Each unit increase in the ratio multiplies retention by 1/e, cutting it to about 36.8% of its previous value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="c1"&gt;# Retention by t/S ratio
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t/S = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → R = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# t/S = 0.0 → R = 1.0000 (100.0%)
# t/S = 0.5 → R = 0.6065 (60.7%)
# t/S = 1.0 → R = 0.3679 (36.8%)
# t/S = 2.0 → R = 0.1353 (13.5%)
# t/S = 3.0 → R = 0.0498 (5.0%)
# t/S = 5.0 → R = 0.0067 (0.7%)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implication is clear. A larger S means the t/S ratio stays small for longer, so R stays high. S &lt;strong&gt;flattens the curve&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When S Increases
&lt;/h2&gt;

&lt;p&gt;In MemoryBank, S is an integer. First mention: 1. One recall: 2. Another recall: 3. Simple. Let's see exactly what this simple change does to the curve.&lt;/p&gt;

&lt;p&gt;Using days as the time unit for t. What's the retention of an S=1 memory after one day?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Retention by S value and elapsed days
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- S = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t_days&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t_days&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  t=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_days&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;d → R = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S=1:  Day 0 100% → Day 1 36.8% → Day 2 13.5% → Day 3 5.0% → Day 7 0.1%
S=2:  Day 0 100% → Day 1 60.7% → Day 2 36.8% → Day 3 22.3% → Day 7 3.0%
S=3:  Day 0 100% → Day 1 71.7% → Day 2 51.3% → Day 3 36.8% → Day 7 9.7%
S=5:  Day 0 100% → Day 1 81.9% → Day 2 67.0% → Day 3 54.9% → Day 7 24.7%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An S=1 memory crashes to 36.8% after one day. After a week, 0.1%. Effectively dead.&lt;/p&gt;

&lt;p&gt;At S=2, 60.7% survives after the same day. At S=5, 81.9% after a day. Even after a full week, 24.7% remains.&lt;/p&gt;

&lt;p&gt;Since S goes up by 1 with each recall, a memory recalled 3 times (S=4) retains 17.4% after a week. Recalled 5 times (S=6) retains 31.1% after a week. The curve visibly flattens.&lt;/p&gt;

&lt;p&gt;In game terms, think "buff duration increase from stacking." Apply the same buff multiple times, duration gets longer each time. More stacks, longer effect. MemoryBank's S works exactly like that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mathematical Impact of Resetting t
&lt;/h2&gt;

&lt;p&gt;The S increase matters, but &lt;strong&gt;resetting t to 0&lt;/strong&gt; actually creates the more dramatic effect.&lt;/p&gt;

&lt;p&gt;Picture this scenario. A memory sits at S=1, and 3 days have passed. R equals &lt;code&gt;e^(−3/1)&lt;/code&gt; = 0.050, or 5.0%. Nearly gone.&lt;/p&gt;

&lt;p&gt;At this moment, the memory gets recalled in conversation. MemoryBank bumps S to 2 and resets t to 0. Instantly, R becomes &lt;code&gt;e^(−0/2)&lt;/code&gt; = 1.000 — back to 100%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before recall:  S=1, t=3 days → R = 5.0%  (nearly dead)
After recall:   S=2, t=0 days → R = 100%  (fully revived)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5% to 100%. That jump is everything.&lt;/p&gt;

&lt;p&gt;One more thing. The revived memory is &lt;strong&gt;stronger than before&lt;/strong&gt;. S went from 1 to 2, so the next time 3 days pass, R won't be 5.0% — it'll be 22.3%. In its first life, 3 days was almost fatal. In its second life, 3 days is perfectly survivable.&lt;/p&gt;

&lt;p&gt;Repeat this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Recall simulation
# Memory recalled every 3 days
&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cycle&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 3 days pass
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="n"&gt;R_before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Recall occurs
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cycle &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cycle&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Pre-recall R=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R_before&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R_before&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; → Post-recall S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, R=100%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cycle 1: Pre-recall R=0.0498 (5.0%)  → Post-recall S=2, R=100%
# Cycle 2: Pre-recall R=0.2231 (22.3%) → Post-recall S=3, R=100%
# Cycle 3: Pre-recall R=0.3679 (36.8%) → Post-recall S=4, R=100%
# Cycle 4: Pre-recall R=0.4724 (47.2%) → Post-recall S=5, R=100%
# Cycle 5: Pre-recall R=0.5488 (54.9%) → Post-recall S=6, R=100%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "pre-recall retention" for a memory recalled every 3 days climbs from 5.0% → 22.3% → 36.8% → 47.2% → 54.9%. Same 3-day gap, but accumulated recalls make the memory increasingly resilient.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;spacing effect&lt;/strong&gt; that Ebbinghaus discovered, naturally embedded in the formula. Repeated review flattens the forgetting curve — that principle, encoded in math.&lt;/p&gt;
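&lt;p&gt;To see the spacing effect from another angle, here's a small sketch (my own, not from the paper) comparing two review cadences under the same rule: each review bumps S by 1, and we check retention just before each review.&lt;/p&gt;

```python
import math

def pre_review_retention(interval_days, n_reviews):
    """Pre-review R for each review, reviewing every `interval_days` days, S starting at 1."""
    S = 1
    history = []
    for _ in range(n_reviews):
        history.append(math.exp(-interval_days / S))
        S += 1  # each review strengthens the memory
    return history

# Same number of reviews, different spacing
tight = pre_review_retention(2, 4)   # review every 2 days
loose = pre_review_retention(5, 4)   # review every 5 days

for i, (a, b) in enumerate(zip(tight, loose), 1):
    print(f"Review {i}: every-2-days R={a:.3f}, every-5-days R={b:.3f}")
```

&lt;p&gt;Both sequences climb, which is the spacing effect again: accumulated S flattens the curve regardless of the cadence.&lt;/p&gt;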

&lt;h2&gt;
  
  
  Where to Set the Threshold
&lt;/h2&gt;

&lt;p&gt;The MemoryBank paper doesn't specify an exact threshold number. But to implement this, you need to decide: "At what R do we delete a memory?"&lt;/p&gt;

&lt;p&gt;Reframe the question first. "How many days without recall should it take for a memory to be forgotten?"&lt;/p&gt;

&lt;p&gt;The baseline is an S=1 memory (never recalled). How long it survives depends on the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Time to forget for S=1, by threshold
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# R = e^(-t/S) → t = -S * ln(R)
&lt;/span&gt;    &lt;span class="n"&gt;t_forget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → S=1 memory dies after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_forget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Threshold 0.5  → S=1 memory dies after 0.69 days (~17 hours)
# Threshold 0.3  → S=1 memory dies after 1.20 days (~29 hours)
# Threshold 0.1  → S=1 memory dies after 2.30 days
# Threshold 0.05 → S=1 memory dies after 3.00 days
# Threshold 0.01 → S=1 memory dies after 4.61 days
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Threshold 0.5 means memories vanish in 17 hours. Too aggressive. Yesterday's conversation is already gone today.&lt;/p&gt;

&lt;p&gt;Threshold 0.01 means 4.6 days of survival. Too loose. Meaningless chatter lingers for almost 5 days, wasting memory space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical sweet spot is 0.05 to 0.1.&lt;/strong&gt; S=1 memories naturally disappear within 2–3 days, while S=3+ memories (recalled twice or more) survive roughly a week (6.9 days at threshold 0.1, 9 days at 0.05).&lt;/p&gt;
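&lt;p&gt;A quick sanity check on that sweet spot, assuming threshold 0.1: survival time without recall is t = S * ln(1/threshold).&lt;/p&gt;

```python
import math

THRESHOLD = 0.1  # assumed; the paper doesn't fix a value

# Days a memory survives with no recall before R drops below the threshold
for S in [1, 2, 3, 5]:
    t_survive = -S * math.log(THRESHOLD)
    print(f"S={S} survives {t_survive:.1f} days without recall")

# S=1 survives 2.3 days without recall
# S=3 survives 6.9 days without recall
```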

&lt;p&gt;You can also work backwards. If "important memories must survive at least 7 days" is a requirement, you can reverse-engineer S and the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "What minimum S is needed to survive 7 days?"
&lt;/span&gt;&lt;span class="n"&gt;target_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;

&lt;span class="c1"&gt;# R = e^(-t/S) ≥ threshold
# S ≥ -t / ln(threshold)
&lt;/span&gt;&lt;span class="n"&gt;S_min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;target_days&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To have R ≥ 0.1 after 7 days, need S ≥ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S_min&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# To have R ≥ 0.1 after 7 days, need S ≥ 3.04
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;S needs to be at least 4 (recalled 3 times) to stay above the 0.1 threshold after 7 days. Tune these parameters to fit your service's characteristics.&lt;/p&gt;
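&lt;p&gt;The reverse-engineering step can be wrapped in a small helper. This is a sketch; &lt;code&gt;min_recalls_to_survive&lt;/code&gt; is a name of my own, not from the paper.&lt;/p&gt;

```python
import math

def min_recalls_to_survive(target_days, threshold):
    """Smallest integer S keeping R at or above `threshold` after
    `target_days` with no recall. Since S starts at 1, the recall
    count needed is S - 1."""
    S_min = -target_days / math.log(threshold)
    S = math.ceil(S_min)
    return S, S - 1  # (minimum S, recalls needed)

S, recalls = min_recalls_to_survive(7, 0.1)
print(f"Need S={S} ({recalls} recalls) to survive 7 days at threshold 0.1")
# Need S=4 (3 recalls) to survive 7 days at threshold 0.1
```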

&lt;h2&gt;
  
  
  Thinking in Half-Lives
&lt;/h2&gt;

&lt;p&gt;The most intuitive metric for exponential decay is the &lt;strong&gt;half-life&lt;/strong&gt; — how long until retention drops to 50%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# R = e^(-t/S) = 0.5
# t_half = S * ln(2) ≈ S * 0.693
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;t_half&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → half-life = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_half&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# S= 1 → half-life = 0.69 days (~17 hours)
# S= 2 → half-life = 1.39 days (~33 hours)
# S= 3 → half-life = 2.08 days (~50 hours)
# S= 5 → half-life = 3.47 days
# S=10 → half-life = 6.93 days (~1 week)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half-life is directly proportional to S. Double S, double the half-life. Linear relationship.&lt;/p&gt;

&lt;p&gt;A never-recalled memory (S=1) has a half-life of 17 hours. Half gone before the day is over. A twice-recalled memory (S=3) has a half-life of 2 days. Nine recalls (S=10) gets you nearly a week.&lt;/p&gt;

&lt;p&gt;These numbers make MemoryBank's design intent crystal clear. &lt;strong&gt;Topics that come up frequently are remembered longer. One-off mentions fade fast.&lt;/strong&gt; Same pattern as human memory.&lt;/p&gt;

&lt;p&gt;The game analogy here is aggro decay. When a player stops dealing damage to a boss, aggro decays over time. Keep hitting, and aggro stays high and builds. Damage dealt is "recall," aggro value is "retention." Same exponential dynamics.&lt;/p&gt;
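&lt;p&gt;The analogy maps directly onto the formula. A toy sketch (my own framing, not MemoryBank code) where each hit acts as a recall, resetting the decay clock and deepening the boss's "memory" of the attacker:&lt;/p&gt;

```python
import math

class Aggro:
    """Toy aggro model reusing R = e^(-t/S): hits play the role of recalls."""
    def __init__(self):
        self.S = 1      # builds with every hit
        self.t = 0.0    # seconds since last hit

    def hit(self):
        self.S += 1
        self.t = 0.0

    def tick(self, dt):
        self.t += dt

    @property
    def level(self):
        return math.exp(-self.t / self.S)

boss = Aggro()
boss.hit()        # S=2
boss.tick(4.0)    # 4 seconds of silence
print(f"aggro after pause: {boss.level:.3f}")   # e^(-4/2) = 0.135
```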

&lt;h2&gt;
  
  
  Simulation: The Fate of 5 Memories Over 10 Days
&lt;/h2&gt;

&lt;p&gt;Let's run a real scenario. Five memories, different recall patterns, 10 days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;

&lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lunch_menu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UE5_bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekend_camp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_talk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Recall schedule (by day)
&lt;/span&gt;&lt;span class="n"&gt;recall_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;# Frequent recalls
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lunch_menu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;[],&lt;/span&gt;               &lt;span class="c1"&gt;# Never recalled
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UE5_bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;           &lt;span class="c1"&gt;# Occasional recalls
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekend_camp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;              &lt;span class="c1"&gt;# Recalled once
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_talk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# Very frequent recalls
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Day &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;last_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_event&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recall_schedule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALIVE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, t=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;d, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%) [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key takeaways from the results.&lt;/p&gt;

&lt;p&gt;"lunch_menu" was never recalled. S=1, R hits 13.5% by Day 2 and 5.0% by Day 3. Dead at the 0.1 threshold on Day 3.&lt;/p&gt;

&lt;p&gt;"job_change" was recalled on Days 1, 3, 5, and 8. By Day 10, S=5, 2 days since last recall. R equals &lt;code&gt;e^(−2/5)&lt;/code&gt; = 67.0%. Still healthy.&lt;/p&gt;

&lt;p&gt;"salary_talk" was recalled almost daily starting Day 4. By Day 10, S=6, 1 day since last recall. R equals &lt;code&gt;e^(−1/6)&lt;/code&gt; = 84.6%. The strongest memory.&lt;/p&gt;

&lt;p&gt;"weekend_camp" was recalled exactly once on Day 3. S bumped to 2, but then 7 days of silence. By Day 10, R equals &lt;code&gt;e^(−7/2)&lt;/code&gt; = 3.0%. Dead.&lt;/p&gt;

&lt;p&gt;Same 10 days, completely different fates based on recall patterns. That's the core mechanism of MemoryBank at work.&lt;/p&gt;
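&lt;p&gt;In the simulation, DEAD is only a label. A real system would actually prune. A minimal sketch, assuming the same dict shape as above (&lt;code&gt;prune&lt;/code&gt; is my own helper, not from the paper):&lt;/p&gt;

```python
import math

def prune(memories, threshold=0.1):
    """Delete entries whose retention has fallen below the threshold.
    Each value is expected to carry 't' (days since last event) and 'S'."""
    dead = [name for name, m in memories.items()
            if threshold > math.exp(-m['t'] / m['S'])]
    for name in dead:
        del memories[name]
    return dead

mems = {
    "lunch_menu": {"S": 1, "t": 3},   # e^-3 = 0.050, below 0.1
    "job_change": {"S": 5, "t": 2},   # e^-0.4 = 0.670, alive
}
print(prune(mems))   # ['lunch_menu']
```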

&lt;h2&gt;
  
  
  Mathematical Limitations of the MemoryBank Model
&lt;/h2&gt;

&lt;p&gt;The formula is clean. The gaps with reality are equally clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear S increase.&lt;/strong&gt; S goes up by exactly 1 per recall. But real human memory reinforcement is nonlinear. The first review has the biggest impact, and subsequent reviews show diminishing returns. Something like &lt;code&gt;S_new = S_old + 1/log(S_old + 1)&lt;/code&gt; would be more realistic than a flat &lt;code&gt;S_new = S_old + 1&lt;/code&gt;.&lt;/p&gt;
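&lt;p&gt;Here's a sketch contrasting the paper's flat +1 with that diminishing-returns variant. The logarithmic rule is the suggestion above, not something MemoryBank itself implements.&lt;/p&gt;

```python
import math

S_linear, S_log = 1.0, 1.0
for recall in range(1, 11):
    S_linear += 1
    S_log += 1 / math.log(S_log + 1)  # boost shrinks as S grows
    print(f"recall {recall:2d}: linear S={S_linear:4.1f}, log-variant S={S_log:.2f}")
```

&lt;p&gt;The log variant still grows without bound, just ever more slowly, which matches the intuition that the tenth review matters less than the first.&lt;/p&gt;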

&lt;p&gt;&lt;strong&gt;No emotional weighting.&lt;/strong&gt; Ebbinghaus's original experiments used meaningless syllables — things like WID and ZOF. He himself acknowledged that meaningful information is forgotten roughly 10 times more slowly. MemoryBank initializes S at 1 for everything — career doubts and lunch choices alike. Emotional significance isn't factored in. An extension could use LLM-based sentiment analysis to assign differential S_init values (say, 1–3) based on emotional intensity.&lt;/p&gt;
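&lt;p&gt;A hedged sketch of that extension: &lt;code&gt;emotional_s_init&lt;/code&gt; and the 1–3 range are illustrative assumptions, and in practice the intensity score would come from an LLM sentiment call rather than a hand-set number.&lt;/p&gt;

```python
def emotional_s_init(intensity):
    """Map emotional intensity in [0, 1] to an initial strength in [1, 3].
    Both the range and the linear mapping are illustrative assumptions."""
    intensity = max(0.0, min(1.0, intensity))
    return 1 + 2 * intensity

print(emotional_s_init(0.0))   # 1.0  (lunch chatter)
print(emotional_s_init(0.5))   # 2.0  (moderately significant)
```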

&lt;p&gt;&lt;strong&gt;Ambiguous time units.&lt;/strong&gt; The paper doesn't specify what unit t uses. Days? Hours? Minutes? Conversation sessions? The curve's shape changes entirely depending on the unit. This is the first parameter to lock down when deploying in production.&lt;/p&gt;
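&lt;p&gt;To see how much the unit matters, a quick sketch: the same S=1 memory, one real day after creation, scored with t counted in days versus hours.&lt;/p&gt;

```python
import math

S = 1
one_day_in_days = 1     # t measured in days
one_day_in_hours = 24   # the same elapsed time, t measured in hours

print(f"t in days : R={math.exp(-one_day_in_days / S):.4f}")   # 0.3679
print(f"t in hours: R={math.exp(-one_day_in_hours / S):.2e}")  # ~3.8e-11
```

&lt;p&gt;Same memory, same elapsed time: 36.8% retention in one reading, effectively zero in the other. The unit is not a detail.&lt;/p&gt;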

&lt;p&gt;&lt;strong&gt;Recall detection criteria.&lt;/strong&gt; "This memory was recalled during conversation" is determined by FAISS search results. Does appearing in top-k count as recall? Or does it need to actually influence the response? The answer changes how frequently S gets incremented. If memories that were retrieved but never used in the response still get their S bumped, memory strength gets overestimated.&lt;/p&gt;
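&lt;p&gt;One way to tighten the criterion, sketched under the assumption that the generation step can report which retrieved memories it actually used. The &lt;code&gt;used_ids&lt;/code&gt; feedback is hypothetical; MemoryBank doesn't specify this.&lt;/p&gt;

```python
def update_strength(memories, retrieved_ids, used_ids):
    """Bump S only for memories that influenced the response,
    not for everything FAISS happened to return in top-k."""
    for mem_id in retrieved_ids:
        if mem_id in used_ids:  # strict criterion: actually used
            memories[mem_id]["S"] += 1
            memories[mem_id]["t"] = 0

mems = {"a": {"S": 1, "t": 2}, "b": {"S": 1, "t": 2}}
update_strength(mems, retrieved_ids=["a", "b"], used_ids={"a"})
print(mems["a"]["S"], mems["b"]["S"])   # 2 1
```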

&lt;p&gt;These limitations are also expansion directions. The authors stating this is "an exploratory and highly simplified model" means the formula is a &lt;strong&gt;starting point&lt;/strong&gt;, not the final answer. Layer emotional weighting, nonlinear S growth, and context-aware recall detection on top of the base formula, and you get a far more sophisticated memory system.&lt;/p&gt;




&lt;p&gt;&lt;code&gt;R = e^(−t/S)&lt;/code&gt;. One formula explains the birth, reinforcement, and death of memories. Not a complex memory architecture — a 140-year-old psychological principle, transplanted onto LLMs. Simple but effective. And because it's simple, it's extensible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The simpler the formula, the stronger it is. Complexity is easy to implement. Simplicity is easy to extend."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>memorybank</category>
      <category>math</category>
    </item>
    <item>
      <title>To Teach an AI to Remember, You Have to Teach It to Forget First 2/2</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:05:31 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/aihante-gieogeul-gareuciryeomyeon-ijneun-beobbuteo-gareucyeoya-handa-22-185b</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/aihante-gieogeul-gareuciryeomyeon-ijneun-beobbuteo-gareucyeoya-handa-22-185b</guid>
      <description>&lt;p&gt;The code is on &lt;a href="https://github.com/zhongwanjun/MemoryBank-SiliconFriend" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, and the paper is on &lt;a href="https://arxiv.org/abs/2305.10250" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;Part 1 covered MemoryBank's architecture and how to use it. This part opens up its heart. One formula: &lt;code&gt;R = e^(−t/S)&lt;/code&gt;. We'll dig into the numbers: how the formula works, how the curve changes as S changes, and where to set the threshold.&lt;/p&gt;
&lt;h2&gt;
  
  
  Breaking Down the Formula
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;R = e^(−t/S)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are three variables. R is memory retention (between 0 and 1), t is time elapsed since learning, and S is memory strength. e is Euler's number, 2.71828.&lt;/p&gt;

&lt;p&gt;This is an exponential decay model, with the same structure as the formula for radioactive decay in physics. The value shrinks geometrically over time: it drops sharply at first, then more and more slowly.&lt;/p&gt;

&lt;p&gt;Game developers will recognize the pattern. Particle alpha falloff, audio fade-outs, damage-over-time (DoT) tick decay: all built on exponential decay. The forgetting curve is the same math; the target just happens to be memories instead of damage.&lt;/p&gt;

&lt;p&gt;Here is the key point: what really matters in this formula is the &lt;strong&gt;t/S ratio&lt;/strong&gt;. R is determined by the ratio of the two, not by t or S individually. At t/S = 1, R is about 0.368 (36.8%). At t/S = 2, about 0.135 (13.5%). At t/S = 3, about 0.050 (5.0%). Each increase of 1 in the ratio shrinks retention by roughly a factor of 1/e (about 36.8%).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="c1"&gt;# t/S 비율에 따른 보유율
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t/S = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → R = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# t/S = 0.0 → R = 1.0000 (100.0%)
# t/S = 0.5 → R = 0.6065 (60.7%)
# t/S = 1.0 → R = 0.3679 (36.8%)
# t/S = 2.0 → R = 0.1353 (13.5%)
# t/S = 3.0 → R = 0.0498 (5.0%)
# t/S = 5.0 → R = 0.0067 (0.7%)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this means is clear. The larger S is, the smaller the t/S ratio after the same elapsed time, so R stays higher. S's job is to &lt;strong&gt;flatten the slope of the curve&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Curve Changes as S Rises
&lt;/h2&gt;

&lt;p&gt;In MemoryBank, S is an integer: 1 when a memory is first mentioned, 2 after one recall, 3 after another. Simple. Let's look concretely at what this simple change does to the curve.&lt;/p&gt;

&lt;p&gt;Take the elapsed time t in days. If a memory has S=1, what is R after one day?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# S값별, 경과일별 보유율
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- S = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t_days&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t_days&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  t=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_days&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;일 → R = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the results together gives this picture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S=1일 때:  0일 100% → 1일 36.8% → 2일 13.5% → 3일 5.0% → 7일 0.1%
S=2일 때:  0일 100% → 1일 60.7% → 2일 36.8% → 3일 22.3% → 7일 3.0%
S=3일 때:  0일 100% → 1일 71.7% → 2일 51.3% → 3일 36.8% → 7일 9.7%
S=5일 때:  0일 100% → 1일 81.9% → 2일 67.0% → 3일 54.9% → 7일 24.7%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A memory with S=1 crashes to 36.8% within a day. After a week, 0.1%. Effectively gone.&lt;/p&gt;

&lt;p&gt;With S=2, 60.7% survives the same single day. With S=5, 81.9% remains after a day, and 24.7% is still alive after a full week.&lt;/p&gt;

&lt;p&gt;Since S rises by 1 on each recall, a memory recalled 3 times (S=4) still retains 17.4% after a week. One recalled 5 times (S=6) keeps 31.1%. The curve visibly flattens out.&lt;/p&gt;

&lt;p&gt;A game analogy: it's like buff duration scaling with accumulated stacks. Cast the same buff repeatedly and it lasts longer each time; the more stacks pile up, the longer the effect persists. MemoryBank's S works exactly the same way.&lt;/p&gt;
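&lt;p&gt;To make the analogy concrete, here is a small sketch (the stack framing is illustrative, not from MemoryBank) that treats S as buff stacks and asks how long the "buff" stays above a 10% threshold, using t = −S·ln(threshold).&lt;/p&gt;

```python
import math

# Illustrative analogy: S as buff "stacks". Solving R = e^(-t/S) = threshold
# for t gives t = -S * ln(threshold), so duration scales linearly with stacks.
def duration_above(S, threshold=0.1):
    return -S * math.log(threshold)

for stacks in [1, 2, 3, 5]:
    print(f"S={stacks} stacks -> above 10% for {duration_above(stacks):.2f} days")

# S=1 stacks -> above 10% for 2.30 days
# S=2 stacks -> above 10% for 4.61 days
# S=3 stacks -> above 10% for 6.91 days
# S=5 stacks -> above 10% for 11.51 days
```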

&lt;h2&gt;
  
  
  What Resetting t Means Mathematically
&lt;/h2&gt;

&lt;p&gt;More than S going up, it's &lt;strong&gt;t resetting to 0&lt;/strong&gt; that actually produces the more dramatic effect.&lt;/p&gt;

&lt;p&gt;Picture a scenario. A memory has sat at S=1 for 3 days. R is &lt;code&gt;e^(−3/1)&lt;/code&gt; = 0.050, i.e. 5.0%. It is on the verge of disappearing.&lt;/p&gt;

&lt;p&gt;At that moment the memory comes up in conversation. MemoryBank raises S to 2 and resets t to 0. Instantly R returns to &lt;code&gt;e^(−0/2)&lt;/code&gt; = 1.000, i.e. 100%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;회상 전:  S=1, t=3일 → R = 5.0%  (거의 소멸)
회상 후:  S=2, t=0일 → R = 100%  (완전 부활)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From 5% to 100%. That jump is the crux.&lt;/p&gt;

&lt;p&gt;One more thing: the revived memory is &lt;strong&gt;stronger than before&lt;/strong&gt;. Since S rose from 1 to 2, the next time the same 3 days pass, R will be 22.3% rather than 5.0%. In its first life it was nearly dead after 3 days; in its second life, 3 days later it is still going.&lt;/p&gt;

&lt;p&gt;Repeat this and a pattern emerges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 회상 시나리오 시뮬레이션
# 기억이 3일마다 회상되는 경우
&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cycle&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 3일 경과
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="n"&gt;R_before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 회상 발생
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;R_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;사이클 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cycle&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: 회상 전 R=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R_before&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R_before&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; → 회상 후 S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, R=100%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 사이클 1: 회상 전 R=0.0498 (5.0%)  → 회상 후 S=2, R=100%
# 사이클 2: 회상 전 R=0.2231 (22.3%) → 회상 후 S=3, R=100%
# 사이클 3: 회상 전 R=0.3679 (36.8%) → 회상 후 S=4, R=100%
# 사이클 4: 회상 전 R=0.4724 (47.2%) → 회상 후 S=5, R=100%
# 사이클 5: 회상 전 R=0.5488 (54.9%) → 회상 후 S=6, R=100%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a memory recalled every 3 days, the "retention just before recall" climbs 5.0% → 22.3% → 36.8% → 47.2% → 54.9%. The same 3 days elapse each time, but the more recalls accumulate, the better the memory holds.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;spacing effect&lt;/strong&gt; Ebbinghaus discovered is baked right into the formula: the principle that repeated review makes the forgetting curve progressively flatter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Should the Threshold Go?
&lt;/h2&gt;

&lt;p&gt;The MemoryBank paper doesn't specify a concrete threshold. But any real implementation has to decide: below what value of R does a memory get deleted?&lt;/p&gt;

&lt;p&gt;To pick a threshold, first reframe the question: "After how many days of not being recalled should a memory be forgotten?"&lt;/p&gt;

&lt;p&gt;The baseline is a memory with S=1 (never recalled). How long it takes to be forgotten depends on the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# S=1일 때, 임계값별 "잊혀지는 시간"
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# R = e^(-t/S) → t = -S * ln(R)
&lt;/span&gt;    &lt;span class="n"&gt;t_forget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;임계값 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → S=1 기억이 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_forget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;일 후 소멸&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 임계값 0.5  → S=1 기억이 0.69일(약 17시간) 후 소멸
# 임계값 0.3  → S=1 기억이 1.20일(약 29시간) 후 소멸
# 임계값 0.1  → S=1 기억이 2.30일 후 소멸
# 임계값 0.05 → S=1 기억이 3.00일 후 소멸
# 임계값 0.01 → S=1 기억이 4.61일 후 소멸
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a threshold of 0.5, a memory vanishes in 17 hours. Too aggressive: yesterday's conversation is already gone today.&lt;/p&gt;

&lt;p&gt;At 0.01, it hangs on for 4.6 days. A bit loose: trivial chatter lingering for nearly 5 days makes the memory store inefficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A threshold between 0.05 and 0.1 is the practical range.&lt;/strong&gt; S=1 memories fade naturally within 2~3 days, while memories with S=3 or higher (recalled at least twice) survive past a week.&lt;/p&gt;

&lt;p&gt;You can also run this in reverse. If the requirement is "important memories in our service must last at least 7 days," you can back-solve for S and the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "7일 후에도 살아남으려면 S가 최소 얼마여야 하나?"
&lt;/span&gt;&lt;span class="n"&gt;target_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;

&lt;span class="c1"&gt;# R = e^(-t/S) ≥ threshold
# -t/S ≥ ln(threshold)
# S ≥ -t / ln(threshold)
&lt;/span&gt;&lt;span class="n"&gt;S_min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;target_days&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7일 후 R ≥ 0.1 이려면 S ≥ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S_min&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 7일 후 R ≥ 0.1 이려면 S ≥ 3.04
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;S has to be at least 4 (3 recalls) to stay above a 0.1 threshold after 7 days. You can tune the parameters this way to fit your service's needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seen Through Half-Life
&lt;/h2&gt;

&lt;p&gt;The most intuitive metric for exponential decay is the &lt;strong&gt;half-life&lt;/strong&gt;: the time it takes retention to fall to 50%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# R = e^(-t/S) = 0.5
# -t/S = ln(0.5)
# t_half = S * ln(2) ≈ S * 0.693
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;t_half&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → 반감기 = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_half&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;일&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# S= 1 → 반감기 = 0.69일 (약 17시간)
# S= 2 → 반감기 = 1.39일 (약 33시간)
# S= 3 → 반감기 = 2.08일 (약 50시간)
# S= 5 → 반감기 = 3.47일
# S=10 → 반감기 = 6.93일 (약 1주일)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half-life is directly proportional to S: double S and the half-life doubles. A linear relationship.&lt;/p&gt;

&lt;p&gt;A never-recalled memory (S=1) has a half-life of 17 hours: half of it is gone in under a day. A memory recalled twice (S=3) has a half-life of 2 days, and one recalled 9 times (S=10) has a half-life of nearly a week.&lt;/p&gt;

&lt;p&gt;These half-life numbers make MemoryBank's design intent vivid: &lt;strong&gt;remember recent, frequently raised topics for a long time, and quickly forget things mentioned once in passing.&lt;/strong&gt; The same pattern as human memory.&lt;/p&gt;

&lt;p&gt;For a game parallel, think of aggro decay. If a player stops dealing damage to a boss, their aggro decays over time; keep dealing damage and the aggro holds and grows. Dealing damage is the "recall," and the aggro value is the "memory retention."&lt;/p&gt;
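&lt;p&gt;A minimal sketch of that mapping (my analogy, nothing from the paper): threat decays on the same curve, and each hit plays the role of a recall.&lt;/p&gt;

```python
import math

# Illustrative aggro sketch: threat decays exponentially while the player
# deals no damage; a hit resets the clock and raises S, just like a recall.
def threat(base, seconds_idle, S):
    return base * math.exp(-seconds_idle / S)

print(round(threat(100.0, 5.0, 5.0), 1))  # 36.8 after 5s idle at S=5

# A hit would reset seconds_idle to 0 and bump S, restoring full threat.
```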

&lt;h2&gt;
  
  
  Simulation: The Fate of 5 Memories over 10 Days
&lt;/h2&gt;

&lt;p&gt;Let's run a real scenario: a 10-day simulation of 5 memories, each recalled on a different pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;

&lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;이직 고민&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;점심 메뉴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UE5 버그&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;주말 캠핑&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;연봉 협상&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 회상 스케줄 (일 단위)
&lt;/span&gt;&lt;span class="n"&gt;recall_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;이직 고민&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# 자주 회상
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;점심 메뉴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;              &lt;span class="c1"&gt;# 한 번도 안 꺼냄
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UE5 버그&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;          &lt;span class="c1"&gt;# 가끔 회상
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;주말 캠핑&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;             &lt;span class="c1"&gt;# 한 번만 회상
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;연봉 협상&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# 매우 자주 회상
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== 10일 시뮬레이션 ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Day &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# 경과 시간 계산
&lt;/span&gt;        &lt;span class="n"&gt;last_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;born&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_event&lt;/span&gt;

        &lt;span class="c1"&gt;# 이 날 회상되는가?
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recall_schedule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recalls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALIVE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: S=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, t=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;일, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%) [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pulling out just the essentials from the output:&lt;/p&gt;

&lt;p&gt;"점심 메뉴"는 한 번도 회상되지 않았다. S=1, Day 2에서 이미 R이 13.5%로 떨어지고, Day 3이면 5.0%. 임계값 0.1 기준으로 Day 3에 소멸한다.&lt;/p&gt;

&lt;p&gt;"이직 고민"은 Day 1, 3, 5, 8에 회상됐다. Day 10이 되면 S=5, 마지막 회상으로부터 2일 경과. R은 &lt;code&gt;e^(−2/5)&lt;/code&gt; = 67.0%. 아직 건강하다.&lt;/p&gt;

&lt;p&gt;"연봉 협상"은 Day 4부터 거의 매일 회상됐다. Day 10이면 S=6, 마지막 회상으로부터 1일 경과. R은 &lt;code&gt;e^(−1/6)&lt;/code&gt; = 84.6%. 가장 강한 기억이다.&lt;/p&gt;

&lt;p&gt;"주말 캠핑"은 Day 3에 딱 한 번 회상됐다. S=2로 올랐지만, 그 후 7일 동안 아무도 안 꺼냈다. Day 10에서 R은 &lt;code&gt;e^(−7/2)&lt;/code&gt; = 3.0%. 소멸이다.&lt;/p&gt;

&lt;p&gt;Same 10 days, yet the recall pattern completely decides each memory's fate. That is the effect MemoryBank's core mechanism produces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mathematical Limits of the MemoryBank Model
&lt;/h2&gt;

&lt;p&gt;The formula is clean, and the gap between it and reality is just as clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear growth of S.&lt;/strong&gt; Every recall bumps S by exactly 1. Real human memory consolidation is nonlinear, though: the first review has the largest effect, and each later review helps less (diminishing returns). A damped increment like &lt;code&gt;S_new = S_old + 1/log(S_old + 1)&lt;/code&gt; is more realistic than &lt;code&gt;S_new = S_old + 1&lt;/code&gt;.&lt;/p&gt;
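&lt;p&gt;A quick sketch of the difference. The damped rule below is the illustrative variant from the paragraph above, not anything in the paper:&lt;/p&gt;

```python
import math

def bump_linear(S):
    # MemoryBank's rule: every recall adds exactly 1
    return S + 1

def bump_damped(S):
    # diminishing-returns variant (an illustration, not the paper's rule)
    return S + 1 / math.log(S + 1)

S_lin, S_dmp = 1.0, 1.0
for recall in range(1, 6):
    S_lin = bump_linear(S_lin)
    S_dmp = bump_damped(S_dmp)
    print(f"recall {recall}: linear S = {S_lin:.2f}, damped S = {S_dmp:.2f}")
```

&lt;p&gt;Under the damped rule each successive recall adds a smaller increment, so heavily recalled memories stop pulling away from everything else.&lt;/p&gt;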

&lt;p&gt;&lt;strong&gt;No emotional weighting.&lt;/strong&gt; Ebbinghaus's original experiments used nonsense syllables (things like WID and ZOF), and he himself acknowledged that meaningful material is forgotten about ten times more slowly than meaningless material. MemoryBank, however, sets the initial S to 1 for "이직 고민" (job-change worries) and "점심 메뉴" (lunch menu) alike; emotional importance is not reflected. One possible extension is to fold semantic weight into the initial S, e.g. have an LLM rate the emotional intensity of a conversation and assign S_init anywhere from 1 to 3.&lt;/p&gt;
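&lt;p&gt;The &lt;code&gt;S_init&lt;/code&gt; idea fits in a few lines. The intensity score here is a hypothetical stand-in for what an LLM judgment prompt would return:&lt;/p&gt;

```python
def initial_strength(intensity):
    """Map an emotional-intensity score in [0, 1] to S_init in 1..3.

    The 1-3 range follows the extension suggested in the text; in practice
    the intensity score would come from an LLM judgment call.
    """
    return 1 + round(2 * intensity)

# hypothetical intensities an LLM might assign
print(initial_strength(0.1))  # small talk about lunch -> 1
print(initial_strength(0.9))  # "thinking about quitting my job" -> 3
```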

&lt;p&gt;&lt;strong&gt;An ambiguous time unit.&lt;/strong&gt; The paper never specifies the unit of t. Days? Hours? Minutes? Conversation sessions? The shape of the curve changes completely depending on the unit, so it is the first parameter to pin down when applying this in a real service.&lt;/p&gt;
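&lt;p&gt;How much the unit matters is easy to see: the same one real day of silence yields wildly different retention depending only on how t is counted (a minimal illustration):&lt;/p&gt;

```python
import math

def retention(t, S):
    return math.exp(-t / S)

# One real day since the last recall, S = 2; only the unit of t changes:
print(f"t in days:    R = {retention(1, 2):.3f}")     # survives comfortably
print(f"t in hours:   R = {retention(24, 2):.6f}")    # effectively gone
print(f"t in minutes: R = {retention(1440, 2):.6f}")  # gone immediately
```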

&lt;p&gt;&lt;strong&gt;What counts as a recall.&lt;/strong&gt; The judgment that "this memory was recalled during conversation" is based on FAISS search results. Whether landing in the top-k counts as a recall, or the memory must actually surface in the response, changes how often S gets bumped. If a memory that was retrieved but never used in the response still has its S raised, memory strength can be overestimated.&lt;/p&gt;
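&lt;p&gt;One way to make that choice explicit in code. Both policies below are illustrative, since the paper leaves the criterion open:&lt;/p&gt;

```python
def recalled_ids(top_k_ids, response_text, id_to_snippet, strict=True):
    """Decide which retrieved memories count as 'recalled'.

    strict=False: any top-k hit bumps S (can overestimate strength).
    strict=True:  only memories whose content surfaces in the response count.
    """
    if not strict:
        return list(top_k_ids)
    return [i for i in top_k_ids if id_to_snippet[i] in response_text]

snippets = {1: "job change", 2: "weekend camping"}
resp = "Last time you mentioned a job change - how did it go?"
print(recalled_ids([1, 2], resp, snippets, strict=False))  # [1, 2]
print(recalled_ids([1, 2], resp, snippets, strict=True))   # [1]
```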

&lt;p&gt;These limitations double as directions for extending MemoryBank. When the authors explicitly call it "an exploratory and simplified model", they mean the formula is a &lt;strong&gt;starting point&lt;/strong&gt;, not a final answer. Layer emotional weighting, nonlinear S growth, and context-aware recall detection on top of the base formula, and you get a far more refined memory system.&lt;/p&gt;




&lt;p&gt;&lt;code&gt;R = e^(−t/S)&lt;/code&gt;. One formula explains the birth, reinforcement, and death of a memory. This isn't an elaborate memory architecture; it's a principle with 140 years of validation in psychology, ported onto an LLM. Simple, but effective. And because it's simple, it's extensible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"수식은 단순할수록 강하다. 복잡한 건 구현하기 쉽지만, 단순한 건 확장하기 쉽다."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>memorybank</category>
    </item>
    <item>
      <title>To Teach AI How to Remember, First Teach It How to Forget</title>
      <dc:creator>김이더</dc:creator>
      <pubDate>Mon, 13 Apr 2026 07:27:04 +0000</pubDate>
      <link>https://forem.com/_53fb7c03dd741a6124e4e/to-teach-ai-how-to-remember-first-teach-it-how-to-forget-6cb</link>
      <guid>https://forem.com/_53fb7c03dd741a6124e4e/to-teach-ai-how-to-remember-first-teach-it-how-to-forget-6cb</guid>
      <description>&lt;p&gt;Code on &lt;a href="https://github.com/zhongwanjun/MemoryBank-SiliconFriend" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Paper on &lt;a href="https://arxiv.org/abs/2305.10250" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt;.&lt;br&gt;
More posts at &lt;a href="https://radarlog.kr" rel="noopener noreferrer"&gt;radarlog.kr&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;I once asked ChatGPT about a conversation we had three days earlier. It had no idea. Tried Claude too. Same thing. Once the conversation ends, the memory vanishes.&lt;/p&gt;

&lt;p&gt;But humans remember conversations from three days ago. Well, the &lt;strong&gt;important&lt;/strong&gt; ones. You forget what you ate for lunch yesterday, but you remember your friend saying they're switching jobs. Memories that get recalled often stick around. Memories that never get pulled out fade naturally.&lt;/p&gt;

&lt;p&gt;MemoryBank transplants this exact principle into LLMs.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Is MemoryBank
&lt;/h2&gt;

&lt;p&gt;It's a long-term memory mechanism for LLMs, built by Wanjun Zhong and colleagues at Sun Yat-sen University. The paper was accepted at AAAI 2024, and the full code is open-sourced on GitHub. 419 stars. MIT license.&lt;/p&gt;

&lt;p&gt;The core idea is simple. In 1885, German psychologist Hermann Ebbinghaus discovered the &lt;strong&gt;forgetting curve&lt;/strong&gt; — a mathematical model of how memory decays over time. MemoryBank applies this to AI memory systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;R = e^(−t/S)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;R is memory retention, t is elapsed time, S is memory strength. When you first learn something, S starts at 1. Over time, R drops sharply. Ebbinghaus found that 42% of new information is forgotten within 20 minutes, and 67% after a day. He tested this on himself with nonsense syllables.&lt;/p&gt;

&lt;p&gt;Here's the key insight.&lt;/p&gt;

&lt;p&gt;When a memory gets &lt;strong&gt;recalled&lt;/strong&gt; even once, S increases by 1 and t resets to 0. That memory survives longer. Frequently recalled memories become progressively harder to forget, while untouched memories decay quickly.&lt;/p&gt;
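&lt;p&gt;The effect shows up immediately in the numbers; a quick sketch of next-day retention under &lt;code&gt;R = e^(−t/S)&lt;/code&gt;:&lt;/p&gt;

```python
import math

def retention(t, S):
    return math.exp(-t / S)

# Fresh memory, never recalled: one day later it's already down to ~37%.
print(f"S=1, one day later: R = {retention(1, 1):.2f}")

# Same memory recalled once (S=2, t reset to 0): a day later, ~61%.
print(f"S=2, one day later: R = {retention(1, 2):.2f}")

# Recalled four times (S=5): a day of silence barely costs anything.
print(f"S=5, one day later: R = {retention(1, 5):.2f}")
```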

&lt;p&gt;If you've worked on game servers, this feels like session timeout logic. If a user doesn't connect, the session expires. Every connection resets the timer. Except MemoryBank doesn't just reset the timer — it extends the timeout window itself. Each reconnection makes the session last even longer.&lt;/p&gt;
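&lt;p&gt;As a toy illustration of that analogy (not real server code), the difference is that the timeout window scales with a strength counter instead of staying fixed:&lt;/p&gt;

```python
class Session:
    """Toy model: the timeout window itself grows with each reconnect,
    rather than merely resetting the idle timer."""

    BASE_WINDOW = 10.0  # arbitrary units

    def __init__(self):
        self.strength = 1
        self.idle = 0.0

    def reconnect(self):
        self.strength += 1  # window grows: next expiry takes longer
        self.idle = 0.0     # timer resets, as in a normal session

    def expired(self):
        return self.idle > self.BASE_WINDOW * self.strength

s = Session()
s.idle = 15.0
print(s.expired())   # True: the window is only 10

s2 = Session()
s2.reconnect()       # one reconnect doubles the window to 20
s2.idle = 15.0
print(s2.expired())  # False: the same idle time now survives
```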

&lt;h2&gt;
  
  
  Three Pillars
&lt;/h2&gt;

&lt;p&gt;MemoryBank's architecture splits into three components: Memory Storage, Memory Retrieval, and Memory Updating. If you're a game developer, think ECS pattern — data component, read system, update system. Clean separation of concerns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Storage&lt;/strong&gt; saves raw conversations with timestamps. But it doesn't just pile up chat logs. It uses LLM calls to generate daily event summaries, global event summaries, and user personality profiles — all maintained hierarchically. "The user talked about career doubts last Monday" sits alongside "The user is introverted and growth-oriented."&lt;/p&gt;

&lt;p&gt;In Unreal Engine terms, it's like the SaveGame system. You keep the raw data intact while serializing key states separately. Later, you can reconstruct context from summaries alone without loading everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Retrieval&lt;/strong&gt; uses FAISS-based vector search. Every conversation turn and event summary gets encoded into vectors using an encoder model (MiniLM for English, Text2vec for Chinese). When new conversation comes in, the current context gets vectorized and matched against the FAISS index for the most relevant memories. The whole pipeline is built on LangChain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When a new message arrives
&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Search FAISS for relevant memories
&lt;/span&gt;&lt;span class="n"&gt;relevant_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inject retrieved memories + user profile + event summary into prompt
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant_memories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_portrait&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty here is that you don't have to cram the entire conversation history into the context window. Even Claude's 200K token window fills up fast in long conversations. MemoryBank cherry-picks only the relevant memories for the prompt, so token efficiency is much better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Updating&lt;/strong&gt; is the heart of this whole thing. It applies the forgetting curve formula to every memory piece, calculating retention R. When R drops below a threshold, that memory gets removed or weakened. Memories recalled during conversations get their S bumped up and t reset, so they survive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_retention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate memory retention&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recalled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Update a memory piece&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recalled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;# Increase memory strength
&lt;/span&gt;        &lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="c1"&gt;# Reset elapsed time
&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_retention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# If retention falls below threshold
&lt;/span&gt;        &lt;span class="n"&gt;memory_item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;forgotten&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;memory_item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's simple. That simplicity is the point. The authors explicitly state this is "an exploratory and highly simplified memory updating model." Real human memory is far more complex, but for LLM memory purposes, this level of simplification is effective enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  SiliconFriend — Memory Is the Prerequisite for Empathy
&lt;/h2&gt;

&lt;p&gt;SiliconFriend is the chatbot built on top of MemoryBank. It's not just memory bolted on — they also fine-tuned it with 38k psychological counseling dialogues using LoRA. Rank 16, 3 epochs on a single A100.&lt;/p&gt;

&lt;p&gt;Why psychological data? Because memory and empathy can't be separated. To ask "You mentioned thinking about switching jobs last time — how did that go?", two things are needed: &lt;strong&gt;remembering&lt;/strong&gt; the job talk, and &lt;strong&gt;empathizing&lt;/strong&gt; naturally when bringing it up. MemoryBank handles the former. The psychological fine-tuning handles the latter.&lt;/p&gt;

&lt;p&gt;The experiments make this clear. Base ChatGLM gives textbook comfort when you say "I'm having a rough time." SiliconFriend adjusts its response based on personality profiles built from past conversations. Cautious approach for introverted users, more active engagement for extroverted ones.&lt;/p&gt;

&lt;p&gt;The evaluation setup is solid too. ChatGPT role-played 15 virtual users with different personalities, generating 10 days of conversation history. From that history, 194 memory probing questions were created to measure recall accuracy. ChatGPT-based SiliconFriend scored highest, while open-source models (ChatGLM, BELLE) were still competitive on retrieval accuracy — just weaker on response naturalness, reflecting their base model capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Is This Different from Current AI Memory
&lt;/h2&gt;

&lt;p&gt;Claude, ChatGPT, Claude Code. All three have their own memory systems. But the approaches are fundamentally different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT&lt;/strong&gt; pre-computes conversation summaries and injects them into every chat. It's automatic. No user effort needed. But summaries lose nuance through compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; takes the opposite approach. Memory tools are &lt;strong&gt;on-demand&lt;/strong&gt;. You can search past conversations, but Claude has to decide "I should search now" for it to work. If it doesn't think to look, context stays buried. The trade-off: when it does search, it pulls from raw conversations, so depth is there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; uses CLAUDE.md files. Write project context in markdown, and it auto-loads at session start. Transparent and editable, but performance degrades as files grow. There's a 200-line index limit too.&lt;/p&gt;

&lt;p&gt;None of them have a &lt;strong&gt;forgetting mechanism&lt;/strong&gt;. That's the critical difference from MemoryBank.&lt;/p&gt;

&lt;p&gt;ChatGPT accumulates summaries indefinitely. Claude's conversation history keeps growing. Claude Code's CLAUDE.md gets bloated unless you manually prune it. If nothing ever gets forgotten, memory ironically becomes useless. When everything has equal weight, the truly important memories get harder to find quickly.&lt;/p&gt;

&lt;p&gt;MemoryBank introduces "forgetting" into this picture. Old, never-recalled memories naturally disappear. Only frequently recalled memories get reinforced. The result: your memory store contains only what actually matters. This is also a performance optimization — smaller FAISS index, better retrieval accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Plug It Into Your Project
&lt;/h2&gt;

&lt;p&gt;You can use MemoryBank directly, or just borrow the core ideas. Two paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 1: Use the repo as-is.&lt;/strong&gt; Clone from GitHub, run &lt;code&gt;pip install -r requirement.txt&lt;/code&gt;, set up your OpenAI API key. The ChatGPT-based SiliconFriend is the easiest to get running. Put your API key in &lt;code&gt;SiliconFriend-ChatGPT/launch.sh&lt;/code&gt; and run it. &lt;code&gt;--language=en&lt;/code&gt; for English, &lt;code&gt;--language=cn&lt;/code&gt; for Chinese.&lt;/p&gt;

&lt;p&gt;If you want open-source models, you need to set up ChatGLM or BELLE as the base, then download LoRA checkpoints. Requires an A100 80GB environment. A bit heavy for personal projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 2: Transplant the core mechanism into your own code.&lt;/strong&gt; This is the practical route. You need three things.&lt;/p&gt;

&lt;p&gt;First, a storage layer that saves conversations with timestamps. JSON works fine. At the end of each conversation, call an LLM API to generate daily summaries and personality summaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-13T14:30:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m stuck on this UE5 Slate widget layout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Memory strength (initial)
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="c1"&gt;# Elapsed time
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_summaries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_portrait&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Game dev, UE5 C++ specialist, introverted, problem-solving oriented&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, embedding + vector search. Use &lt;code&gt;sentence-transformers&lt;/code&gt; for embedding, FAISS for indexing. With LangChain, this pipeline takes a few lines.&lt;/p&gt;

&lt;p&gt;Third, forgetting curve-based updates. Once a day, or before each conversation starts, sweep through all memories and calculate R. Remove anything below a threshold (say, 0.1). During conversations, bump S and reset t for any retrieved memory.&lt;/p&gt;
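&lt;p&gt;A minimal sketch of that third piece, assuming a dict-based store and a once-a-day sweep. Names and structure here are illustrative, not MemoryBank's actual API:&lt;/p&gt;

```python
import math

THRESHOLD = 0.1  # retention cutoff, as suggested above

def sweep(memories, retrieved_ids=()):
    """One daily pass over the store: bump retrieved memories, drop decayed ones.

    `memories` maps id -> {'S': strength, 't': days since last event}.
    """
    for mid in retrieved_ids:      # recalled today: strengthen and reset
        memories[mid]["S"] += 1
        memories[mid]["t"] = 0
    for mem in memories.values():  # then everything ages by one day
        mem["t"] += 1
    return {
        mid: mem
        for mid, mem in memories.items()
        if math.exp(-mem["t"] / mem["S"]) >= THRESHOLD
    }

store = {"lunch": {"S": 1, "t": 2}, "job": {"S": 4, "t": 1}}
store = sweep(store, retrieved_ids=["job"])
print(sorted(store))  # ['job'] - lunch decayed below the threshold
```

&lt;p&gt;Whether a just-recalled memory should also age within the same sweep is a design choice; here it does, which only matters at the margins.&lt;/p&gt;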

&lt;p&gt;Combining these three gives you long-term memory for any chatbot or AI agent. Especially effective for &lt;strong&gt;services with repeated user interactions&lt;/strong&gt; — AI tutors, AI coaches, customer support bots, game NPCs.&lt;/p&gt;

&lt;p&gt;Imagine plugging this into game NPCs. A player tells an NPC about their adventures multiple times — the NPC remembers longer each time. A passing conversation the player had once? The NPC forgets it. Pretty natural behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and Caveats
&lt;/h2&gt;

&lt;p&gt;MemoryBank is validated research accepted at AAAI 2024, but it has clear limitations.&lt;/p&gt;

&lt;p&gt;Incrementing S as a simple integer doesn't reflect reality. Human memory strength is influenced by emotional significance, sleep, stress, and other variables. MemoryBank ignores all of these and decides S based solely on recall count. The authors explicitly acknowledge this.&lt;/p&gt;

&lt;p&gt;Another thing. Memory summarization and personality profiling require LLM API calls. Calling the summarization API after every conversation means costs accumulate as conversations increase. For production deployment, you'd need to adjust summarization frequency or offload the summary layer to a local lightweight model.&lt;/p&gt;

&lt;p&gt;Finally, the base repo hasn't seen major updates since its May 2023 release. The ChatGLM 6B and BELLE 7B base models are dated at this point. But the architecture itself is model-agnostic. You can plug it into Claude, GPT-4o, Gemma, Llama — anything. The point isn't the model. It's the memory mechanism.&lt;/p&gt;




&lt;p&gt;In the next post, I'll break down the &lt;code&gt;R = e^(−t/S)&lt;/code&gt; formula mathematically. What happens to the curve when S goes from 1 to 5, where to set the threshold, and a simulation to see it all in action.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Perfect memory isn't memory at all. Only memory that knows how to forget is real."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>memorybank</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
