<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shinsuke Matsuda</title>
    <description>The latest articles on Forem by Shinsuke Matsuda (@xhack).</description>
    <link>https://forem.com/xhack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3705433%2F79cb752a-d00d-4808-b773-6c443c8b3657.png</url>
      <title>Forem: Shinsuke Matsuda</title>
      <link>https://forem.com/xhack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/xhack"/>
    <language>en</language>
    <item>
      <title>How Do We Keep Evolving a 100k-Line Codebase in the Age of AI?</title>
      <dc:creator>Shinsuke Matsuda</dc:creator>
      <pubDate>Thu, 22 Jan 2026 13:24:33 +0000</pubDate>
      <link>https://forem.com/xhack/how-do-we-keep-evolving-a-100k-line-codebase-in-the-age-of-ai-2kni</link>
      <guid>https://forem.com/xhack/how-do-we-keep-evolving-a-100k-line-codebase-in-the-age-of-ai-2kni</guid>
      <description>&lt;h2&gt;
  
  
  Plan Stack as a Methodology
&lt;/h2&gt;




&lt;p&gt;Imagine a codebase with over 100,000 lines of code.&lt;/p&gt;

&lt;p&gt;Six months ago, an AI-generated pull request added several thousand more.&lt;br&gt;
The tests passed. The review was rushed. The change was merged.&lt;/p&gt;

&lt;p&gt;Today, you need to modify that area.&lt;/p&gt;

&lt;p&gt;You can read the code.&lt;br&gt;
You can see &lt;em&gt;what&lt;/em&gt; it does.&lt;/p&gt;

&lt;p&gt;But you have no idea &lt;strong&gt;why it is the way it is&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No one remembers what constraints existed.&lt;br&gt;
No one knows which alternatives were considered.&lt;br&gt;
And the person who “wrote” the code was an AI.&lt;/p&gt;

&lt;p&gt;This situation is no longer rare — and it’s not temporary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real problem is not “too much code”
&lt;/h2&gt;

&lt;p&gt;Traditionally, even when code was messy, we could infer intent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We knew who wrote it
&lt;/li&gt;
&lt;li&gt;We remembered the discussion
&lt;/li&gt;
&lt;li&gt;We could reconstruct the reasoning from context
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With AI-generated code, intent lives somewhere else.&lt;/p&gt;

&lt;p&gt;It depends entirely on &lt;strong&gt;what instructions were given to the AI&lt;/strong&gt; —&lt;br&gt;
instructions that are usually &lt;em&gt;not preserved&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So reviewers are left asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was this design intentional or accidental?&lt;/li&gt;
&lt;li&gt;Were other options considered?&lt;/li&gt;
&lt;li&gt;What constraints existed at the time?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code doesn’t answer these questions.&lt;/p&gt;

&lt;p&gt;As this repeats at scale, reviews slowly degrade into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does this look obviously broken?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Eventually, review itself stops scaling.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is not a review problem — it’s a methodology problem
&lt;/h2&gt;

&lt;p&gt;The real issue is bigger:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we keep evolving a 100k- or 1M-line codebase&lt;br&gt;&lt;br&gt;
over years, when most of the code is written by AI?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not about productivity.&lt;br&gt;&lt;br&gt;
It’s about &lt;strong&gt;control&lt;/strong&gt;, &lt;strong&gt;maintenance&lt;/strong&gt;, and &lt;strong&gt;long-term evolution&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  “Just write plans and review them” — that’s only the entry point
&lt;/h2&gt;

&lt;p&gt;Plan Stack is often summarized as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a plan first, commit it, and review the plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds like a review optimization trick.&lt;/p&gt;

&lt;p&gt;It’s not.&lt;/p&gt;

&lt;p&gt;That’s just the entry point.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real value of Plan Stack
&lt;/h2&gt;

&lt;p&gt;Plan Stack provides a &lt;strong&gt;structure&lt;/strong&gt; that makes AI-driven development sustainable at scale.&lt;/p&gt;

&lt;p&gt;More precisely, it enables three things that large, long-lived codebases&lt;br&gt;
struggle with in the AI era.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scale: letting humans review intent, not output
&lt;/h2&gt;

&lt;p&gt;AI can generate unlimited code.&lt;br&gt;
Humans cannot review unlimited details.&lt;/p&gt;

&lt;p&gt;In large systems, the bottleneck is no longer implementation —&lt;br&gt;
it’s human judgment.&lt;/p&gt;

&lt;p&gt;By introducing &lt;em&gt;plans&lt;/em&gt; as first-class artifacts,&lt;br&gt;
humans review &lt;strong&gt;intent&lt;/strong&gt; instead of raw diffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is in scope for this change?&lt;/li&gt;
&lt;li&gt;What is explicitly out of scope?&lt;/li&gt;
&lt;li&gt;Which trade-offs are being made &lt;em&gt;this time&lt;/em&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows a small number of humans to stay in control,&lt;br&gt;
even as AI-generated output grows by orders of magnitude.&lt;/p&gt;




&lt;h2&gt;
  
  
  Maintainability: preserving the “why” over time
&lt;/h2&gt;

&lt;p&gt;Six months later, the hardest question is never:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What does this code do?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Why is it like this?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Plan Stack makes that answer explicit.&lt;/p&gt;

&lt;p&gt;Not as comments.&lt;br&gt;
Not as detached documentation.&lt;/p&gt;

&lt;p&gt;But as a plan that is reviewed, committed,&lt;br&gt;
and permanently associated with the change.&lt;/p&gt;

&lt;p&gt;This is what allows future maintainers&lt;br&gt;
— including your future self —&lt;br&gt;
to re-enter the decision context quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continuous evolution: plans as accumulated knowledge
&lt;/h2&gt;

&lt;p&gt;A plan is not a disposable note.&lt;/p&gt;

&lt;p&gt;In a long-lived codebase,&lt;br&gt;
plans accumulate as a &lt;strong&gt;decision history&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why this abstraction exists&lt;/li&gt;
&lt;li&gt;Why a shortcut was acceptable &lt;em&gt;at the time&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Why a deeper refactor was deferred&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the next change comes,&lt;br&gt;
those past plans become shared context&lt;br&gt;
for both humans and AI.&lt;/p&gt;

&lt;p&gt;As the codebase grows to hundreds of thousands or millions of lines,&lt;br&gt;
this accumulated judgment becomes more valuable, not less.&lt;/p&gt;




&lt;h2&gt;
  
  
  Isn’t this just ADRs or design docs?
&lt;/h2&gt;

&lt;p&gt;This question always comes up.&lt;/p&gt;

&lt;p&gt;ADRs and design documents are excellent at capturing&lt;br&gt;
&lt;strong&gt;big, infrequent decisions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture choices&lt;/li&gt;
&lt;li&gt;Technology selection&lt;/li&gt;
&lt;li&gt;Long-term direction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But AI-driven development explodes the number of&lt;br&gt;
&lt;strong&gt;small, local decisions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much abstraction is enough &lt;em&gt;for this change&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;Should we optimize now or accept some debt?&lt;/li&gt;
&lt;li&gt;Which edge cases are intentionally ignored?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These decisions shape the codebase,&lt;br&gt;
but rarely make it into ADRs.&lt;/p&gt;

&lt;p&gt;Plan Stack exists exactly at this missing layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key difference: proximity to code
&lt;/h2&gt;

&lt;p&gt;ADRs and design docs are distant from code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rarely updated&lt;/li&gt;
&lt;li&gt;Easy to drift out of sync&lt;/li&gt;
&lt;li&gt;Not part of the PR review loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plans, in contrast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are written per PR&lt;/li&gt;
&lt;li&gt;Are reviewed together with code&lt;/li&gt;
&lt;li&gt;Live and die with the change itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plans are not documentation.&lt;br&gt;&lt;br&gt;
They are &lt;strong&gt;part of the commit&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is the plan for humans or for AI?
&lt;/h2&gt;

&lt;p&gt;At first glance, plans look like instructions for AI.&lt;/p&gt;

&lt;p&gt;They do help AI produce more stable output.&lt;/p&gt;

&lt;p&gt;But that’s not the point.&lt;/p&gt;

&lt;p&gt;AI can write code without plans.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Humans can’t judge AI-written code without them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real beneficiary of Plan Stack is the human.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plans as externalized human reasoning
&lt;/h2&gt;

&lt;p&gt;AI-generated code removes the human “thinking trace” from the artifact.&lt;/p&gt;

&lt;p&gt;Plans restore it.&lt;/p&gt;

&lt;p&gt;They externalize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was decided&lt;/li&gt;
&lt;li&gt;What was deferred&lt;/li&gt;
&lt;li&gt;Where compromises were made&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This preserved judgment is what makes large AI-written codebases&lt;br&gt;
&lt;em&gt;maintainable&lt;/em&gt; instead of merely &lt;em&gt;functional&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plan as discipline: how humans retake control
&lt;/h2&gt;

&lt;p&gt;This process is not just a workflow improvement.&lt;/p&gt;

&lt;p&gt;It is a form of &lt;strong&gt;discipline&lt;/strong&gt; in AI-driven development.&lt;/p&gt;

&lt;p&gt;The simple rule —&lt;br&gt;
&lt;em&gt;“don’t let AI write code immediately”&lt;/em&gt; —&lt;br&gt;
creates a crucial buffer.&lt;/p&gt;

&lt;p&gt;That buffer pulls control back to the human side.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Shift-left review: intervene at the cheapest point
&lt;/h3&gt;

&lt;p&gt;In traditional development,&lt;br&gt;
post-implementation code review&lt;br&gt;
was the last line of defense for quality.&lt;/p&gt;

&lt;p&gt;In the AI era, reviewing thousands of generated lines &lt;em&gt;after the fact&lt;/em&gt;&lt;br&gt;
is not realistic.&lt;/p&gt;

&lt;p&gt;Having the AI write a &lt;strong&gt;plan first&lt;/strong&gt;, and letting humans review that plan,&lt;br&gt;
means intervening where mistakes are cheapest to fix.&lt;/p&gt;

&lt;p&gt;This prevents development from turning into&lt;br&gt;
a slot machine of “generate → patch → regenerate”.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Reducing cognitive load, focusing on decisions
&lt;/h3&gt;

&lt;p&gt;Writing a plan from scratch is cognitively expensive for humans.&lt;/p&gt;

&lt;p&gt;But having the AI produce a &lt;strong&gt;draft plan&lt;/strong&gt; changes the equation.&lt;/p&gt;

&lt;p&gt;Humans no longer start from a blank page —&lt;br&gt;
they review, adjust, and approve.&lt;/p&gt;

&lt;p&gt;This lets humans focus on the highest-value activity:&lt;br&gt;
&lt;strong&gt;decision making&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI: expands the space — options, dependencies, edge cases
&lt;/li&gt;
&lt;li&gt;Humans: narrow it — scope, trade-offs, priorities
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. The plan as a contract
&lt;/h3&gt;

&lt;p&gt;Once a plan is reviewed and agreed upon,&lt;br&gt;
code generation becomes contract execution.&lt;/p&gt;

&lt;p&gt;If the output looks wrong, the question becomes clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was the plan ambiguous?&lt;/li&gt;
&lt;li&gt;Or did the AI fail to execute it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spending more time reviewing the plan&lt;br&gt;
often results in &lt;em&gt;less&lt;/em&gt; time spent coding and debugging overall.&lt;/p&gt;

&lt;p&gt;This is how Plan Stack achieves both &lt;strong&gt;scale&lt;/strong&gt; and &lt;strong&gt;control&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The future role of humans: the decision-maker
&lt;/h2&gt;

&lt;p&gt;In the AI era, human roles converge toward one thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Making decisions under constraints.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;What matters now?&lt;/li&gt;
&lt;li&gt;What can wait?&lt;/li&gt;
&lt;li&gt;What risks are acceptable?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI executes.&lt;br&gt;&lt;br&gt;
Humans decide.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ownership without authorship
&lt;/h2&gt;

&lt;p&gt;When AI writes most of the code, authorship becomes blurry.&lt;/p&gt;

&lt;p&gt;Ownership doesn’t.&lt;/p&gt;

&lt;p&gt;Ownership comes from recorded judgment.&lt;/p&gt;

&lt;p&gt;If decisions are preserved,&lt;br&gt;
the codebase remains &lt;em&gt;human-owned&lt;/em&gt; —&lt;br&gt;
even when AI-written.&lt;/p&gt;




&lt;h2&gt;
  
  
  Without plans, the future looks like this
&lt;/h2&gt;

&lt;p&gt;Not “unmaintainable” code — worse.&lt;/p&gt;

&lt;p&gt;Code that is &lt;strong&gt;untouchable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No one dares to modify existing logic,&lt;br&gt;
so new behavior gets added &lt;em&gt;next to it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The same concept ends up implemented three times,&lt;br&gt;
each slightly different,&lt;br&gt;
because no one knows which constraints still matter.&lt;/p&gt;

&lt;p&gt;The system grows, but it doesn’t evolve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI writes code.&lt;br&gt;&lt;br&gt;
Humans decide.&lt;/p&gt;

&lt;p&gt;And decisions must survive longer than any single implementation.&lt;/p&gt;

&lt;p&gt;Plan Stack is a minimal structure&lt;br&gt;
that allows humans to remain decision-makers —&lt;br&gt;
even as AI becomes the primary author.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small invitation to try
&lt;/h2&gt;

&lt;p&gt;If this sounds heavy, start small.&lt;/p&gt;

&lt;p&gt;In your next PR, write a short plan first:&lt;br&gt;
what you decided, and what you intentionally didn’t.&lt;/p&gt;

&lt;p&gt;It doesn’t need to be perfect.&lt;/p&gt;

&lt;p&gt;Just make the judgment explicit — and commit it with the code.&lt;/p&gt;
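&lt;p&gt;It can be as small as a few lines committed next to the change. The path and headings below are just one hypothetical convention, not a required format:&lt;/p&gt;

```
docs/plans/0042-add-retry-to-sync.md

## Decided
Retry failed syncs up to 3 times with exponential backoff.

## Intentionally not done
No change to the queue schema; the deeper refactor is deferred.

## Trade-off accepted
Duplicate deliveries are possible; the consumer is already idempotent.
```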

&lt;p&gt;You’ll notice the shift immediately:&lt;/p&gt;

&lt;p&gt;AI-assisted development stops being &lt;em&gt;generation&lt;/em&gt;&lt;br&gt;&lt;br&gt;
and starts becoming &lt;em&gt;control&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Real-Time Staff Matching Without Heavy Optimization: An Insertion-Based Approach</title>
      <dc:creator>Shinsuke Matsuda</dc:creator>
      <pubDate>Tue, 20 Jan 2026 12:25:09 +0000</pubDate>
      <link>https://forem.com/xhack/real-time-staff-matching-without-heavy-optimization-an-insertion-based-approach-162l</link>
      <guid>https://forem.com/xhack/real-time-staff-matching-without-heavy-optimization-an-insertion-based-approach-162l</guid>
      <description>&lt;h2&gt;
  
  
  0. Introduction: The Hidden Complexity of Staff Matching
&lt;/h2&gt;

&lt;p&gt;I’m working on a service that matches staff members with customer requests.&lt;/p&gt;

&lt;p&gt;At first glance, the problem looks trivial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick the nearest staff member
&lt;/li&gt;
&lt;li&gt;Or assign someone who is available
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in reality, staff matching quickly becomes complicated once you consider real-world constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each job has a fixed start time and duration
&lt;/li&gt;
&lt;li&gt;Staff members already have scheduled jobs
&lt;/li&gt;
&lt;li&gt;Different locations imply travel time
&lt;/li&gt;
&lt;li&gt;Skill requirements must be satisfied
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you try to handle all of these at the same time,&lt;br&gt;&lt;br&gt;
simple nearest-neighbor searches or availability checks start to break down.&lt;/p&gt;

&lt;p&gt;At the same time, we often have strong practical requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;We need to select one staff member in real time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;We don’t want heavy optimization (ILP / MIP solvers)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The logic must be explainable and operable in production&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article describes a &lt;strong&gt;practical matching approach that actually works&lt;/strong&gt; under those constraints.&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Article Covers (and What It Doesn’t)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What this article covers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A practitioner-oriented way to frame the problem
&lt;/li&gt;
&lt;li&gt;A simple but powerful insertion-based algorithm
&lt;/li&gt;
&lt;li&gt;Why this approach works well in real systems
&lt;/li&gt;
&lt;li&gt;Engineering decisions that make it maintainable
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What this article does not cover
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mathematical formulations
&lt;/li&gt;
&lt;li&gt;Exact optimization models
&lt;/li&gt;
&lt;li&gt;Detailed solver implementations
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is intentionally &lt;strong&gt;not&lt;/strong&gt; an academic treatment.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Problem Setting (From a Practitioner’s Perspective)
&lt;/h2&gt;

&lt;p&gt;We consider the following matching problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  Job (Request)
&lt;/h3&gt;

&lt;p&gt;Each job has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start time
&lt;/li&gt;
&lt;li&gt;Duration (e.g. 30-minute units, up to several hours)
&lt;/li&gt;
&lt;li&gt;Location (latitude / longitude)
&lt;/li&gt;
&lt;li&gt;Required skills
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Staff
&lt;/h3&gt;

&lt;p&gt;Each staff member has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A set of skills
&lt;/li&gt;
&lt;li&gt;Existing scheduled jobs (each with time and location)
&lt;/li&gt;
&lt;li&gt;A base location (e.g. office or home)
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Constraints
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A staff member cannot work on overlapping jobs
&lt;/li&gt;
&lt;li&gt;Travel time between locations must be considered
&lt;/li&gt;
&lt;li&gt;A free time slot does not necessarily mean feasibility
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Given a single job request, select the best staff member in real time.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is closer to a &lt;em&gt;recommendation&lt;/em&gt; problem than a global assignment problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Common but Broken Approaches
&lt;/h2&gt;

&lt;p&gt;Before describing the solution, it’s useful to look at approaches that often fail in practice.&lt;/p&gt;
&lt;h3&gt;
  
  
  Picking the Nearest Staff Member
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ignores the staff member’s previous job
&lt;/li&gt;
&lt;li&gt;Travel time may make the job infeasible
&lt;/li&gt;
&lt;li&gt;Schedules look valid on paper but fail in reality
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Picking Anyone Who Is Available
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Time gaps are misleading
&lt;/li&gt;
&lt;li&gt;Feasibility depends on adjacent jobs
&lt;/li&gt;
&lt;li&gt;Decisions are hard to explain afterward
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Global Optimization (ILP / MIP)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Computationally expensive
&lt;/li&gt;
&lt;li&gt;Complex to implement and operate
&lt;/li&gt;
&lt;li&gt;Poor fit for real-time, online decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These approaches are reasonable in theory, but often &lt;strong&gt;don’t survive production constraints&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. A Shift in Perspective: From Assignment to Insertion
&lt;/h2&gt;

&lt;p&gt;Here is the key mental shift.&lt;/p&gt;

&lt;p&gt;A staff member’s day can be represented as a &lt;strong&gt;sequence of (time, location)&lt;/strong&gt; pairs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[ 09:00 @ A ] → [ 11:00 @ B ] → [ 15:00 @ C ]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A new job is not something to &lt;em&gt;assign&lt;/em&gt; globally.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;block that might be inserted&lt;/strong&gt; somewhere into this sequence.&lt;/p&gt;

&lt;p&gt;So the core question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can this job be inserted into the staff member’s schedule without breaking feasibility?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This framing turns a complex matching problem into a series of local feasibility checks.&lt;/p&gt;
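&lt;p&gt;A minimal sketch of that local check might look like this. The job shape, the minutes-since-midnight time unit, and the &lt;code&gt;travel_minutes&lt;/code&gt; stand-in are my own illustrative assumptions, not details from this article:&lt;/p&gt;

```python
from operator import le  # le(a, b) is true when a is not later than b

def travel_minutes(loc_a, loc_b):
    # Stand-in for a real travel-time estimate (a routing service in
    # practice). Straight-line distance scaled to minutes, illustrative only.
    dx = loc_a[0] - loc_b[0]
    dy = loc_a[1] - loc_b[1]
    return ((dx * dx + dy * dy) ** 0.5) * 10

def can_insert(prev_job, new_job, next_job):
    """Local feasibility: does new_job fit between its neighbours once
    travel time is counted? Jobs are dicts with 'start' and 'end' in
    minutes since midnight, and 'loc' as an (x, y) pair."""
    arrive = prev_job["end"] + travel_minutes(prev_job["loc"], new_job["loc"])
    if not le(arrive, new_job["start"]):
        return False  # cannot reach the new job before it starts
    depart = new_job["end"] + travel_minutes(new_job["loc"], next_job["loc"])
    return le(depart, next_job["start"])  # must also make the next job
```

&lt;p&gt;The check only ever looks at the two neighbouring jobs, which is exactly what keeps it cheap.&lt;/p&gt;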




&lt;h2&gt;
  
  
  4. Insertion Heuristics (In Plain Terms)
&lt;/h2&gt;

&lt;p&gt;Only after implementing this approach did I learn that it has a name:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insertion heuristics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They are commonly used in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vehicle Routing Problems (VRP)
&lt;/li&gt;
&lt;li&gt;Technician Scheduling Problems
&lt;/li&gt;
&lt;li&gt;Dynamic routing and scheduling
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The general idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take an existing route or schedule
&lt;/li&gt;
&lt;li&gt;Try inserting a new job at a feasible position
&lt;/li&gt;
&lt;li&gt;Choose the insertion with the lowest cost
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs arrive &lt;strong&gt;online&lt;/strong&gt;, one by one
&lt;/li&gt;
&lt;li&gt;We must decide &lt;strong&gt;immediately&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;We only select &lt;strong&gt;one&lt;/strong&gt; staff member
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Algorithm Overview (Implementation-Oriented)
&lt;/h2&gt;

&lt;p&gt;The algorithm itself is surprisingly simple.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filter by required skills&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter by coarse time availability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find the immediately previous and next jobs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check feasibility including travel time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute an insertion cost&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select the staff member with the lowest score&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two principles matter most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check feasibility before scoring&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only score candidates that actually work&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the algorithm fast and robust.&lt;/p&gt;
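&lt;p&gt;The six steps can be combined roughly as follows. The data shapes, the injected &lt;code&gt;travel&lt;/code&gt; function, and the pure-travel-time scoring are my own illustrative choices, not prescribed by this article:&lt;/p&gt;

```python
from operator import le  # le(a, b) is true when a is not later than b

def pick_staff(job, staff_list, travel):
    """Return the staff member with the cheapest feasible insertion, or
    None. 'travel(loc_a, loc_b)' returns travel time in minutes."""
    best, best_cost = None, None
    for staff in staff_list:
        # Step 1: hard filter on required skills.
        if not job["skills"].issubset(staff["skills"]):
            continue
        # Step 2: coarse availability; any overlapping job disqualifies.
        before = [j for j in staff["jobs"] if le(j["end"], job["start"])]
        after = [j for j in staff["jobs"] if le(job["end"], j["start"])]
        if len(before) + len(after) != len(staff["jobs"]):
            continue
        # Step 3: the immediately previous and next jobs around the slot.
        prev = max(before, key=lambda j: j["end"]) if before else None
        nxt = min(after, key=lambda j: j["start"]) if after else None
        prev_end = prev["end"] if prev else 0
        prev_loc = prev["loc"] if prev else staff["base"]
        # Step 4: feasibility including travel time, before any scoring.
        inbound = travel(prev_loc, job["loc"])
        if not le(prev_end + inbound, job["start"]):
            continue
        if nxt and not le(job["end"] + travel(job["loc"], nxt["loc"]),
                          nxt["start"]):
            continue
        # Step 5: insertion cost; here simply the inbound travel time.
        cost = inbound
        # Step 6: keep the lowest-cost feasible candidate seen so far.
        if best is None or not le(best_cost, cost):
            best, best_cost = staff, cost
    return best
```

&lt;p&gt;A real scoring function would fold in business rules, but the shape stays the same: infeasible candidates are discarded before any score is computed.&lt;/p&gt;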




&lt;h2&gt;
  
  
  6. Why This Works Well in Practice
&lt;/h2&gt;

&lt;p&gt;This approach works well in real systems for clear reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low computational cost
&lt;/li&gt;
&lt;li&gt;Suitable for real-time responses
&lt;/li&gt;
&lt;li&gt;No need for post-hoc constraint fixing
&lt;/li&gt;
&lt;li&gt;Business logic fits naturally into the scoring function
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels like optimization,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;without behaving like heavy optimization&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Engineering Design: Keep the Algorithm Pure
&lt;/h2&gt;

&lt;p&gt;One important design choice is keeping the matching logic as a &lt;strong&gt;pure function&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No database access inside the algorithm
&lt;/li&gt;
&lt;li&gt;All required data is passed as a snapshot
&lt;/li&gt;
&lt;li&gt;Same input always produces the same output
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High reproducibility
&lt;/li&gt;
&lt;li&gt;Easy unit testing
&lt;/li&gt;
&lt;li&gt;Simple A/B testing
&lt;/li&gt;
&lt;li&gt;Easy replacement or extension in the future
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation pays off quickly in real projects.&lt;/p&gt;
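&lt;p&gt;One minimal way to express that boundary in code. The names and the precomputed score field are hypothetical; the point is the snapshot-in, decision-out shape:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatchSnapshot:
    """Everything the matcher needs, loaded by the caller up front;
    the algorithm itself never touches the database."""
    job: dict
    candidates: tuple  # pre-filtered staff records, each with a 'score'

def match(snapshot):
    # Pure: no I/O, no clock, no randomness. The same snapshot always
    # yields the same answer, which makes unit tests, replays, and
    # A/B comparisons straightforward.
    ranked = sorted(snapshot.candidates, key=lambda s: s["score"])
    return ranked[0]["id"] if ranked else None
```

&lt;p&gt;Because the function is pure, comparing two algorithm versions is just calling both on the same snapshot.&lt;/p&gt;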




&lt;h2&gt;
  
  
  8. Why We Don’t Encode This in SQL
&lt;/h2&gt;

&lt;p&gt;It’s tempting to push matching logic into complex SQL queries.&lt;/p&gt;

&lt;p&gt;In practice, this often leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large JOINs
&lt;/li&gt;
&lt;li&gt;Window functions everywhere
&lt;/li&gt;
&lt;li&gt;Hard-to-debug logic
&lt;/li&gt;
&lt;li&gt;Unstable query performance
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Matching logic is procedural by nature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find adjacent jobs
&lt;/li&gt;
&lt;li&gt;Check travel feasibility
&lt;/li&gt;
&lt;li&gt;Evaluate insertion cost
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These steps are far more naturally expressed in application code than in SQL.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Practical Notes on Data Loading
&lt;/h2&gt;

&lt;p&gt;This does &lt;strong&gt;not&lt;/strong&gt; mean loading everything into memory.&lt;/p&gt;

&lt;p&gt;In production systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-filter candidate staff members
&lt;/li&gt;
&lt;li&gt;Load only nearby or relevant scheduled jobs
&lt;/li&gt;
&lt;li&gt;Keep snapshots minimal and fresh
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final booking should still be handled by a separate, transactional process.&lt;/p&gt;

&lt;p&gt;Separating &lt;strong&gt;recommendation&lt;/strong&gt; from &lt;strong&gt;commitment&lt;/strong&gt; keeps the system stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Limitations and Trade-Offs
&lt;/h2&gt;

&lt;p&gt;This approach is not perfect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It does not produce a global optimum
&lt;/li&gt;
&lt;li&gt;It is not suitable for batch optimization of many jobs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, for the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Who should handle this job right now?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…it delivers more than enough value.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-world constraints naturally lead to insertion-based thinking
&lt;/li&gt;
&lt;li&gt;Heavy optimization is not always the right tool
&lt;/li&gt;
&lt;li&gt;Choosing &lt;em&gt;not&lt;/em&gt; to optimize globally is also a design decision
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with time, travel, and skill constraints,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;a practical and maintainable solution is possible&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>heuristics</category>
    </item>
    <item>
      <title>TL;DR: Code Reviews Break in the AI Era — Plans Fix Them</title>
      <dc:creator>Shinsuke Matsuda</dc:creator>
      <pubDate>Tue, 20 Jan 2026 09:43:26 +0000</pubDate>
      <link>https://forem.com/xhack/tldr-code-reviews-break-in-the-ai-era-plans-fix-them-3ge9</link>
      <guid>https://forem.com/xhack/tldr-code-reviews-break-in-the-ai-era-plans-fix-them-3ge9</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;AI makes writing code dramatically faster.&lt;br&gt;&lt;br&gt;
But that speed quietly breaks code reviews.&lt;/p&gt;

&lt;p&gt;The problem isn’t large diffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real problem is that reviewers no longer know where to start.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix is not “better reviews” — it’s better plans.
&lt;/h3&gt;

&lt;p&gt;In AI-assisted development, a &lt;strong&gt;plan&lt;/strong&gt; is not a TODO list.&lt;/p&gt;

&lt;p&gt;A plan is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;unit of review&lt;/strong&gt; (intent, scope, boundaries)&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;unit of generation&lt;/strong&gt; (what AI should and should not change)&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;unit of knowledge&lt;/strong&gt; (what gets promoted to docs later)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When plans are treated as first-class artifacts and committed to the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviews start from &lt;em&gt;intent&lt;/em&gt;, not diff scanning&lt;/li&gt;
&lt;li&gt;PR sizes shrink structurally&lt;/li&gt;
&lt;li&gt;Human reviews focus on judgment, not syntax&lt;/li&gt;
&lt;li&gt;AI reviews become intent-aware&lt;/li&gt;
&lt;li&gt;Context window usage drops&lt;/li&gt;
&lt;li&gt;Future changes become cheaper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Plans are not documents. They are process.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If AI writes more code, humans must decide &lt;em&gt;more clearly&lt;/em&gt; —&lt;br&gt;&lt;br&gt;
and plans are how we do that.&lt;/p&gt;




&lt;p&gt;👉 &lt;strong&gt;Read the full article:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/xhack/why-plans-should-be-first-class-artifacts-in-ai-assisted-development-4f36"&gt;Why Plans Should Be First-Class Artifacts in AI-Assisted Development&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why “Plans” Should Be First-Class Artifacts in AI-Assisted Development</title>
      <dc:creator>Shinsuke Matsuda</dc:creator>
      <pubDate>Tue, 20 Jan 2026 09:37:22 +0000</pubDate>
      <link>https://forem.com/xhack/why-plans-should-be-first-class-artifacts-in-ai-assisted-development-4f36</link>
      <guid>https://forem.com/xhack/why-plans-should-be-first-class-artifacts-in-ai-assisted-development-4f36</guid>
      <description>&lt;h2&gt;
  
  
  Why Plans Should Be First-Class Artifacts in AI-Assisted Development
&lt;/h2&gt;

&lt;p&gt;AI-assisted development is no longer experimental.&lt;br&gt;&lt;br&gt;
At this point, it’s fair to say it has become mainstream.&lt;/p&gt;

&lt;p&gt;There are, of course, organizations that cannot adopt generative AI yet due to security or regulatory constraints.&lt;br&gt;&lt;br&gt;
But for software teams that &lt;em&gt;can&lt;/em&gt; use AI and still choose not to, the issue is no longer technical—it’s a matter of mindset.&lt;/p&gt;

&lt;p&gt;Productivity gains from AI-assisted development are real.&lt;br&gt;&lt;br&gt;
Today, shipping and operating small to mid-sized services with minimal handwritten code is realistic.&lt;/p&gt;

&lt;p&gt;And yet, many teams report the same frustrations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“We can’t keep track of the massive amount of AI-generated code.”&lt;/li&gt;
&lt;li&gt;“AI-written code is hard to maintain.”&lt;/li&gt;
&lt;li&gt;“Pull requests are getting huge, and reviews have become the bottleneck.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I believe these problems are not caused by limitations of AI itself.&lt;br&gt;&lt;br&gt;
They stem from the fact that &lt;strong&gt;our development processes haven’t caught up with the AI era&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: Reviews Can’t Even Start
&lt;/h2&gt;

&lt;p&gt;AI dramatically increases the speed of code generation.&lt;br&gt;&lt;br&gt;
But the real pain in AI-assisted development is not that diffs are large.&lt;/p&gt;

&lt;p&gt;The real problem is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reviewers don’t know where to start.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When reviewing an AI-generated pull request, reviewers implicitly need answers to questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the intent of this change?&lt;/li&gt;
&lt;li&gt;What is the scope of this PR? (Where is the review boundary?)&lt;/li&gt;
&lt;li&gt;What assumptions are being preserved?&lt;/li&gt;
&lt;li&gt;Which parts are risky and need extra attention?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without an explicit plan, reviewers are forced to reconstruct all of this information &lt;strong&gt;from the diff alone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As AI writes more code faster, this reconstruction step becomes the true bottleneck.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Plan” Means in This Article
&lt;/h2&gt;

&lt;p&gt;Before going further, let’s define what “plan” means here.&lt;/p&gt;

&lt;p&gt;This is not about tools or features.&lt;br&gt;&lt;br&gt;
It’s about &lt;strong&gt;how we structure information to make reviews work in the AI era&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A plan is not a TODO list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plan is a compressed decision artifact that makes implementation and review possible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good plan typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Findings from investigating the existing codebase
(what matters, what assumptions exist)&lt;/li&gt;
&lt;li&gt;Goals and non-goals&lt;/li&gt;
&lt;li&gt;References to relevant specs or architecture&lt;/li&gt;
&lt;li&gt;Expected impact and risks&lt;/li&gt;
&lt;li&gt;The review boundary (how far this PR should go)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key point is that a plan is &lt;strong&gt;reviewable in size&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It is not a long design document.&lt;/p&gt;
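
&lt;p&gt;As one concrete shape (a hypothetical template; the section names are illustrative, not a prescribed format), a committed plan might look like this:&lt;/p&gt;

```markdown
# Plan: Per-token rate limiting for the public API

## Findings
- Throttling lives in middleware today and assumes per-IP keys.

## Goals / Non-goals
- Goal: per-token limits for authenticated clients.
- Non-goal: changing anonymous limits (behavior unchanged in this step).

## Impact and Risks
- Touches middleware and token lookup; risk: cache stampede on token miss.

## Review boundary
- This PR stops at the middleware; dashboard changes go to a follow-up PR.
```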




&lt;h2&gt;
  
  
  Why Plans Should Be Committed
&lt;/h2&gt;

&lt;p&gt;Here’s an important point:&lt;br&gt;&lt;br&gt;
plans of the size and quality described above are not merely theoretical.&lt;/p&gt;

&lt;p&gt;Modern AI tools (including Claude Code) can reliably generate plans at this granularity.&lt;br&gt;&lt;br&gt;
That makes plans not temporary notes, but real artifacts worth keeping.&lt;/p&gt;




&lt;h3&gt;
  
  
  1) A Plan Is a Deliverable
&lt;/h3&gt;

&lt;p&gt;In AI-assisted development, a plan is no longer a personal memo.&lt;/p&gt;

&lt;p&gt;A solid plan contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidated investigation results&lt;/li&gt;
&lt;li&gt;A clear list of affected files&lt;/li&gt;
&lt;li&gt;Explicit intent and boundaries&lt;/li&gt;
&lt;li&gt;Rejected alternatives and caveats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, the plan has become &lt;strong&gt;shared team knowledge&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It is agreed upon before implementation and often referenced longer than the code itself.&lt;/p&gt;

&lt;p&gt;If that’s the case, it should be treated like code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Anything that changes over time should live where change history is tracked best:&lt;br&gt;&lt;br&gt;
in git.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2) Reviews Can Start with Intent, Not Diff Reading
&lt;/h3&gt;

&lt;p&gt;Without a plan, reviews inevitably start like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the diff top to bottom&lt;/li&gt;
&lt;li&gt;Try to infer intent&lt;/li&gt;
&lt;li&gt;Guess the impact&lt;/li&gt;
&lt;li&gt;Discover late that “this wasn’t the intended direction”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a plan, reviews start somewhere else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this the right direction?&lt;/li&gt;
&lt;li&gt;Is this review boundary safe?&lt;/li&gt;
&lt;li&gt;Are the stated invariants preserved?&lt;/li&gt;
&lt;li&gt;Which parts deserve the most attention?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, reviews shift from&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“code inspection” to “decision validation.”&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3) PR Sizes Shrink Structurally
&lt;/h3&gt;

&lt;p&gt;When the review boundary is written explicitly in the plan, agreement on scope comes before implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This PR stops here&lt;/li&gt;
&lt;li&gt;Behavior is unchanged in this step&lt;/li&gt;
&lt;li&gt;Follow-up work goes into the next PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive PRs become rare&lt;/li&gt;
&lt;li&gt;“While I’m here” changes are easier to reject&lt;/li&gt;
&lt;li&gt;Reviews stop collapsing under their own weight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, we convert&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“how much AI can generate” into “how much humans can reasonably decide.”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Plans Improve Review Quality—for Humans and AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Human Reviews Shift Focus
&lt;/h3&gt;

&lt;p&gt;Reviewing diffs alone leads to flat review criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the code style okay?&lt;/li&gt;
&lt;li&gt;Does it look bug-free?&lt;/li&gt;
&lt;li&gt;Are tests present?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are important, but increasingly they are areas where&lt;br&gt;&lt;br&gt;
automated checks and AI reviews excel.&lt;/p&gt;

&lt;p&gt;That means human review time is better spent on &lt;strong&gt;judgment-heavy areas&lt;/strong&gt;:&lt;br&gt;
design intent, boundaries, invariants, and risk.&lt;/p&gt;

&lt;p&gt;With a plan, reviews gain contrast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-impact areas get deeper scrutiny&lt;/li&gt;
&lt;li&gt;“Non-goals” are explicitly verified&lt;/li&gt;
&lt;li&gt;Risky paths are reproduced locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result isn’t less review effort—it’s &lt;strong&gt;better reviews&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  2) AI Reviews Become Intent-Aware
&lt;/h3&gt;

&lt;p&gt;When AI reviews only see diffs, feedback tends to be generic.&lt;/p&gt;

&lt;p&gt;When plans are included, AI reviews can instead check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are non-goals violated?&lt;/li&gt;
&lt;li&gt;Did changes cross the review boundary?&lt;/li&gt;
&lt;li&gt;Are promised invariants preserved?&lt;/li&gt;
&lt;li&gt;Are impact estimates missing something?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI reviews evolve from&lt;br&gt;&lt;br&gt;
&lt;strong&gt;diff interpretation to intent verification.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decisive Value of Plans in the AI Era
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Plans Save Context Window
&lt;/h3&gt;

&lt;p&gt;Plans change what we pass to AI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to paste full chat histories&lt;/li&gt;
&lt;li&gt;No need to dump raw logs&lt;/li&gt;
&lt;li&gt;Only relevant documents and summarized findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A plan is a &lt;strong&gt;compressed state representation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Readable by humans&lt;/li&gt;
&lt;li&gt;Easy for AI to understand&lt;/li&gt;
&lt;li&gt;Cheap in context window usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leaves more room for actual reasoning.&lt;/p&gt;




&lt;h3&gt;
  
  
  2) Plans Shortcut File Discovery
&lt;/h3&gt;

&lt;p&gt;For similar follow-up tasks, the most expensive step is often figuring out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which files matter&lt;/li&gt;
&lt;li&gt;What depends on what&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If plans are committed, AI doesn’t need to re-scan the entire codebase.&lt;br&gt;&lt;br&gt;
It can reuse explicit file lists, dependencies, and constraints—&lt;br&gt;&lt;br&gt;
avoiding unnecessary context pollution.&lt;/p&gt;
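
&lt;p&gt;As a sketch of what that reuse can look like (the plan format and the &lt;code&gt;## Affected files&lt;/code&gt; heading are assumptions, not a standard):&lt;/p&gt;

```python
# Sketch: pull the "Affected files" list out of a committed plan instead of
# re-scanning the codebase. The markdown section name is a team convention
# assumed here, not a standard.
def affected_files(plan_text, heading="## Affected files"):
    """Return the bullet items listed under the given heading."""
    files, in_section = [], False
    for line in plan_text.splitlines():
        if line.strip() == heading:
            in_section = True
        elif in_section and line.startswith("#"):
            break  # reached the next section: stop collecting
        elif in_section and line.strip().startswith("- "):
            files.append(line.strip()[2:].strip("`"))
    return files

plan = """# Plan: add OAuth2 login
## Affected files
- `app/models/user.rb`
- `config/routes.rb`
## Risks
- token expiry edge cases
"""
print(affected_files(plan))  # ['app/models/user.rb', 'config/routes.rb']
```

&lt;p&gt;An agent given this list can open exactly those files first, rather than crawling the repository.&lt;/p&gt;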




&lt;h3&gt;
  
  
  3) Plans Promote Knowledge
&lt;/h3&gt;

&lt;p&gt;Plans are short-lived by nature, but they often contain durable knowledge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feature-level knowledge can graduate into documentation&lt;/li&gt;
&lt;li&gt;Constraints and boundaries can move into architecture docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When teams consistently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate knowledge in plans&lt;/li&gt;
&lt;li&gt;Promote it after completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They stop relying on past chat logs as an external memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plans Are Not Documents—They Are Process
&lt;/h2&gt;

&lt;p&gt;The goal is not to treat plans as optional artifacts.&lt;/p&gt;

&lt;p&gt;A healthy loop looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit plans&lt;/li&gt;
&lt;li&gt;Review plans (intent, boundaries, non-goals)&lt;/li&gt;
&lt;li&gt;Update plans alongside implementation&lt;/li&gt;
&lt;li&gt;Promote lasting knowledge into canonical docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This loop becomes &lt;strong&gt;the development process in the AI era&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Faster AI-generated code makes reviews more fragile&lt;/li&gt;
&lt;li&gt;The root cause is not diff size, but the loss of a clear review starting point&lt;/li&gt;
&lt;li&gt;Treating plans as first-class artifacts enables:

&lt;ul&gt;
&lt;li&gt;Reviews to start from intent&lt;/li&gt;
&lt;li&gt;Structurally smaller PRs&lt;/li&gt;
&lt;li&gt;Higher-quality human and AI reviews&lt;/li&gt;
&lt;li&gt;Lower context window usage&lt;/li&gt;
&lt;li&gt;Cheaper future exploration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Plans are the unit of review, generation, and knowledge in AI-assisted development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s why plans are deliverables—and why they belong in git.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>codereview</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The 2M Token Trap: Why "Context Stuffing" Kills Reasoning</title>
      <dc:creator>Shinsuke Matsuda</dc:creator>
      <pubDate>Sun, 11 Jan 2026 14:56:20 +0000</pubDate>
      <link>https://forem.com/xhack/the-2m-token-trap-why-context-stuffing-kills-reasoning-1kck</link>
      <guid>https://forem.com/xhack/the-2m-token-trap-why-context-stuffing-kills-reasoning-1kck</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3ygmzam4w02cntim3dk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3ygmzam4w02cntim3dk.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why more context often makes LLMs worse—and what to do instead&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Context Window Arms Race
&lt;/h3&gt;

&lt;p&gt;The expansion of context windows has been staggering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early 2023&lt;/strong&gt;: GPT-4 launches with 32K tokens
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;November 2023&lt;/strong&gt;: GPT-4 Turbo extends to 128K
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;February 2024&lt;/strong&gt;: Gemini 1.5 hits 1M—later expanding to 2M
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 2024&lt;/strong&gt;: Claude 3 reaches 200K
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In just two years, context capacity grew from 32K to 2M tokens—a &lt;strong&gt;62× increase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The developer intuition was immediate and seemingly logical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“If everything fits, just put everything in.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Paradox: More Context, Worse Results
&lt;/h3&gt;

&lt;p&gt;Practitioners are discovering a counterintuitive pattern:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;the more context you provide, the worse the model performs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common symptoms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passing an entire codebase → misunderstood design intent
&lt;/li&gt;
&lt;li&gt;Including exhaustive logs → critical errors overlooked
&lt;/li&gt;
&lt;li&gt;Providing comprehensive documentation → unfocused responses
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phenomenon has a name in the research literature:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“Lost in the Middle”&lt;/strong&gt; (Liu et al., 2023).&lt;br&gt;&lt;br&gt;
Information placed in the middle of long contexts is systematically neglected.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A context window is not just storage capacity. It is cognitive load.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article explores why &lt;em&gt;Context Stuffing&lt;/em&gt; fails, what Anthropic’s Claude Code reveals about effective context management, and how to shift from &lt;strong&gt;Prompt Engineering&lt;/strong&gt; to &lt;strong&gt;Context Engineering&lt;/strong&gt;—the discipline of architectural curation for AI systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why “More Context” Doesn’t Mean “Better Understanding”
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Capacity vs. Capability
&lt;/h3&gt;

&lt;p&gt;We must distinguish between two fundamentally different concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity&lt;/strong&gt;: How much data fits in memory (e.g. 200K, 2M tokens)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability&lt;/strong&gt;: The ability to prioritize, connect, and reason over that data
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just because a model can ingest 2 million tokens does not mean it can pay attention to them equally.&lt;/p&gt;

&lt;p&gt;Providing a 2M-token context to an LLM is like handing a new developer &lt;strong&gt;10,000 pages of documentation on day one&lt;/strong&gt; and expecting them to fix a bug in five minutes.&lt;br&gt;&lt;br&gt;
They won’t understand the system—they will immediately drown in it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attention Dilution and “Lost in the Middle”
&lt;/h3&gt;

&lt;p&gt;This limitation is rooted in the &lt;strong&gt;self-attention mechanism&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
As token count increases, attention distributions flatten, signal-to-noise ratios drop, and relevant information gets buried.&lt;/p&gt;

&lt;p&gt;Liu et al. (2023) demonstrated that information placed in the middle of long contexts is &lt;strong&gt;systematically neglected—even when explicitly relevant&lt;/strong&gt;—while content at the beginning and end receives disproportionate attention.&lt;/p&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context expansion increases what can be accessed, not what can be understood.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Real-World Symptoms
&lt;/h3&gt;

&lt;p&gt;In practice, adding information often &lt;em&gt;degrades&lt;/em&gt; accuracy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entire codebases → architectural misinterpretation
&lt;/li&gt;
&lt;li&gt;Exhaustive logs → critical signals buried
&lt;/li&gt;
&lt;li&gt;Comprehensive docs → answers drift off-topic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not failures of model intelligence.&lt;br&gt;&lt;br&gt;
They are failures of &lt;strong&gt;information structure and prioritization&lt;/strong&gt;—problems no amount of context capacity can solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The 75% Rule: Lessons from Claude Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Quality Degradation in Long Sessions
&lt;/h3&gt;

&lt;p&gt;The strongest evidence against Context Stuffing comes from &lt;strong&gt;Claude Code&lt;/strong&gt;, Anthropic’s terminal-based coding agent with a 200K context window.&lt;/p&gt;

&lt;p&gt;Soon after launch, users reported recurring issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code quality degraded over long sessions
&lt;/li&gt;
&lt;li&gt;Earlier design decisions were forgotten
&lt;/li&gt;
&lt;li&gt;Auto-compact sometimes failed, causing infinite loops
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the time, Claude Code routinely used &lt;strong&gt;over 90%&lt;/strong&gt; of its available context.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Auto-Compact at 75%
&lt;/h3&gt;

&lt;p&gt;In September 2024, Anthropic implemented a counterintuitive fix:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Trigger auto-compact when context usage reaches 75%.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~150K tokens used for storage
&lt;/li&gt;
&lt;li&gt;~50K tokens deliberately left empty
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What looked like waste turned out to be the key to &lt;strong&gt;dramatic quality improvements&lt;/strong&gt;.&lt;/p&gt;
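
&lt;p&gt;The mechanism can be sketched as a simple threshold policy (a hypothetical sketch: Claude Code’s internals are not public, and &lt;code&gt;summarize&lt;/code&gt; stands in for whatever compression step the agent applies):&lt;/p&gt;

```python
# Hypothetical sketch of a 75% auto-compact policy; the real Claude Code
# implementation is not public. When usage crosses the threshold, older
# history is compressed into a summary and the context is rebuilt from it.
CONTEXT_LIMIT = 200_000
COMPACT_AT = 0.75  # trigger compaction at 75% usage

def maybe_compact(history, count_tokens, summarize):
    used = sum(count_tokens(m) for m in history)
    if used >= COMPACT_AT * CONTEXT_LIMIT:
        # Compress everything except the most recent turns into one summary.
        summary = summarize(history[:-2])
        return [summary] + history[-2:]
    return history  # still enough headroom: leave the session alone

# Toy demo with a whitespace "tokenizer" and a stub summarizer:
toks = lambda s: len(s.split())
stub = lambda msgs: f"[summary of {len(msgs)} messages]"
print(maybe_compact(["hello world"] * 3, toks, stub))  # unchanged: far below 150K
```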

&lt;h3&gt;
  
  
  Why It Works: Inference Space
&lt;/h3&gt;

&lt;p&gt;Several hypotheses explain why this works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context Compression&lt;/strong&gt; — Low-relevance information is removed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information Restructuring&lt;/strong&gt; — Summaries reorganize scattered data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserving Room for Reasoning&lt;/strong&gt; — Empty space enables generation
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As one developer put it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“That free context space isn’t wasted—it’s where reasoning happens.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This mirrors computer memory behavior:&lt;br&gt;&lt;br&gt;
running at 95% RAM usage doesn’t mean the remaining 5% is wasted; the system needs that slack to keep working. Push to 100%, and everything grinds to a halt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Takeaway
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Filling context to capacity degrades output quality.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Effective context management requires &lt;strong&gt;headroom&lt;/strong&gt;—space reserved for reasoning, not just retrieval.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Three Principles of Context Engineering
&lt;/h2&gt;

&lt;p&gt;The era of prompt wording tweaks is ending.&lt;/p&gt;

&lt;p&gt;As Hamel Husain observed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“AI Engineering is Context Engineering.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The critical skill is no longer &lt;em&gt;what you say&lt;/em&gt; to the model, but &lt;em&gt;what you put in front of it&lt;/em&gt;—and what you deliberately leave out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 1: Isolation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Do not dump the monolith.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Borrow &lt;strong&gt;Bounded Contexts&lt;/strong&gt; from Domain-Driven Design.&lt;br&gt;&lt;br&gt;
Provide the smallest effective context for the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Add OAuth2 authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;User&lt;/code&gt; model
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SessionController&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;routes.rb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Relevant auth middleware
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Billing module
&lt;/li&gt;
&lt;li&gt;CSS styles
&lt;/li&gt;
&lt;li&gt;Unrelated APIs
&lt;/li&gt;
&lt;li&gt;Other test fixtures
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is the minimum context required to solve this problem?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Principle 2: Chaining
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pass artifacts, not histories.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Break workflows into stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan → Execute → Reflect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each stage receives only the previous stage’s output—not the entire conversation history.&lt;br&gt;&lt;br&gt;
This keeps context fresh and signal-dense.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can this be decomposed into stages that pass summaries instead of transcripts?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
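
&lt;p&gt;The staged flow can be sketched in a few lines (a minimal sketch: &lt;code&gt;ask&lt;/code&gt; stands in for any LLM call and is an assumption, not a real API):&lt;/p&gt;

```python
# Sketch of Plan -> Execute -> Reflect chaining (hypothetical; `ask` stands
# in for any LLM call). Each stage receives only the previous stage's
# artifact, never the accumulated transcript.
def run_chained(task, ask):
    plan = ask("Write a short plan for: " + task)      # stage 1: Plan
    result = ask("Execute this plan:\n" + plan)        # stage 2: Execute
    return ask("Reflect on this result:\n" + result)   # stage 3: Reflect

# Stub LLM that records its prompts, to show what each stage actually sees:
prompts = []
def stub_ask(prompt):
    prompts.append(prompt)
    return "[artifact from: " + prompt.splitlines()[0] + "]"

run_chained("add rate limiting", stub_ask)
print(len(prompts))  # 3 calls, each carrying one artifact, not a history
```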

&lt;h3&gt;
  
  
  Principle 3: Headroom
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Never run a model at 100% capacity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adopt the &lt;strong&gt;75% Rule&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Token limits usually cover &lt;em&gt;input + output&lt;/em&gt;. Stuffing 195K tokens into a 200K window leaves almost no room for reasoning.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Have I left enough space for the model to think—not just respond?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
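
&lt;p&gt;The rule is easy to make mechanical with a pre-flight check (a sketch; the helper name is hypothetical, and the token count would come from your model’s real tokenizer):&lt;/p&gt;

```python
# Sketch: enforce the 75% rule before sending a request. A hypothetical
# helper; plug in your model's real token counter.
def within_budget(prompt_tokens, limit=200_000, rule=0.75):
    """True if the prompt leaves at least (1 - rule) of the window free
    for the model's reasoning and output."""
    headroom = limit - prompt_tokens
    return headroom >= (1 - rule) * limit

print(within_budget(150_000))  # True: exactly at the 75% line
print(within_budget(195_000))  # False: 97.5% full, almost no room to think
```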




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Treat the context window as a scarce cognitive resource, not infinite storage.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Why 200K Is the Sweet Spot
&lt;/h2&gt;

&lt;p&gt;Despite 2M-token models, &lt;strong&gt;200K is the practical sweet spot&lt;/strong&gt; for Context Engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cognitive Scale
&lt;/h3&gt;

&lt;p&gt;150K tokens (75% of 200K) is roughly one technical book—about the largest coherent “project state” both humans and LLMs can manage. Beyond that, you need chapters, summaries, and architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost and Latency
&lt;/h3&gt;

&lt;p&gt;Standard self-attention scales as &lt;strong&gt;O(n²)&lt;/strong&gt; in sequence length.&lt;br&gt;&lt;br&gt;
Doubling the context quadruples the attention cost.&lt;br&gt;&lt;br&gt;
200K balances performance, latency, and cost.&lt;/p&gt;
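
&lt;p&gt;A back-of-the-envelope check makes the scaling concrete (standard self-attention only; real systems add optimizations, so treat these ratios as an upper bound):&lt;/p&gt;

```python
# Relative cost of standard self-attention grows with the square of the
# sequence length (an upper bound; optimized kernels change the constants).
cost = lambda n: n ** 2
print(cost(400_000) / cost(200_000))    # 4.0  : doubling context quadruples cost
print(cost(2_000_000) / cost(200_000))  # 100.0: a 2M pass costs ~100x a 200K pass
```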

&lt;h3&gt;
  
  
  Methodological Discipline
&lt;/h3&gt;

&lt;p&gt;200K &lt;strong&gt;forces curation&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Exceeding it is a code smell: unclear boundaries, oversized tasks, or stuffing instead of structuring.&lt;/p&gt;

&lt;p&gt;Anthropic offers 1M tokens—but behind premium tiers.&lt;br&gt;&lt;br&gt;
The implicit message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;1M is for special cases. 200K is the default for a reason.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The constraint is not a limitation—it is &lt;em&gt;the&lt;/em&gt; design principle.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Conclusion: From Prompt Engineering to Context Engineering
&lt;/h2&gt;

&lt;p&gt;The context window arms race delivered a 62× increase in capacity.&lt;br&gt;&lt;br&gt;
But capacity was never the bottleneck.&lt;/p&gt;

&lt;p&gt;The bottleneck is—and always has been—&lt;strong&gt;curation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The shift is fundamental:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt Engineering&lt;/th&gt;
&lt;th&gt;Context Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;“How do I phrase this?”&lt;/td&gt;
&lt;td&gt;“What should the model see?”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimizing words&lt;/td&gt;
&lt;td&gt;Architecting information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-shot prompts&lt;/td&gt;
&lt;td&gt;Multi-stage pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filling capacity&lt;/td&gt;
&lt;td&gt;Preserving headroom&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Three Questions to Ask Before Every Task
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Am I stuffing context just because I can?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Relevant beats exhaustive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is this context isolated to the real problem?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you can’t state the boundary, you haven’t found it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have I left room for the model to think?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Output quality requires input restraint.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The era of prompt engineering rewarded clever wording.&lt;br&gt;&lt;br&gt;
The era of context engineering rewards &lt;strong&gt;architectural judgment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The question is no longer:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What should I say to the model?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What world should the model see?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. References
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Research Papers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Liu et al., &lt;em&gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/em&gt; (2023)
&lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.03172&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools &amp;amp; Methodologies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;planstack.ai — &lt;a href="https://github.com/planstack-ai/planstack" rel="noopener noreferrer"&gt;https://github.com/planstack-ai/planstack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Empirical Studies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Greg Kamradt, &lt;em&gt;Needle in a Haystack&lt;/em&gt;
&lt;a href="https://github.com/gkamradt/LLMTest_NeedleInAHaystack" rel="noopener noreferrer"&gt;https://github.com/gkamradt/LLMTest_NeedleInAHaystack&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Articles &amp;amp; Analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;How Claude Code Got Better by Protecting More Context&lt;/em&gt; (2024)&lt;br&gt;&lt;br&gt;
&lt;a href="https://hyperdev.matsuoka.com/p/how-claude-code-got-better-by-protecting" rel="noopener noreferrer"&gt;https://hyperdev.matsuoka.com/p/how-claude-code-got-better-by-protecting&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hamel Husain, &lt;em&gt;Context Rot&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://hamel.dev/notes/llm/rag/p6-context_rot.html" rel="noopener noreferrer"&gt;https://hamel.dev/notes/llm/rag/p6-context_rot.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>promptengineering</category>
      <category>contextengineering</category>
    </item>
  </channel>
</rss>
