<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Nichev</title>
    <description>The latest articles on Forem by John Nichev (@johnnichev).</description>
    <link>https://forem.com/johnnichev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866641%2F8639e0cb-f404-407e-bc9f-e8e1b7212087.jpg</url>
      <title>Forem: John Nichev</title>
      <link>https://forem.com/johnnichev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/johnnichev"/>
    <language>en</language>
    <item>
      <title>I Spent Four Weeks Reading 200+ Sources on Context Engineering. Here's What I Built.</title>
      <dc:creator>John Nichev</dc:creator>
      <pubDate>Wed, 08 Apr 2026 22:13:17 +0000</pubDate>
      <link>https://forem.com/johnnichev/i-spent-four-weeks-reading-200-sources-on-context-engineering-heres-what-i-built-2fem</link>
      <guid>https://forem.com/johnnichev/i-spent-four-weeks-reading-200-sources-on-context-engineering-heres-what-i-built-2fem</guid>
      <description>&lt;p&gt;A launch post for &lt;a href="https://skills.nichevlabs.com" rel="noopener noreferrer"&gt;nv:context&lt;/a&gt;, a Claude Code skill that sets up context engineering for any repository in three minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wall I kept hitting
&lt;/h2&gt;

&lt;p&gt;I build production Python services with AI coding agents. Claude Code, Cursor, Copilot, the whole rotation. And no matter how carefully I wrote my &lt;code&gt;CLAUDE.md&lt;/code&gt; files, I kept hitting the same wall: the agent would forget rules mid-session, run the wrong test command, or touch files it shouldn't.&lt;/p&gt;

&lt;p&gt;I did what most people do. I wrote longer &lt;code&gt;CLAUDE.md&lt;/code&gt; files. Added more "don't do X" instructions. Tried &lt;code&gt;/init&lt;/code&gt;. Nothing clicked.&lt;/p&gt;

&lt;p&gt;Eventually I sat down to figure out why. Four weeks later, I had read 200+ sources on what the research calls &lt;strong&gt;context engineering&lt;/strong&gt;. The picture was clearer than I expected, and uglier. Here's the punchline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bad context doesn't just not help. It actively hurts.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETH Zurich&lt;/strong&gt; found that auto-generated agent config files &lt;em&gt;reduce&lt;/em&gt; success rates by 3% and increase costs by 20%+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;METR&lt;/strong&gt; ran a controlled study on experienced developers and found they were &lt;strong&gt;19% slower&lt;/strong&gt; with AI tools when context was poorly managed, despite &lt;em&gt;feeling&lt;/em&gt; 24% faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FlowHunt / LongMemEval&lt;/strong&gt; showed that a focused 300-token context outperforms an unfocused 113K-token context on the same task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dex Horthy&lt;/strong&gt; has shown that using 40% of the context window outperforms using 90%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic and Manus&lt;/strong&gt; production data: below 60% context utilization is safe. At 70%, precision drops. At 85%, hallucinations begin.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The thing that shifted my thinking was Philipp Schmid's line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Most agent failures are not model failures. They are context failures."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The 8 laws that came out of it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Less is more.&lt;/strong&gt; Every line in your context competes with the actual task for attention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Landmines, not maps.&lt;/strong&gt; Document what agents can't discover by reading the code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commands beat prose.&lt;/strong&gt; One snippet showing &lt;code&gt;npm run test -- --coverage --maxWorkers=2&lt;/code&gt; beats three paragraphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is finite.&lt;/strong&gt; Frontier LLMs follow roughly 150 to 200 instructions consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive disclosure.&lt;/strong&gt; Layer it: root &lt;code&gt;CLAUDE.md&lt;/code&gt;, subdirectory &lt;code&gt;CLAUDE.md&lt;/code&gt;, skills, MCP tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks for determinism.&lt;/strong&gt; If a rule MUST be followed 100% of the time, use a hook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative instructions backfire.&lt;/strong&gt; "Don't use moment.js" makes models more likely to use moment.js. Say "MUST use date-fns" instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compact proactively.&lt;/strong&gt; Don't wait for Claude to compact at 95%. Update &lt;code&gt;HANDOFF.md&lt;/code&gt;, run &lt;code&gt;/clear&lt;/code&gt;, start fresh.&lt;/li&gt;
&lt;/ol&gt;
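&lt;p&gt;For the "hooks for determinism" law, here is the shape such a rule takes in practice. A minimal sketch of a &lt;code&gt;.claude/settings.json&lt;/code&gt; hook that runs a linter after every file edit; the event name and field layout follow Claude Code's hook configuration, but the matcher and command are illustrative assumptions, not part of nv:context:&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "ruff check ." }
        ]
      }
    ]
  }
}
```

&lt;p&gt;Because the hook fires on every matching tool call, compliance doesn't depend on what the model remembers mid-session.&lt;/p&gt;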

&lt;h2&gt;
  
  
  The hierarchy of leverage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Priority  Layer                    Compliance  Cost to set up
─────────────────────────────────────────────────────────────
   1      Verification             100%        Medium
   2      CLAUDE.md / AGENTS.md    90-95%      Low
   3      Hooks                    100%        Low
   4      Skills                   ~79%        Medium
   5      Subagent patterns        Variable    Medium
   6      Session management       Manual      Low
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most people optimize from the bottom up. The best engineers start at the top.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nv:context does
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Interviews you&lt;/strong&gt; about your tools, pain points, landmines, and workflow preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scans your codebase with parallel subagents&lt;/strong&gt; to find non-obvious patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scores your setup&lt;/strong&gt; on all six leverage layers (0-10 per layer, 0-60 overall)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates tailored configs&lt;/strong&gt; for only the tools you actually use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sets up hooks&lt;/strong&gt; for deterministic enforcement&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creates session management infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Installs compounding engineering&lt;/strong&gt; (optional GitHub Action)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Production proof
&lt;/h2&gt;

&lt;h3&gt;
  
  
  selectools (Python SDK, 4,612 tests)
&lt;/h3&gt;

&lt;p&gt;Starting state: L3 maturity, 49/60 leverage score, 440-line &lt;code&gt;CLAUDE.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After: L5-L6, 58/60. &lt;code&gt;CLAUDE.md&lt;/code&gt; went from 440 lines to 67 (-85%). Token budget dropped 53%.&lt;/p&gt;

&lt;h3&gt;
  
  
  nichevlabs (multi-product SaaS)
&lt;/h3&gt;

&lt;p&gt;Starting state: L4 maturity, 17/60 leverage score.&lt;/p&gt;

&lt;p&gt;The smoking gun: an 805-line &lt;code&gt;SESSION.md&lt;/code&gt; that got loaded on every session start. 17,000 tokens. On every conversation. nv:context's token budget report made it impossible to ignore.&lt;/p&gt;

&lt;p&gt;After: L6, 49/60 (up 32 points). &lt;code&gt;SESSION.md&lt;/code&gt; went from 805 lines to 59 (-93%). Saved 15,800 tokens per session. A parallel bug-hunt subagent surfaced 81 real bugs while it was analyzing the codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  sheriff (Python + TypeScript)
&lt;/h3&gt;

&lt;p&gt;Already-strong setup. L4 maturity, 36/60 leverage score going in.&lt;/p&gt;

&lt;p&gt;After: L5, 42/60 (+6 points). Smaller delta than the others. Incremental polish, not a rewrite.&lt;/p&gt;

&lt;p&gt;The through-line across all three: the skill is not a template generator. Same methodology, radically different outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it works with
&lt;/h2&gt;

&lt;p&gt;The generated &lt;code&gt;AGENTS.md&lt;/code&gt; is read by 25+ AI coding tools, including Claude Code, Cursor, GitHub Copilot, Aider, Codeium, Continue, Windsurf, Zed, Gemini CLI, and Cline. Tool-specific files only get generated for the tools you actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add johnnichev/nv-context &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open any project and run &lt;code&gt;/nv-context&lt;/code&gt;. Three-minute interview, thirty seconds of parallel analysis, done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The research library
&lt;/h2&gt;

&lt;p&gt;Full research library: &lt;a href="https://skills.nichevlabs.com/research" rel="noopener noreferrer"&gt;https://skills.nichevlabs.com/research&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full synthesis (10 laws, 4 operations, 7-component context stack): &lt;a href="https://skills.nichevlabs.com/synthesis" rel="noopener noreferrer"&gt;https://skills.nichevlabs.com/synthesis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Primary sources include Anthropic engineering blog, Google DeepMind research, OpenAI Agents docs, ETH Zurich agent config paper, METR controlled developer study, JetBrains NeurIPS 2025 paper, Manus production data, GitHub's analysis of 2,500 public &lt;code&gt;AGENTS.md&lt;/code&gt; files, Boris Cherny, and Dex Horthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Three production repos is a small sample.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~60% token overhead on the first run.&lt;/strong&gt; For context, the first-run benchmark scored a 100% pass rate vs a 45.8% baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research coverage is Python and JavaScript heavy.&lt;/strong&gt; Rust, Go, Kotlin, and Elixir are thinner.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The skill is opinionated.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If you build AI coding agents for a living
&lt;/h2&gt;

&lt;p&gt;Context engineering is the discipline that separates AI tools that work in demos from AI tools that work in production. If you have been writing ever-longer &lt;code&gt;CLAUDE.md&lt;/code&gt; files and things still keep not quite working, try nv:context on your repo.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/johnnichev/nv-context" rel="noopener noreferrer"&gt;https://github.com/johnnichev/nv-context&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Landing: &lt;a href="https://skills.nichevlabs.com" rel="noopener noreferrer"&gt;https://skills.nichevlabs.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install: &lt;code&gt;npx skills add johnnichev/nv-context -g -y&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two things launch alongside nv:context today. First, &lt;a href="https://github.com/johnnichev/selectools" rel="noopener noreferrer"&gt;selectools&lt;/a&gt;, the Python agent framework I built that taught me I needed a methodology. Second, the landing page for this methodology, built entirely with &lt;a href="https://github.com/johnnichev/nv-design" rel="noopener noreferrer"&gt;nv:design&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>claude</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>SELECTOOLS: Multi-agent graphs, tool calling, RAG, 50 evaluators, PII redaction. All in one pip install.</title>
      <dc:creator>John Nichev</dc:creator>
      <pubDate>Tue, 07 Apr 2026 23:34:20 +0000</pubDate>
      <link>https://forem.com/johnnichev/selectools-multi-agent-graphs-tool-calling-rag-50-evaluators-pii-redaction-all-in-one-pip-bnm</link>
      <guid>https://forem.com/johnnichev/selectools-multi-agent-graphs-tool-calling-rag-50-evaluators-pii-redaction-all-in-one-pip-bnm</guid>
      <description>&lt;p&gt;Releasing v0.20.1 of &lt;a href="https://github.com/johnnichev/selectools" rel="noopener noreferrer"&gt;selectools&lt;/a&gt;, an open-source (Apache-2.0) Python framework for AI agent systems. Supports OpenAI, Anthropic, Gemini, and Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;selectools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The technical hook: how interrupts work after a human pause
&lt;/h3&gt;

&lt;p&gt;LangGraph's &lt;code&gt;interrupt()&lt;/code&gt; mechanism re-executes the entire node body on resume. This is by design and falls out of LangGraph's checkpoint-replay model. The official guidance is to make pre-interrupt side effects idempotent, place expensive work after the &lt;code&gt;interrupt()&lt;/code&gt; call, or split side effects into a separate downstream node. It works, but every node that needs human input has to be structured around the resume semantics. It's a leaky abstraction.&lt;/p&gt;

&lt;p&gt;Selectools uses Python generators instead. The node yields an &lt;code&gt;InterruptRequest&lt;/code&gt;. The graph resumes at the exact yield point via &lt;code&gt;generator.send()&lt;/code&gt;. Expensive work runs exactly once, with no idempotency contortions required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expensive_llm_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# runs once
&lt;/span&gt;    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;InterruptRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# resumes here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The v0.18.0 changelog documents the contrast directly: &lt;em&gt;"Resumes at the exact yield point (LangGraph restarts the whole node)."&lt;/em&gt;&lt;/p&gt;
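&lt;p&gt;The mechanism underneath is plain Python generator semantics, nothing framework-specific. A stripped-down sketch (illustrative, not the selectools API, and sync rather than async): the driver advances the generator to its first yield, collects the interrupt request, and later resumes at exactly that point with &lt;code&gt;generator.send()&lt;/code&gt;, so the expensive step runs once:&lt;/p&gt;

```python
# Plain-Python sketch of the pause/resume mechanism (illustrative, not the
# selectools API). The driver runs the node up to its yield, pauses, and
# later resumes at that exact point with generator.send().

calls = []

def expensive_analysis(draft):
    calls.append(draft)            # count invocations to prove single execution
    return draft.upper()

def review_node(state):
    analysis = expensive_analysis(state["draft"])       # runs once
    decision = yield {"prompt": "Approve?", "payload": analysis}
    state["approved"] = decision == "yes"               # resumes here

state = {"draft": "ship it"}
gen = review_node(state)
request = next(gen)                # run up to the yield: the pause point
# ... a human reviews request["payload"] and answers ...
try:
    gen.send("yes")                # resume at the yield with the answer
except StopIteration:
    pass                           # generator finished normally

assert state["approved"] is True
assert calls == ["ship it"]        # expensive_analysis ran only once
```

&lt;p&gt;The &lt;code&gt;try/except StopIteration&lt;/code&gt; is the whole resume protocol: normal completion of the generator marks the node as done.&lt;/p&gt;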

&lt;h3&gt;
  
  
  Multi-Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;AgentGraph&lt;/code&gt; is a directed graph executor for agent nodes. Routing is plain Python functions, no learned router, no DSL. This is deliberate: in production agent systems, you generally want deterministic control flow with LLMs doing the reasoning within nodes, not deciding the graph topology.&lt;/p&gt;

&lt;p&gt;Key design choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ContextMode&lt;/strong&gt; controls what history flows between nodes: &lt;code&gt;LAST_MESSAGE&lt;/code&gt; (default), &lt;code&gt;LAST_N&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SUMMARY&lt;/code&gt;, &lt;code&gt;CUSTOM&lt;/code&gt;. Prevents context explosion where downstream agents get drowned in irrelevant upstream conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution&lt;/strong&gt; with &lt;code&gt;MergePolicy&lt;/code&gt; (LAST_WINS, FIRST_WINS, APPEND) for fan-out/fan-in patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loop and stall detection&lt;/strong&gt; via state hashing. The graph tracks whether state is actually changing.&lt;/li&gt;
&lt;/ul&gt;
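&lt;p&gt;The state-hashing idea behind stall detection is easy to sketch in isolation (a generic illustration, not the selectools internals): digest the state after each step and stop once the digest stays identical for a few consecutive steps:&lt;/p&gt;

```python
# Generic illustration of loop/stall detection via state hashing (not the
# selectools internals): stop once the state digest has stayed identical
# for `patience` consecutive steps.
import hashlib
import json

def state_digest(state):
    # Stable digest of a JSON-serializable state dict
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def run_with_stall_detection(step, state, max_steps=20, patience=3):
    unchanged = 0
    for _ in range(max_steps):
        before = state_digest(state)
        state = step(state)
        if state_digest(state) == before:
            unchanged += 1
            if unchanged == patience:
                return state, "stalled"     # state stopped changing
        else:
            unchanged = 0
    return state, "max_steps"

# Toy step function: makes progress five times, then stops changing anything
def step(state):
    if state["n"] != 5:
        return {"n": state["n"] + 1}
    return state

final, reason = run_with_stall_detection(step, {"n": 0})
assert final == {"n": 5} and reason == "stalled"
```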

&lt;h3&gt;
  
  
  SupervisorAgent
&lt;/h3&gt;

&lt;p&gt;Four coordination strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;plan_and_execute&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM generates a JSON plan, agents execute sequentially&lt;/td&gt;
&lt;td&gt;Structured tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;round_robin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agents take turns, supervisor checks completion each round&lt;/td&gt;
&lt;td&gt;Iterative refinement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dynamic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM router selects best agent per step&lt;/td&gt;
&lt;td&gt;Heterogeneous tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;magentic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Magentic-One: Task/Progress Ledgers + auto-replan&lt;/td&gt;
&lt;td&gt;Autonomous research&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;magentic&lt;/code&gt; strategy implements the Magentic-One pattern from Microsoft Research. &lt;code&gt;ModelSplit&lt;/code&gt; lets you use expensive models for planning and cheap models for execution (70-90% cost reduction).&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Eval Framework
&lt;/h3&gt;

&lt;p&gt;50 evaluators ship with the library (no paid service required): 30 deterministic + 20 LLM-as-judge. Plus A/B pairwise comparison, regression detection, JUnit XML for CI, and HTML reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Engineering Rigor: Autonomous Bug Hunts + Pre-Launch Security Audit
&lt;/h3&gt;

&lt;p&gt;The bug-hunting story is the part of this project I'm proudest of, and every claim below is in the public CHANGELOG.md.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.19.1 Ralph Loop Bug Hunt.&lt;/strong&gt; Autonomous convergence system that runs 8 passes across all 7 modules until 3 consecutive clean passes. Result: ~90 bugs fixed and 254 new regression tests added (tests went from 2,664 to 2,918). Selected fixes from the changelog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_tool_executor.py&lt;/code&gt;: ThreadPoolExecutor singleton for deadlock prevention&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_provider_caller.py&lt;/code&gt;: async observer events on LLM cache hits&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_openai_compat.py&lt;/code&gt;: tool call deltas flushed after stream end (Ollama compat)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fallback.py&lt;/code&gt;: mid-stream fallback corruption&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bm25.py&lt;/code&gt;: atomic snapshot under lock for concurrent clear/add safety&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evals/llm_evaluators.py&lt;/code&gt;: prompt injection fencing on user-controlled fields with &lt;code&gt;&amp;lt;&amp;lt;&amp;lt;BEGIN_USER_CONTENT&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt; delimiters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.19.1 RAG Adversarial Bug Hunt.&lt;/strong&gt; Eight edge-case fixes including ChromaVectorStore &lt;code&gt;n_results&lt;/code&gt; clamping for empty collections, HybridSearcher &lt;code&gt;None&lt;/code&gt; handling for &lt;code&gt;vector_top_k&lt;/code&gt;/&lt;code&gt;keyword_top_k&lt;/code&gt;, ContextualChunker prompt template validation, PDFLoader &lt;code&gt;PdfReadError&lt;/code&gt; raised as &lt;code&gt;ValueError&lt;/code&gt; for encrypted PDFs, BM25 &lt;code&gt;top_k &amp;lt; 1&lt;/code&gt; immediate validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-Launch 5-Agent Parallel Security Audit (v0.20.0).&lt;/strong&gt; 5 Claude subagents ran in parallel against the whole codebase, each focused on a different subsystem (concurrency, None guards, injection, path traversal, crash safety). 56 total findings, with 9 critical security fixes shipped, including: score injection in eval extractors, ReDoS in custom regex, path traversal in &lt;code&gt;ToolLoader&lt;/code&gt;, Anthropic multi-tool message merging, Redis session key collision, async output guardrails, and Redis/Supabase error handling. The full audit is published in &lt;code&gt;docs/SECURITY.md&lt;/code&gt; with every &lt;code&gt;# nosec&lt;/code&gt; annotation reviewed individually.&lt;/p&gt;

&lt;p&gt;Some of these patterns came from reading the LangChain, CrewAI, AutoGen, and LlamaIndex source while building the migration guides. The LangGraph HITL pattern, where the entire node restarts on resume, is the clearest example, and the generator-based resume described above is selectools' answer to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Agent Patterns
&lt;/h3&gt;

&lt;p&gt;Four high-level patterns ship in &lt;code&gt;selectools.patterns&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PlanAndExecuteAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM generates a plan, executes subtasks sequentially&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ReflectiveAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Self-critique loop with configurable quality threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DebateAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Two-agent adversarial debate + synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TeamLeadAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lead agent coordinates specialists with load balancing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Enterprise Hardening
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability markers:&lt;/strong&gt; &lt;code&gt;@stable&lt;/code&gt;, &lt;code&gt;@beta&lt;/code&gt;, &lt;code&gt;@deprecated(since, replacement)&lt;/code&gt; decorators for public API signalling. Introspect via &lt;code&gt;obj.__stability__&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace HTML viewer:&lt;/strong&gt; &lt;code&gt;trace_to_html(trace)&lt;/code&gt; renders any &lt;code&gt;AgentTrace&lt;/code&gt; as a standalone waterfall HTML timeline. No JS framework, no external deps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM:&lt;/strong&gt; &lt;code&gt;sbom.json&lt;/code&gt; (CycloneDX 1.6) with all core production dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility matrix:&lt;/strong&gt; Python 3.9-3.13 × provider SDK × optional deps in &lt;code&gt;docs/COMPATIBILITY.md&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Serve &amp;amp; Deploy
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;selectools serve agent.yaml&lt;/code&gt; starts a Starlette ASGI server with a playground UI. Define agents in YAML, pick from 5 templates (customer_support, data_analyst, research_assistant, code_reviewer, rag_chatbot). Production additions: PostgresCheckpointStore, TraceStore (3 backends), &lt;code&gt;compose()&lt;/code&gt; for tool chaining, &lt;code&gt;retry()&lt;/code&gt; / &lt;code&gt;cache_step()&lt;/code&gt; pipeline wrappers, type-safe step contracts, and streaming composition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tests + Coverage
&lt;/h3&gt;

&lt;p&gt;4,612 tests (95% coverage) across Python 3.9-3.13, with real-API evaluations against OpenAI, Anthropic, and Gemini. Includes 28 Hypothesis property-based tests, 15 thread-safety smoke tests (10 threads × 20 ops with synchronized start), and 16 production integration simulations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Also new in v0.20.x
&lt;/h3&gt;

&lt;p&gt;An early visual agent graph builder at &lt;a href="https://selectools.dev/builder/" rel="noopener noreferrer"&gt;https://selectools.dev/builder/&lt;/a&gt; (49KB self-contained HTML, exports to YAML or Python). Works but I'm still polishing edges, so &lt;code&gt;pip install&lt;/code&gt; is the recommended path right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/johnnichev/selectools" rel="noopener noreferrer"&gt;https://github.com/johnnichev/selectools&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Quickstart: &lt;a href="https://selectools.dev/QUICKSTART/" rel="noopener noreferrer"&gt;https://selectools.dev/QUICKSTART/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Changelog: &lt;a href="https://github.com/johnnichev/selectools/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;https://github.com/johnnichev/selectools/blob/main/CHANGELOG.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;code&gt;pip install selectools&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why I Built Selectools (and What I Learned Along the Way)</title>
      <dc:creator>John Nichev</dc:creator>
      <pubDate>Tue, 07 Apr 2026 23:22:56 +0000</pubDate>
      <link>https://forem.com/johnnichev/why-i-built-selectools-and-what-i-learned-along-the-way-59fd</link>
      <guid>https://forem.com/johnnichev/why-i-built-selectools-and-what-i-learned-along-the-way-59fd</guid>
      <description>&lt;p&gt;Every AI agent framework makes the same promise: "connect your LLM to tools and go." Then you start building.&lt;/p&gt;

&lt;p&gt;You discover that LangChain needs 5 packages to do what should take 1. That LCEL's &lt;code&gt;|&lt;/code&gt; operator hides a &lt;code&gt;Runnable&lt;/code&gt; protocol that breaks your debugger. That LangSmith costs money to see what your own code is doing. That when your agent graph pauses for human input, LangGraph restarts the entire node from scratch.&lt;/p&gt;

&lt;p&gt;I hit every one of these at work. We were building AI agents for real users, not demos, not prototypes, but production systems handling actual customer requests. The existing frameworks weren't built for this.&lt;/p&gt;

&lt;p&gt;So I built selectools.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually needed
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool calling that just works.&lt;/strong&gt; Define a function, the LLM calls it. No adapter layers, no schema gymnastics. Works the same across OpenAI, Anthropic, Gemini, and Ollama.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traces without a SaaS.&lt;/strong&gt; Every &lt;code&gt;run()&lt;/code&gt; should tell me exactly what happened, which tools were called, why, how long each step took, what it cost. Not "sign up for our platform to see your own logs."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrails that ship with the agent.&lt;/strong&gt; PII detection, injection defense, topic blocking, configured once, enforced everywhere. Not a separate package to evaluate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-agent orchestration in plain Python.&lt;/strong&gt; When I need 3 agents to collaborate, I want Python routing functions. Not a state graph DSL, not a compile step, not Pregel channels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One command to deploy.&lt;/strong&gt; &lt;code&gt;selectools serve agent.yaml&lt;/code&gt; gives me HTTP endpoints, SSE streaming, and a chat playground. Not "install FastAPI, create an app, add routes, configure CORS, handle SSE..."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What selectools looks like today
&lt;/h2&gt;

&lt;p&gt;A 3-agent pipeline is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentGraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a blog post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A composable pipeline uses &lt;code&gt;|&lt;/code&gt; on plain functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summarize&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;translate&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Long article text...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
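&lt;p&gt;The point of &lt;code&gt;|&lt;/code&gt; over plain functions is that it needs no heavyweight protocol underneath. A hypothetical sketch (the &lt;code&gt;Step&lt;/code&gt; class here is illustrative, not selectools' actual implementation): a thin wrapper whose &lt;code&gt;__or__&lt;/code&gt; appends the next stage and whose &lt;code&gt;run()&lt;/code&gt; threads a value through every stage in order:&lt;/p&gt;

```python
# Hypothetical sketch of "|" composition over plain functions (illustrative,
# not selectools' actual implementation).

class Step:
    def __init__(self, *fns):
        self.fns = list(fns)

    def __or__(self, other):
        # Accept either another Step or a bare function on the right-hand side
        other_fns = other.fns if isinstance(other, Step) else [other]
        return Step(*(self.fns + other_fns))

    def run(self, value):
        for fn in self.fns:
            value = fn(value)   # output of one stage feeds the next
        return value

summarize = Step(lambda text: text.split(".")[0])          # keep first sentence
translate = Step(lambda text: text.replace("hi", "hello"))
tidy = Step(str.strip)

pipeline = summarize | translate | tidy
result = pipeline.run("  hi there. extra detail omitted.")
assert result == "hello there"
```

&lt;p&gt;Errors inside any stage surface as ordinary Python tracebacks through &lt;code&gt;run()&lt;/code&gt;, which is exactly the debuggability argument above.&lt;/p&gt;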



&lt;p&gt;Human-in-the-loop pauses at the yield point and resumes there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expensive_work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# runs once, not twice
&lt;/span&gt;    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;InterruptRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;selectools serve agent.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;4,612 tests at 95% coverage across Python 3.9-3.13&lt;/li&gt;
&lt;li&gt;9 critical security bugs fixed in a pre-launch audit (5-agent parallel bug hunt, 56 total findings)&lt;/li&gt;
&lt;li&gt;44 interactive module docs with runnable examples, stability badges, and Copy Markdown buttons&lt;/li&gt;
&lt;li&gt;40 real-API evaluations against OpenAI, Anthropic, and Gemini&lt;/li&gt;
&lt;li&gt;76 runnable examples&lt;/li&gt;
&lt;li&gt;50 built-in evaluators (no paid service needed)&lt;/li&gt;
&lt;li&gt;152 model definitions with pricing data&lt;/li&gt;
&lt;li&gt;Apache-2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The latest milestone: a visual agent builder
&lt;/h2&gt;

&lt;p&gt;The newest addition is a visual agent builder that runs entirely in your browser. Drag and drop nodes, wire up edges, configure models and tools, then export to YAML or Python. It's deployed on GitHub Pages at &lt;a href="https://selectools.dev/builder/" rel="noopener noreferrer"&gt;https://selectools.dev/builder/&lt;/a&gt; with zero install required. No paid desktop app, no subscription. Just open the URL and start building.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell you honestly
&lt;/h2&gt;

&lt;p&gt;selectools is smaller than LangChain. The community is young. If you need 50 integrations and a managed platform today, LangChain is the safer bet.&lt;/p&gt;

&lt;p&gt;But if you want a library that stays out of your way, where routing is a Python function, errors are Python tracebacks, and you don't need a paid service to see what your agent did, give it a try.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;selectools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/johnnichev/selectools" rel="noopener noreferrer"&gt;https://github.com/johnnichev/selectools&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://selectools.dev" rel="noopener noreferrer"&gt;https://selectools.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cookbook: &lt;a href="https://selectools.dev/COOKBOOK/" rel="noopener noreferrer"&gt;https://selectools.dev/COOKBOOK/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
