<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: mkxultra</title>
    <description>The latest articles on Forem by mkxultra (@mkxultra).</description>
    <link>https://forem.com/mkxultra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2095449%2F7e0d9c0f-2d45-48af-bcd4-6b458c1f99ef.jpg</url>
      <title>Forem: mkxultra</title>
      <link>https://forem.com/mkxultra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mkxultra"/>
    <language>en</language>
    <item>
      <title>Does How You Feed Context to an LLM Agent Change What It Remembers? I Tested With Canary Strings.</title>
      <dc:creator>mkxultra</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:03:49 +0000</pubDate>
      <link>https://forem.com/mkxultra/does-how-you-feed-context-to-an-llm-agent-change-what-it-remembers-i-tested-with-canary-strings-55le</link>
      <guid>https://forem.com/mkxultra/does-how-you-feed-context-to-an-llm-agent-change-what-it-remembers-i-tested-with-canary-strings-55le</guid>
      <description>&lt;p&gt;I work with three LLM agents daily — Claude Code, Codex CLI, and Gemini CLI. Before each task, I load project context (design docs, data models, implementation guides) into the agent so it understands what it's working on.&lt;/p&gt;

&lt;p&gt;But there are multiple ways to load that context. You can have the agent run a shell command and read the output. You can point it at a file. You can split the context across several files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the delivery method affect how much the agent actually retains?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I ran a small experiment using canary strings — unique, unpredictable markers embedded throughout the context — to measure retention objectively. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: What's a "Base Session"?
&lt;/h2&gt;

&lt;p&gt;A base session is a pattern for multi-agent development: you load project context into an agent once, record the &lt;code&gt;session_id&lt;/code&gt;, and resume that session for every subsequent task. Instead of re-explaining your project from scratch each time, the agent picks up where it left off — already understanding your codebase.&lt;/p&gt;

&lt;p&gt;This article isn't about the base session pattern itself (I wrote about that &lt;a href="https://dev.to/mkxultra/i-stopped-repeating-myself-to-every-ai-agent-the-base-session-pattern-50d5"&gt;separately here&lt;/a&gt;). But the experiment below tests &lt;em&gt;how&lt;/em&gt; best to construct one — specifically, whether the method you use to inject context changes what the agent retains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Three Delivery Patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shell command&lt;/td&gt;
&lt;td&gt;Agent executes a command; reads the stdout as context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single file&lt;/td&gt;
&lt;td&gt;Context written to one file; agent reads it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split files&lt;/td&gt;
&lt;td&gt;Context split across 5 files; agent reads them in order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Context
&lt;/h3&gt;

&lt;p&gt;I used a real development context from my own project (konkondb, an AI materialized-view system): 2,664 lines, roughly 50,000 tokens. It includes design docs, data models, CLI specs, and implementation guides — the kind of thing you'd actually feed an agent before asking it to write code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canary Strings: Measuring Retention Objectively
&lt;/h3&gt;

&lt;p&gt;I inserted 19 unique canary strings at section boundaries throughout the context — one before each of the 18 sections, plus one trailing marker after the last section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@@CANARY_001_240ba3af@@   &amp;lt;- before section 1
[section 1 content]

---

@@CANARY_002_dfdf025b@@   &amp;lt;- before section 2
[section 2 content]

---

...

@@CANARY_019_ee18680a@@   &amp;lt;- after the last section
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each canary is generated from a SHA-256 hash of its section's content — deterministic to regenerate, yet effectively impossible for a model to guess. The agent is never told how many canaries exist or what they look like.&lt;/p&gt;

&lt;p&gt;After building the base session, I swap out the canary-injected files for the originals (or delete them entirely), resume the session, and ask: &lt;em&gt;"How many canary strings were there? List all of them."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the agent recalls all 19 with exact values, the full context at least reached the model and was processed well enough to reproduce those markers. Combined with the separate comprehension and detail-accuracy tests, this gives a reasonable picture of retention. Any gaps in canary recall reveal which sections may have been lost or degraded.&lt;/p&gt;
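&lt;p&gt;To make the grading step concrete, here is a minimal scoring sketch (my own illustration, not the exact script from the experiment): it compares the agent's recalled list against the ground-truth canaries and reports recall, precision, and any missing markers.&lt;/p&gt;

```python
def score_recall(ground_truth, recalled):
    """Score a canary-recall answer against the known canary list."""
    truth = set(ground_truth)
    answer = set(recalled)
    hits = truth.intersection(answer)
    # recall: how many real canaries the agent reproduced
    recall = len(hits) / len(truth)
    # precision: how many of its listed canaries were real (no hallucinations)
    precision = len(hits) / len(answer) if answer else 0.0
    missing = sorted(truth.difference(answer))
    return recall, precision, missing

# A run shaped like Gemini's Pattern A: 14 of 19 recalled, none hallucinated
# (the all-zero hex parts are placeholders, not real canary values)
truth = [f"@@CANARY_{i:03d}_{'0' * 8}@@" for i in range(1, 20)]
recalled = truth[:3] + truth[8:]          # canaries 004-008 never arrived
recall, precision, missing = score_recall(truth, recalled)
# recall = 14/19, precision = 1.0, missing holds the five lost markers
```

The `missing` list is what points you at which sections were lost or degraded.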

&lt;h3&gt;
  
  
  Additional Tests
&lt;/h3&gt;

&lt;p&gt;Beyond canary recall, I also ran:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehension test (5 questions)&lt;/strong&gt; — drawn from the beginning, middle, and end of the context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detail accuracy test (2 questions)&lt;/strong&gt; — exact SQL constraints, function signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Preventing Re-reads
&lt;/h3&gt;

&lt;p&gt;To ensure I was measuring &lt;em&gt;memory&lt;/em&gt;, not the agent's ability to re-fetch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The canary-injected generation source was swapped back to the original, so rerunning the same command would no longer produce canaries&lt;/li&gt;
&lt;li&gt;For Patterns B/C, context files were deleted&lt;/li&gt;
&lt;li&gt;The agent was explicitly instructed to answer from memory only — no file reads, no command execution&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Caveats Up Front
&lt;/h3&gt;

&lt;p&gt;This was exploratory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;N=1&lt;/strong&gt; per condition. LLM output is non-deterministic; I'm not claiming statistical significance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single corpus&lt;/strong&gt;: one project's context (2,664 lines, ~50K tokens). Different domains or scales may behave differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single point in time&lt;/strong&gt;: specific agent/model versions as of March 2026. Updates may change results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read this as directional signal, not proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents Tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Agent / Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;claude-ultra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6 via Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;codex-ultra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI via Codex CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemini-ultra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro Preview via Gemini CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Canary Recall
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     Pattern A          Pattern B          Pattern C
                    (shell command)    (single file)      (5-file split)
Claude  |################### 19/19 |################### 19/19 |################### 19/19
Codex   |################### 19/19 |################### 19/19 |################### 19/19
Gemini  |##############_____ 14/19 |################### 19/19 |################### 19/19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude and Codex recalled all 19 canaries across every pattern. Gemini scored 19/19 on Patterns B and C, but &lt;strong&gt;dropped to 14/19 on Pattern A&lt;/strong&gt; (shell command execution).&lt;/p&gt;

&lt;p&gt;Comprehension and detail-accuracy tests: &lt;strong&gt;all agents passed all questions across all patterns&lt;/strong&gt; — including Gemini on Pattern A.&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Scores
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-ultra&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-ultra&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-ultra&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex-ultra&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex-ultra&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codex-ultra&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-ultra&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14/19 (74%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14/14 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-ultra&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-ultra&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;19/19&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;td&gt;19/19 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the precision column for Gemini Pattern A: 14/14 (100%). Every canary it listed was correct. It didn't hallucinate any — it simply never received the missing five.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened in Gemini Pattern A
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tool-Level Truncation, Not Memory Failure
&lt;/h3&gt;

&lt;p&gt;When Gemini executed the shell command, the output exceeded the CLI tool's size limit. The middle section was truncated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... [56,330 characters omitted] ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This physically removed canaries 004 through 008. They never reached the model.&lt;/p&gt;

&lt;p&gt;Gemini's own API stats confirm it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;API calls&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A (shell command)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;22,438&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B (single file)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;58,352&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C (5-file split)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;51,778&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pattern A delivered less than half the input tokens of the other patterns. The context simply didn't arrive.&lt;/p&gt;

&lt;p&gt;Gemini was honest about it, too — it reported that the output had been truncated and that it could only list what it actually read. No hallucinated canaries, no guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Each Agent Handled Large Output
&lt;/h3&gt;

&lt;p&gt;The most interesting secondary finding was how each agent &lt;em&gt;behaved&lt;/em&gt; when the shell command produced oversized output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Behavior on Pattern A&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detected oversized output → autonomously saved to file → read in 500-line chunks&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Successfully captured full output directly&lt;/td&gt;
&lt;td&gt;(not measured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accepted truncated output → answered honestly from what it received&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude's self-recovery is notable: nobody instructed it to save the output to a file and re-read it. The agent chose that strategy on its own when it detected the output was too large for a single tool response.&lt;/p&gt;

&lt;p&gt;For reference, here are Claude's tool-call counts across all patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;th&gt;Read strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Bash → save to file → 6 chunked reads (500 lines each)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;7 chunked reads of ctx_full.md (500 lines each, due to 2000-line read limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5 files, one read each (~530 lines/file, within limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Claude, Pattern C required the fewest tool calls of the three patterns. At this context size the efficiency gain was minor, but when you're pushing closer to tool-read limits, pre-split files reduce round-trips and the risk of hitting per-call size caps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. At this scale, delivery method doesn't matter — if the context arrives intact
&lt;/h3&gt;

&lt;p&gt;With 2,664 lines (~50K tokens) against context windows of 128K–1M tokens, there was plenty of headroom. Claude and Codex retained everything regardless of how context was delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When context window capacity isn't the bottleneck, the delivery method doesn't measurably affect retention.&lt;/strong&gt; Note that this experiment measured recall of canary markers and answers to comprehension questions; it did not verify whether split-file versus single-file loading affects the model's internal processing in other ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The only failure was in the toolchain, not the model
&lt;/h3&gt;

&lt;p&gt;Gemini's 14/19 wasn't a memory problem. It was a plumbing problem. The context was truncated before it ever reached the model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Context retention failures can happen in the toolchain's data path, not in the model's attention mechanism.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was the experiment's single most actionable finding. When an agent seems to "forget" part of your context, check whether the context actually made it through the tool pipeline before blaming the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. File reads are safer than shell command output
&lt;/h3&gt;

&lt;p&gt;Shell command output is unpredictable in size. It can exceed tool limits, get truncated, or behave differently across CLI implementations. File reads let you verify the content and size beforehand, and splitting is trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Do this&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use file reads, not shell commands, to load context into agent sessions — avoids truncation risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Should do&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split files to 500–1,000 lines — stays within typical tool read limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nice to have&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embed canary strings in your context — gives you an objective QA check on session quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
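&lt;p&gt;The "Should do" row is easy to automate. Here is a splitting sketch — the 500-line chunk size and the &lt;code&gt;ctx_part_NN.md&lt;/code&gt; naming are illustrative, not prescriptive:&lt;/p&gt;

```python
from pathlib import Path

def split_context(src="ctx_full.md", lines_per_file=500):
    """Split one context file into numbered parts that stay under tool read limits."""
    lines = Path(src).read_text().splitlines(keepends=True)
    parts = []
    for n, start in enumerate(range(0, len(lines), lines_per_file), start=1):
        part = Path(f"ctx_part_{n:02d}.md")
        part.write_text("".join(lines[start:start + lines_per_file]))
        parts.append(part.name)
    return parts
```

Splitting at section boundaries rather than raw line counts works too, as long as each part stays under your agent's per-read limit.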

&lt;h2&gt;
  
  
  How Canary Strings Work
&lt;/h2&gt;

&lt;p&gt;If you want to try this yourself, the core idea is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_canary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a unique canary from section content hash.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;hex_part&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@@CANARY_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hex_part&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@@&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert one at each section boundary in your context file. After the agent loads the context, ask it to list every canary it found. Compare against your ground truth.&lt;/p&gt;
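&lt;p&gt;A minimal injection pass might look like the sketch below. It reuses &lt;code&gt;make_canary&lt;/code&gt; and assumes, as in the excerpt earlier, that sections are separated by &lt;code&gt;---&lt;/code&gt; dividers — adapt the separator to your own format:&lt;/p&gt;

```python
import hashlib

def make_canary(index, content):
    """Generate a unique canary from section content hash."""
    hex_part = hashlib.sha256(content.encode()).hexdigest()[:8]
    return f"@@CANARY_{index:03d}_{hex_part}@@"

def inject_canaries(text, separator="\n---\n"):
    """Prefix each section with its canary and append one trailing marker."""
    sections = text.split(separator)
    marked = [
        f"{make_canary(i, section)}\n{section}"
        for i, section in enumerate(sections, start=1)
    ]
    trailing = make_canary(len(sections) + 1, text)
    return separator.join(marked) + f"\n\n{trailing}\n"
```

Save the generated canary list at injection time — that list is your ground truth for the recall question later.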

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger contexts (100K+ tokens)&lt;/strong&gt;: At the edge of context windows, delivery method differences might actually surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple trials (N=5+)&lt;/strong&gt;: Statistical confidence instead of directional signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systematic study of agent recovery behavior&lt;/strong&gt;: Claude's autonomous fallback to file-based reading was unexpected and worth exploring across more conditions.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This experiment was run in March 2026 using Claude Code (Opus 4.6), Codex CLI (OpenAI), and Gemini CLI (3.1 Pro Preview). Results reflect those specific versions and may change with updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>codex</category>
      <category>gemini</category>
    </item>
    <item>
      <title>I Stopped Repeating Myself to Every AI Agent — The 'Base Session' Pattern</title>
      <dc:creator>mkxultra</dc:creator>
      <pubDate>Tue, 10 Mar 2026 09:21:16 +0000</pubDate>
      <link>https://forem.com/mkxultra/i-stopped-repeating-myself-to-every-ai-agent-the-base-session-pattern-50d5</link>
      <guid>https://forem.com/mkxultra/i-stopped-repeating-myself-to-every-ai-agent-the-base-session-pattern-50d5</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;base session&lt;/strong&gt; is an LLM agent session pre-loaded with your project context, saved by &lt;code&gt;session_id&lt;/code&gt;, and resumed as many times as you need.&lt;/li&gt;
&lt;li&gt;It works across Claude Code, Codex CLI, and Gemini CLI — same file, same prompt, any agent.&lt;/li&gt;
&lt;li&gt;Separate &lt;strong&gt;behavior&lt;/strong&gt; (agent-specific rule files) from &lt;strong&gt;knowledge&lt;/strong&gt; (shared project context). That separation is the key to scaling multi-agent workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem: Repeating Yourself Three Times
&lt;/h2&gt;

&lt;p&gt;I use three coding agents daily — Claude Code, Codex CLI, and Gemini CLI. At some point I noticed a pattern: &lt;strong&gt;I was giving the same explanation over and over.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"This project uses an append-only database. The design docs are in &lt;code&gt;docs/design/&lt;/code&gt;. The build system works like this…"&lt;/p&gt;

&lt;p&gt;Every new task, every agent, the same preamble. My design docs are around 4,000 lines (~50K tokens). Reading them in takes about two minutes per agent. Three agents means six minutes of context loading before any real work starts — and that's just for one task.&lt;/p&gt;

&lt;p&gt;On top of that, each agent has its own config format. &lt;code&gt;CLAUDE.md&lt;/code&gt; is Claude-only, &lt;code&gt;.cursorrules&lt;/code&gt; is Cursor-only, and so on. Maintaining the same project knowledge across different formats is tedious and error-prone.&lt;/p&gt;

&lt;p&gt;I needed a way to &lt;strong&gt;load context once and reuse it everywhere&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Base Session?
&lt;/h2&gt;

&lt;p&gt;A base session is simple: you feed your project context to an agent, record the &lt;code&gt;session_id&lt;/code&gt;, and resume from that session whenever you need to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────┐
│  Phase 1: Build the Base Session     │
│                                      │
│  "Read ctx_full.md and understand    │
│   this project."                     │
│  → Agent loads context               │
│  → You record the session_id         │
└──────────────┬───────────────────────┘
               │ session_id
               ▼
┌──────────────────────────────────────┐
│  Phase 2–N: Work (as many times      │
│  as you need)                        │
│                                      │
│  Resume session by session_id        │
│  "Implement the delete feature"      │
│  "Write tests for the sync module"   │
│  "Review this PR"                    │
│  → Agent already understands the     │
│    project                           │
└──────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You build once. You resume many times. The agent starts every task already understanding your project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Helps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. You Stop Paying the Setup Tax
&lt;/h3&gt;

&lt;p&gt;Without a base session, every task begins with context loading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task A → load design docs (~2 min, 50K input tokens) → work
Task B → load design docs (~2 min, 50K input tokens) → work
Task C → load design docs (~2 min, 50K input tokens) → work
─────────────────────────────────────────────
Total: ~6 min loading, 150K input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a base session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build base session (~2 min, 50K input tokens) → record session_id

Task A → resume session (seconds, cache hit) → work
Task B → resume session (seconds, cache hit) → work
Task C → resume session (seconds, cache hit) → work
─────────────────────────────────────────────
Total: ~2 min loading, 50K tokens + cached reads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you resume a session, the prior context can benefit from the provider's &lt;strong&gt;prompt cache&lt;/strong&gt;. For example, with the Anthropic API, cached input costs 1/10 of fresh input — up to a 90% reduction. Other providers have their own caching mechanisms with varying savings. Either way, the more tasks you run from the same base, the more the savings compound.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_session ─┬─ implementation task A
              ├─ implementation task B
              └─ review task C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
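&lt;p&gt;As a back-of-the-envelope sketch of that compounding (the numbers are illustrative; the 1/10 multiplier is the Anthropic cached-input rate mentioned above, and real bills depend on cache TTLs and hit rates):&lt;/p&gt;

```python
def effective_input_tokens(tasks, context_tokens=50_000, cache_discount=0.1):
    """Cost-equivalent input tokens: reload-per-task vs. one base session with cached resumes."""
    reload_per_task = tasks * context_tokens                  # fresh load every time
    base_session = context_tokens + tasks * context_tokens * cache_discount
    return reload_per_task, base_session

# Three tasks over a ~50K-token context:
reload_cost, base_cost = effective_input_tokens(3)
# reload_cost = 150_000 vs. base_cost = 65_000 token-equivalents
```

The gap widens with every additional task resumed from the same base.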



&lt;h3&gt;
  
  
  2. One Context File, Any Agent
&lt;/h3&gt;

&lt;p&gt;Instead of maintaining separate config files for each agent (e.g. &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude, &lt;code&gt;.cursorrules&lt;/code&gt; for Cursor), you write &lt;strong&gt;one plain Markdown file&lt;/strong&gt; and feed it to any agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ctx_full.md  ──→  Claude:  "Read ctx_full.md" → session_id
             ──→  Codex:   "Read ctx_full.md" → session_id
             ──→  Gemini:  "Read ctx_full.md" → session_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same file. Same instruction. No format conversion.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Behavior vs. Knowledge — A Clean Separation
&lt;/h3&gt;

&lt;p&gt;This doesn't replace agent-specific config files. It complements them. The key insight is that they serve different roles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Agent-specific config (CLAUDE.md, etc.)&lt;/th&gt;
&lt;th&gt;Base session&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Behavior — rules, style, conventions&lt;/td&gt;
&lt;td&gt;Knowledge — design, structure, context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One specific agent&lt;/td&gt;
&lt;td&gt;Any agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Updated when&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Project conventions change&lt;/td&gt;
&lt;td&gt;Design docs change / new task cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stored as&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File in the repo&lt;/td&gt;
&lt;td&gt;Session (referenced by session_id)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Use conventional commits"&lt;/td&gt;
&lt;td&gt;"This DB is append-only with tombstone deletes…"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rules go in config files. Knowledge goes in base sessions.&lt;/strong&gt; When you switch agents, the rules change but the knowledge stays the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. It Fits Emerging Agent-in-Agent Workflows
&lt;/h3&gt;

&lt;p&gt;A pattern that's becoming more realistic is an orchestrator agent spawning child agents for parallel implementation, review, and testing. Each child needs the project context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orchestrator
  ├─ Agent A (implement)  ← needs project context
  ├─ Agent B (review)     ← needs the same context
  └─ Agent C (test)       ← needs the same context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without base sessions, that's three full context loads every time. If you pre-build a base session per agent, each child can resume from its own &lt;code&gt;session_id&lt;/code&gt; and start working immediately. The initial construction cost pays for itself once you start running the same agents repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Build One
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Prepare your context file
&lt;/h3&gt;

&lt;p&gt;Gather the project knowledge you want every agent to have. Plain Markdown works best — it's readable by any agent and easy to maintain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Concatenate design docs&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;docs/design/&lt;span class="k"&gt;*&lt;/span&gt;.md &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ctx_full.md

&lt;span class="c"&gt;# Or use a tool to generate a project summary&lt;/span&gt;
&lt;span class="c"&gt;# (whatever fits your workflow)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Keep it focused. Don't dump your entire codebase. Include design decisions, data models, key APIs, and conventions. In my experience, a well-curated context file works better than a massive one — agents lose signal in noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Load context and record the session
&lt;/h3&gt;

&lt;p&gt;Feed the context file to each agent you use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--output-format&lt;/span&gt; stream-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Read ctx_full.md and understand this project's architecture."&lt;/span&gt;
&lt;span class="c"&gt;# Extract session_id from JSON output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--full-auto&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Read ctx_full.md and understand this project's architecture."&lt;/span&gt;
&lt;span class="c"&gt;# Extract session_id from JSON output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini &lt;span class="nt"&gt;--output-format&lt;/span&gt; json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Read ctx_full.md and understand this project's architecture."&lt;/span&gt;
&lt;span class="c"&gt;# Extract session_id from JSON output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; CLI option names vary by version. Check &lt;code&gt;--help&lt;/code&gt; for your installed version.&lt;/p&gt;
&lt;/blockquote&gt;
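&lt;p&gt;The "extract session_id" comments above can be scripted with &lt;code&gt;jq&lt;/code&gt;. A minimal sketch follows; the field name &lt;code&gt;session_id&lt;/code&gt; and the sample JSON are assumptions, so inspect your CLI's actual output schema first (it varies by tool and version, and stream-JSON output may emit several JSON lines per run):&lt;/p&gt;

```shell
# Sketch: capture the session_id from an agent's JSON output with jq.
# The "session_id" field name and the sample payload are assumptions --
# check the real output of your installed CLI version.
json='{"type":"system","session_id":"abc-123"}'  # stand-in for real agent output
session_id=$(printf '%s' "$json" | jq -r '.session_id')
echo "recorded session: $session_id"
```

&lt;p&gt;Store the captured value wherever your workflow expects it (an env var, a dotfile, a task runner variable) so the resume step in the next section can find it.&lt;/p&gt;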

&lt;h3&gt;
  
  
  Step 3: Resume and work
&lt;/h3&gt;

&lt;p&gt;Use the recorded &lt;code&gt;session_id&lt;/code&gt; to pick up where you left off. The agent already knows your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude (--fork-session keeps the base session clean)&lt;/span&gt;
claude &lt;span class="nt"&gt;-r&lt;/span&gt; &amp;lt;session_id&amp;gt; &lt;span class="nt"&gt;--fork-session&lt;/span&gt;

&lt;span class="c"&gt;# Codex&lt;/span&gt;
codex resume &amp;lt;session_id&amp;gt;

&lt;span class="c"&gt;# Gemini&lt;/span&gt;
gemini &lt;span class="nt"&gt;-r&lt;/span&gt; &amp;lt;session_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Non-interactive (scripting / automation):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude (--fork-session keeps the base session clean)&lt;/span&gt;
claude &lt;span class="nt"&gt;-r&lt;/span&gt; &amp;lt;session_id&amp;gt; &lt;span class="nt"&gt;--fork-session&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-format&lt;/span&gt; stream-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Implement the delete feature."&lt;/span&gt;

&lt;span class="c"&gt;# Codex&lt;/span&gt;
codex &lt;span class="nb"&gt;exec &lt;/span&gt;resume &amp;lt;session_id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--full-auto&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Implement the delete feature."&lt;/span&gt;

&lt;span class="c"&gt;# Gemini&lt;/span&gt;
gemini &lt;span class="nt"&gt;-r&lt;/span&gt; &amp;lt;session_id&amp;gt; &lt;span class="nt"&gt;--output-format&lt;/span&gt; json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Implement the delete feature."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fork-session&lt;/code&gt; flag (Claude) is worth highlighting: it branches from the base session instead of appending to it, so your base stays clean for the next task. Other agents may handle session branching differently — check their docs for equivalent options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You:    Implement the delete feature.
Agent:  Based on the design docs, this uses a physical delete
        plus tombstone hybrid. The raw_deletions table...
        (starts working with full project understanding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Gotchas
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Session lifetime&lt;/th&gt;
&lt;th&gt;Watch out for&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;Shorter&lt;/td&gt;
&lt;td&gt;Context window fills up during long tasks&lt;/td&gt;
&lt;td&gt;Rebuild the session when it gets too long. Keep the base session itself lean.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;Longer&lt;/td&gt;
&lt;td&gt;Sessions can expire after extended inactivity&lt;/td&gt;
&lt;td&gt;Rebuild when expired. Record the session_id under a dated label (e.g. &lt;code&gt;base-2026-03-10&lt;/code&gt;) so it's easy to tell when it's stale.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Longer&lt;/td&gt;
&lt;td&gt;May serve cached (stale) file contents&lt;/td&gt;
&lt;td&gt;Explicitly instruct the agent to re-read the file if you've updated it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
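&lt;p&gt;The dated-label mitigation from the table can be as simple as a filename convention. A sketch, assuming you've already captured a session_id (the &lt;code&gt;abc-123&lt;/code&gt; placeholder and the filename pattern are illustrative, not anything the CLIs require):&lt;/p&gt;

```shell
# Sketch: store the base session_id under a dated filename so staleness
# is visible at a glance. Filenames here are illustrative.
session_id="abc-123"       # placeholder for a real recorded id
stamp=$(date +%F)          # ISO date, e.g. 2026-03-10
printf '%s\n' "$session_id" > "base-session-$stamp.txt"
ls base-session-*.txt
```

&lt;p&gt;When the newest file is older than your last design-doc change, you know it's time to rebuild.&lt;/p&gt;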

&lt;p&gt;&lt;strong&gt;Common mistakes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stale context&lt;/strong&gt; — You update the design docs but forget to rebuild the base session. The agent works from outdated knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polluted sessions&lt;/strong&gt; — You keep working directly in the base session instead of forking. The next resume inherits unrelated task artifacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context overload&lt;/strong&gt; — You try to load everything. The agent's performance degrades. Curate what matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Gotchas observed as of March 2026. Agent capabilities evolve quickly.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;base session&lt;/strong&gt; loads project context once and lets you resume by &lt;code&gt;session_id&lt;/code&gt; as many times as needed.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;reusable&lt;/strong&gt;: one load, many resumes. Prompt caching can reduce cost on each reuse.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;agent-agnostic&lt;/strong&gt;: the same Markdown file and the same prompt work for Claude, Codex, and Gemini.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rules live in config files. Knowledge lives in base sessions.&lt;/strong&gt; This separation is what makes multi-agent workflows manageable.&lt;/li&gt;
&lt;li&gt;In emerging &lt;strong&gt;Agent-in-Agent workflows&lt;/strong&gt;, pre-built base sessions let child agents skip the context-loading bottleneck and start working immediately.&lt;/li&gt;
&lt;/ul&gt;
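&lt;p&gt;Since the workflow is agent-agnostic, the per-agent resume commands can be folded into one helper. This is a sketch under the flag names shown earlier; CLI options vary by version (check &lt;code&gt;--help&lt;/code&gt;), and it assumes &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;codex&lt;/code&gt;, and &lt;code&gt;gemini&lt;/code&gt; are on your PATH:&lt;/p&gt;

```shell
# Sketch: one resume entry point across agents. Flag names follow the
# examples above but vary by CLI version -- verify with --help.
resume_base() {
  agent=$1; sid=$2; shift 2
  case "$agent" in
    claude) claude -r "$sid" --fork-session "$@" ;;
    codex)  codex resume "$sid" "$@" ;;
    gemini) gemini -r "$sid" "$@" ;;
    *) echo "unknown agent: $agent" >&2; return 1 ;;
  esac
}

# Usage: resume_base claude abc-123 -p "Implement the delete feature."
```

&lt;p&gt;A wrapper like this also makes it trivial to fan the same task out to all three agents from a script.&lt;/p&gt;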

&lt;p&gt;If you're juggling multiple AI coding agents and tired of repeating yourself, try building a base session. It's a small workflow change that compounds over time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;To automate this workflow, I built &lt;a href="https://github.com/mkXultra/ai-cli-mcp" rel="noopener noreferrer"&gt;ai-cli-mcp&lt;/a&gt; — an MCP server that lets you operate Claude, Codex, and Gemini through a single interface: &lt;code&gt;run(model, prompt, session_id)&lt;/code&gt; to start or resume any agent, and &lt;code&gt;wait(pids)&lt;/code&gt; to collect results from multiple agents in parallel. Handy for scripting base session construction or Agent-in-Agent orchestration.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codex</category>
      <category>claudecode</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
