<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Scott Bishop</title>
    <description>The latest articles on Forem by Scott Bishop (@huotchu).</description>
    <link>https://forem.com/huotchu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843444%2F3fbc87d9-e02f-4821-ab33-96ca51ca29ec.jpg</url>
      <title>Forem: Scott Bishop</title>
      <link>https://forem.com/huotchu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/huotchu"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Has a Memory Problem (And So Do You)</title>
      <dc:creator>Scott Bishop</dc:creator>
      <pubDate>Tue, 31 Mar 2026 15:01:46 +0000</pubDate>
      <link>https://forem.com/huotchu/your-ai-agent-has-a-memory-problem-and-so-do-you-3793</link>
      <guid>https://forem.com/huotchu/your-ai-agent-has-a-memory-problem-and-so-do-you-3793</guid>
      <description>&lt;p&gt;You've been there. Hour three of a session. Your AI agent was sharp at 9 AM, nailing file edits, remembering your architecture decisions, following your naming conventions. Now it's suggesting an approach you rejected forty minutes ago. It's re-reading files it already read. It just called a function with the wrong signature, one it wrote correctly two hours earlier.&lt;/p&gt;

&lt;p&gt;You think: the model is getting dumber.&lt;/p&gt;

&lt;p&gt;It isn't. You have a memory leak.&lt;/p&gt;

&lt;p&gt;Every LLM-based agent operates inside a fixed context window. The tokens are cheap now, but attention is still scarce. A bigger window didn't eliminate degradation. It moved the failure mode. In 2024, agents broke because they ran out of room. In 2026, agents break because they're drowning in noise.&lt;/p&gt;

&lt;p&gt;If you've written C or managed memory pools, you already know the playbook. If you haven't, the short version: a fixed memory system needs deliberate allocation, active cleanup, and a human who treats the buffer like the finite resource it is. I've been running a production project through 95 AI-assisted sessions. Here's what I learned about keeping the context clean.&lt;/p&gt;

&lt;h2&gt;Budget Your Memory Pools Before You Start&lt;/h2&gt;

&lt;p&gt;Every modern AI agent pre-allocates context at session start. Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt;. Gemini CLI loads &lt;code&gt;GEMINI.md&lt;/code&gt;. Cursor loads &lt;code&gt;.cursorrules&lt;/code&gt;. These are reserved memory pools carved out of the total budget before you type a single prompt.&lt;/p&gt;

&lt;p&gt;On my project, I learned this lesson the hard way. Multiple documents were loading every session: a context handoff, an execution plan, a mission statement, and other files that had accumulated in the project root without anyone asking whether they needed to be there. None of them had size limits. They just grew. By the time I noticed the quality degrading and actually checked, the handoff alone had grown past 100KB, with a massive execution plan on top of that. Add the system prompt, tool definitions, and memory system, and more than half the context window was gone before I'd typed a single prompt.&lt;/p&gt;

&lt;p&gt;The fix was a cleanup and a restructuring. Files that didn't need to load every session got moved out of the root and into a docs directory with an index. The execution plan got split into three files: near-term (active work), future (not yet relevant), and archive (completed steps that would never be touched again). The handoff got summarized and trimmed. Then I set the caps that should have existed from the start, based on what left enough room to actually get work done:&lt;/p&gt;

&lt;p&gt;Tier 1 loads every session. Hard cap at 50KB, with a trim trigger at 40KB. When the handoff document crosses 40KB, I archive sections immediately. When it crosses 50KB, something has already gone wrong.&lt;/p&gt;

&lt;p&gt;Tier 2 loads on demand. Specs, strategy docs, design decisions. These only enter context when the current task requires them.&lt;/p&gt;

&lt;p&gt;Tier 3 is archival. Session histories, completed plans. These almost never load.&lt;/p&gt;

&lt;p&gt;Every token spent reading "how we got here" is a token unavailable for "what we're building now." The completed steps in my execution plan would never be touched again, yet they were consuming attention every session. After the split and the cleanup, the active working set dropped dramatically. That's real attention reclaimed for actual development.&lt;/p&gt;

&lt;p&gt;This isn't just a suggestion. I've watched the quality degrade in real time when Tier 1 documents bloat past their budget. The agent starts hedging, restating things, losing track of decisions made earlier in the session. The arena got too small for the work.&lt;/p&gt;

&lt;h2&gt;Your Persistent Memory Is an Instruction Cache&lt;/h2&gt;

&lt;p&gt;Your &lt;code&gt;CLAUDE.md&lt;/code&gt;, your &lt;code&gt;.cursorrules&lt;/code&gt;, your system prompt. These function like an instruction cache: they load on every cold start and define how the agent behaves. The tradeoff is the same one every embedded developer knows. A bigger instruction cache means a smaller data cache.&lt;/p&gt;

&lt;p&gt;I keep 21 persistent memory edits that load automatically at zero tool-call cost. File editing rules, writing style conventions, commit message patterns, TDD discipline, seed script usage, the project base directory. Each one prevents a class of recurring mistakes. The file editing rules alone encode three separate bugs I discovered through painful experience: the ghost &lt;code&gt;{}&lt;/code&gt; files that appear when you use the wrong tool to create new files, the silent corruption caused by literal &lt;code&gt;$&lt;/code&gt; characters in markdown files, and the doubled context cost of copying files into the AI's environment, editing them there, then copying them back.&lt;/p&gt;

&lt;p&gt;These 21 items are dense and well-structured. They cover what the agent needs right now, not everything it might ever need. Every item I add pushes something else further from the agent's attention. I've pruned items that were too verbose, consolidated items that overlapped, and removed items that addressed problems we solved structurally.&lt;/p&gt;

&lt;p&gt;The sweet spot is a tightly curated instruction set. If you have 200 lines of rules in your &lt;code&gt;CLAUDE.md&lt;/code&gt; that you haven't reviewed in a month, you have stale cache. Audit it. Trim it. Your agent is reading every line on every turn.&lt;/p&gt;

&lt;h2&gt;Garbage Collection: Don't Wait for the Crash&lt;/h2&gt;

&lt;p&gt;When context fills, something has to give. Claude Code runs auto-compaction at roughly 83.5% of window capacity, summarizing older conversation to reclaim tokens. Gemini CLI offers &lt;code&gt;/compact&lt;/code&gt; for manual garbage collection.&lt;/p&gt;

&lt;p&gt;I don't use either. I run a proactive approach. When context usage crosses roughly 75%, the session gets a warning. This isn't cosmetic. It means: stop starting new tasks, update the handoff documents with current state, save any decisions or context that the next session will need, and wrap cleanly.&lt;/p&gt;

&lt;p&gt;The reason is simple. Every time I've pushed past that threshold, the output quality has fallen off a cliff. The agent gets lazy, skips steps, hallucinates details it would have gotten right an hour earlier.&lt;/p&gt;

&lt;p&gt;Session hygiene matters here. All context updates happen at the end of the current session while the context is still intact. I never defer doc updates to the next session. If the handoff document doesn't reflect what just happened, the next session starts with stale memory.&lt;/p&gt;

&lt;h2&gt;Named Documents Beat Chat History&lt;/h2&gt;

&lt;p&gt;The biggest friction point in long-running AI projects is drift. You agree on a direction, then three prompts later the model is hallucinating a different architecture. Chat history is a terrible source of truth because it's littered with abandoned approaches, resolved debates, and superseded decisions.&lt;/p&gt;

&lt;p&gt;I stopped relying on chat history entirely. Every major step becomes a named plan document: a technical specification with file paths, test requirements, and acceptance criteria. The agent doesn't "remember" our previous session. It reads a document that tells it exactly where things stand.&lt;/p&gt;

&lt;p&gt;The handoff document is now at version 60 across 95 sessions. It's been rewritten, trimmed, restructured, and archived for every session that produced any kind of artifact. It's always current. When a new session reads it, the agent knows the project state, the tech stack, the open decisions, and the next task. We're not chatting. We're executing against a versioned document.&lt;/p&gt;

&lt;p&gt;This is pointer arithmetic instead of memory scanning. When the agent needs to know something, there's a direct address for it, not a search through 200K tokens of conversation history hoping the relevant passage gets enough attention weight.&lt;/p&gt;

&lt;h2&gt;Fragmentation: The Invisible Performance Killer&lt;/h2&gt;

&lt;p&gt;After an hour of work, your context is littered with resolved bugs, error messages from fixed issues, file reads from modules you've moved on from, and tool call artifacts that served their purpose thirty minutes ago. All of it is still consuming attention.&lt;/p&gt;

&lt;p&gt;The signal-to-noise ratio matters more than the total token count. A session with 50K tokens of focused, relevant context will outperform a session with 200K tokens where only 30% is still relevant.&lt;/p&gt;

&lt;p&gt;I hit this pattern repeatedly: the agent re-reads a file because the earlier read is buried under newer context. It re-derives a conclusion it already reached. It asks me to confirm something I confirmed twenty messages ago. Responses get longer but less useful. The agent hedges, qualifies, restates. It's not thinking harder. It's thrashing.&lt;/p&gt;

&lt;p&gt;The fix is session discipline. When a task is done, start a new session. I know it feels wasteful to close a session that still has "room." The room is full of noise. A clean start with a current handoff document gives you 100% signal. A continued session with 70% noise gives you 30% signal in a bigger window. The math isn't close.&lt;/p&gt;

&lt;h2&gt;The Practices That Actually Work&lt;/h2&gt;

&lt;p&gt;These are the specific things I do every day that keep sessions productive. None of them are theoretical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Give direct addresses, not search queries.&lt;/strong&gt; When I say "fix the bug in &lt;code&gt;src/stage2-analyze/security/map-to-contract.mjs&lt;/code&gt;, the &lt;code&gt;determineOverallStatus&lt;/code&gt; function is returning FAIL instead of SKIPPED when the findings array is empty," the agent goes straight to the problem. No search, no exploration, no wasted tokens on file discovery. Five seconds of typing keeps the session clean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop bad responses immediately.&lt;/strong&gt; When the agent starts heading in the wrong direction, I stop it within the first few sentences. Every token of a bad response stays in context permanently, and the model has to reconcile it against the correction. A 2,000 token wrong answer followed by "no, that's not what I asked" is 2,500 tokens of pollution. Catching it at 50 tokens is 100 tokens of pollution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feed tasks one at a time.&lt;/strong&gt; Don't paste a 15-item list and say "work through this." Every item you won't reach for an hour is consuming attention right now. Present the next task when the current one is done. When you hit the session budget, you've completed a few items cleanly rather than fifteen poorly. The same applies in reverse: your agent will want to hand you many things to work on at once. Push back and tell it to give you one item at a time. If you have context left when you finish it, load the next one then.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do the simple things yourself.&lt;/strong&gt; Moving a file, renaming a directory, creating a folder, running a quick git command. Each one costs zero tokens in your file explorer and costs a tool call plus confirmation in the AI session. I handle most git operations, tagging, and publishing myself. Every mechanical action the AI does is attention taken away from reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit your instruction cache.&lt;/strong&gt; I review my memory edits and handoff documents and prune what's stale. Items that addressed problems we've since solved structurally get removed. Items that are too verbose get condensed. The AI won't do this for you. It will happily load a bloated execution plan every session and never flag it. The only reason mine got split and trimmed was because I checked the file sizes and said "this is too much." That kind of maintenance is on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never defer the handoff.&lt;/strong&gt; At the end of every session, before closing, the handoff document gets updated. The execution plan gets updated. Memory edits get reviewed. This happens while the context is still intact and the agent still has full recall of what just happened. If you defer it to "next time," the next session starts with yesterday's state and today's problems.&lt;/p&gt;

&lt;h2&gt;Infinite Context Is Just Infinite Noise&lt;/h2&gt;

&lt;p&gt;Context windows will keep growing. Pricing will keep falling. More memory has never eliminated the need for memory management. A 64KB microcontroller and a 128GB server both need allocation strategies. The scale changes. The discipline doesn't.&lt;/p&gt;

&lt;p&gt;Your agent isn't getting dumber. Your context is getting dirty. Clean it up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Scott Bishop is the founder of &lt;a href="https://fidensa.com" rel="noopener noreferrer"&gt;Fidensa&lt;/a&gt;, an independent certification authority for AI capabilities. He's spent 30 years building regulated, high-trust systems across international finance, federal government, and the Fortune 500.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Anthropic's Reference MCP Server Fails Security Audit: Why 'Copy-Paste' Infrastructure is Leaking Your Credentials</title>
      <dc:creator>Scott Bishop</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/huotchu/anthropics-reference-mcp-server-fails-security-audit-why-copy-paste-infrastructure-is-leaking-2e4g</link>
      <guid>https://forem.com/huotchu/anthropics-reference-mcp-server-fails-security-audit-why-copy-paste-infrastructure-is-leaking-2e4g</guid>
      <description>&lt;p&gt;Anthropic's reference MCP filesystem server scored 60 out of 100 in our behavioral security certification. Grade: F. Three critical blocking vulnerabilities. All credential exposure.&lt;/p&gt;

&lt;p&gt;We didn't find this with a linter. We found it by actually trying to break the server.&lt;/p&gt;

&lt;h2&gt;The Findings&lt;/h2&gt;

&lt;p&gt;The reference &lt;code&gt;filesystem&lt;/code&gt; server ships with 14 tools for reading, writing, and navigating files. Two of them failed our adversarial red-team testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 1: &lt;code&gt;edit_file&lt;/code&gt; — credential exposure via path traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we sent double-encoded traversal input (&lt;code&gt;%252e%252e%252f&lt;/code&gt;) and URL-encoded traversal input to the &lt;code&gt;edit_file&lt;/code&gt; tool, the server responded with content containing credential data. The path validation logic exists in the codebase. It did not stop our test payloads.&lt;/p&gt;
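&lt;p&gt;To see why validation logic can "exist in the codebase" and still fail, consider this sketch of the failure class. It is illustrative, not the server's actual code:&lt;/p&gt;

```python
# Sketch: why a single URL-decode pass misses double-encoded traversal.
# Illustrative only; this is not the filesystem server's actual validation.
from urllib.parse import unquote

def naive_is_safe(path):
    # Decodes once, then looks for a traversal sequence. This is the
    # failure class: the validation exists, but the guard doesn't hold.
    return ".." not in unquote(path)

def decode_fully(path, max_rounds=5):
    # Decodes until the string stops changing (a fixed point).
    for _ in range(max_rounds):
        decoded = unquote(path)
        if decoded == path:
            break
        path = decoded
    return path

payload = "%252e%252e%252fetc%252fpasswd"  # double-encoded ../etc/passwd
print(naive_is_safe(payload))   # True: the naive guard lets it through
print(decode_fully(payload))    # ../etc/passwd
```

&lt;p&gt;One decode pass turns &lt;code&gt;%252e%252e%252f&lt;/code&gt; into &lt;code&gt;%2e%2e%2f&lt;/code&gt;, which contains no literal &lt;code&gt;..&lt;/code&gt;, so the naive check passes it. Decoding to a fixed point and then canonicalizing the resolved path against the allowed root is the robust version.&lt;/p&gt;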

&lt;p&gt;&lt;strong&gt;Finding 2: &lt;code&gt;edit_file&lt;/code&gt; — second traversal vector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The same tool failed on a separate URL-encoded traversal variant. Two distinct bypass vectors, same tool, same result: credential exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 3: &lt;code&gt;read_multiple_files&lt;/code&gt; — direct credential harvesting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we asked &lt;code&gt;read_multiple_files&lt;/code&gt; to read common credential storage paths (AWS credentials, SSH config, database configs, system files containing authentication data), it complied. No restriction. No warning.&lt;/p&gt;

&lt;p&gt;All three findings were classified as critical severity under CVSS v4.0 and flagged as certification blockers in our pipeline.&lt;/p&gt;

&lt;h2&gt;Why This Matters More Than You Think&lt;/h2&gt;

&lt;p&gt;This is not a third-party server written by a hobbyist. This is &lt;code&gt;@modelcontextprotocol/server-filesystem&lt;/code&gt;, published by the Model Context Protocol project (a Linux Foundation series). It is the reference implementation that developers study, fork, and use as a template for building their own servers.&lt;/p&gt;

&lt;p&gt;Every pattern in this codebase gets copied. Including the gaps.&lt;/p&gt;

&lt;h2&gt;But Wait — Didn't Another Scanner Give This a 99?&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixwfdz6vfiphce5n4h42.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixwfdz6vfiphce5n4h42.jpg" alt=" " width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes. A recent study using static analysis scored Anthropic's official servers at 99-100 out of 100, praising their six layers of path validation.&lt;/p&gt;

&lt;p&gt;That's the problem with static analysis. It checks whether validation code exists. It does not check whether the validation code works when someone actually tries to bypass it.&lt;/p&gt;

&lt;p&gt;Our pipeline does not read the source code and check for patterns. It installs the server, starts it, connects to it over MCP, and throws adversarial payloads at every tool. Double-encoded traversal. URL-encoded traversal. Credential path harvesting. Prompt injection chains. Data exfiltration probes.&lt;/p&gt;

&lt;p&gt;Static analysis asks: "Is there a guard?"&lt;br&gt;
Behavioral testing asks: "Does the guard hold?"&lt;/p&gt;

&lt;p&gt;The guard did not hold.&lt;/p&gt;

&lt;h2&gt;The Broader Ecosystem: 50 Capabilities, 14% Critical Failure Rate&lt;/h2&gt;

&lt;p&gt;We didn't stop at one server. We ran the same seven-stage certification pipeline across 50 AI capabilities: MCP servers, skills, hooks, plugins, and rules files from across the ecosystem.&lt;/p&gt;

&lt;p&gt;7 out of 50 capabilities had critical blocking vulnerabilities. That's a 14% failure rate in the most severe category.&lt;/p&gt;

&lt;p&gt;Here's what we found beyond the filesystem server:&lt;/p&gt;

&lt;h3&gt;devin-cursorrules: API Key Harvesting at Scale&lt;/h3&gt;

&lt;p&gt;A rules file marketed as a development assistant for the Devin AI coding tool. Our adversarial analysis discovered it was reading &lt;code&gt;.env.local&lt;/code&gt;, &lt;code&gt;.env&lt;/code&gt;, and &lt;code&gt;.env.example&lt;/code&gt; files, loading credentials from six different API providers (OpenAI, Azure OpenAI, DeepSeek, Anthropic, Google, and SiliconFlow), and logging environment variable contents to stderr.&lt;/p&gt;

&lt;p&gt;It also makes undisclosed network connections to external services including a hardcoded IP address. None of this is declared in its metadata.&lt;/p&gt;

&lt;p&gt;This is not a bug. This is a design that harvests credentials by default.&lt;/p&gt;

&lt;h3&gt;everything-claude-code: 128 Security Findings&lt;/h3&gt;

&lt;p&gt;A plugin claiming to be a comprehensive Claude Code toolkit surfaced 128 security scan findings including command injection, data exfiltration, supply chain attack vectors, hardcoded secrets, and prompt injection.&lt;/p&gt;

&lt;p&gt;The security scanner flagged it across every major vulnerability category. It still has zero certification blockers only because none of the adversarial behavioral tests triggered a critical-severity exploit. The static findings alone are a red flag that would give any security team pause.&lt;/p&gt;

&lt;h3&gt;Hooks Logging Everything&lt;/h3&gt;

&lt;p&gt;Multiple hook-type capabilities were recording every tool input and output to JSON files on disk. Every bash command, every file read, every API response. If a credential passes through any tool while these hooks are active, it gets written to a plaintext log file.&lt;/p&gt;

&lt;p&gt;This creates a persistent credential capture surface. Most developers installing these hooks would never know it's happening.&lt;/p&gt;

&lt;h2&gt;The Methodology&lt;/h2&gt;

&lt;p&gt;Every capability we certify passes through a seven-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: Clone the source, identify the publisher, compute provenance hashes, enumerate tools via live MCP connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM &amp;amp; Supply Chain&lt;/strong&gt;: Generate a software bill of materials, scan for known vulnerabilities across the dependency tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Scan&lt;/strong&gt;: Static analysis for code-level security issues (command injection, data exfiltration patterns, hardcoded secrets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional Testing&lt;/strong&gt;: Does it do what it claims? Test every declared tool against its contract&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Red-Team&lt;/strong&gt;: Throw attack payloads at every tool — path traversal, prompt injection, privilege escalation, credential harvesting, data exfiltration probes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral Fingerprint&lt;/strong&gt;: Map the actual runtime behavior profile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certification&lt;/strong&gt;: Score across 8 weighted signals (adversarial 25%, provenance 20%, security scan 15%, supply chain 10%, behavioral pass rate 10%, consumer confirmation 10%, contract accuracy 6%, uptime 4%), sign the artifact with ES256, publish the behavioral contract&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The scoring methodology is based on CVSS v4.0, NIST SP 800-30, SLSA, and ISO/IEC 25010. Every number traces to a recognized framework.&lt;/p&gt;
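&lt;p&gt;The composite score itself reduces to a weighted sum of those eight signals. Here's a sketch using the weights listed above; the example sub-scores are invented for illustration:&lt;/p&gt;

```python
# Sketch: composite certification score from the 8 weighted signals.
# The weights are those listed above; the example sub-scores are invented.
WEIGHTS = {
    "adversarial": 0.25,
    "provenance": 0.20,
    "security_scan": 0.15,
    "supply_chain": 0.10,
    "behavioral_pass_rate": 0.10,
    "consumer_confirmation": 0.10,
    "contract_accuracy": 0.06,
    "uptime": 0.04,
}

def composite_score(signals):
    # Each signal is a 0-100 sub-score; the weights sum to 1.0, so the
    # composite lands on the same 0-100 scale.
    assert set(signals) == set(WEIGHTS), "all eight signals are required"
    return round(sum(WEIGHTS[k] * v for k, v in signals.items()), 1)

perfect_elsewhere = {k: 100 for k in WEIGHTS}
perfect_elsewhere["adversarial"] = 0  # fails every red-team probe
print(composite_score(perfect_elsewhere))  # 75.0
```

&lt;p&gt;This is why a capability can ace static analysis and still fail certification: the adversarial signal alone carries a quarter of the weight.&lt;/p&gt;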

&lt;p&gt;The output is not just a score. It's a signed behavioral contract: a portable artifact that documents exactly what the capability does, what it won't do, what happened when we tried to break it, and every finding with severity classification.&lt;/p&gt;

&lt;h2&gt;Immediate Mitigation&lt;/h2&gt;

&lt;p&gt;If you are running the reference &lt;code&gt;filesystem&lt;/code&gt; server (&lt;code&gt;@modelcontextprotocol/server-filesystem&lt;/code&gt;) in a production environment and have not patched the path validation logic yourself, we recommend disabling the &lt;code&gt;edit_file&lt;/code&gt; tool immediately.&lt;/p&gt;

&lt;p&gt;If you are using &lt;code&gt;devin-cursorrules&lt;/code&gt;, audit your &lt;code&gt;.env&lt;/code&gt; files and rotate any API keys that may have been logged.&lt;/p&gt;

&lt;p&gt;If you have hooks installed that you didn't write yourself, check whether they're logging tool inputs and outputs to disk.&lt;/p&gt;

&lt;h2&gt;What Comes Next&lt;/h2&gt;

&lt;p&gt;The full certifications for every capability we've completed are published at &lt;a href="https://fidensa.com/certifications" rel="noopener noreferrer"&gt;fidensa.com/certifications&lt;/a&gt;. Search the tools you're running. Check the findings. The behavioral contracts are public.&lt;/p&gt;

&lt;p&gt;The MCP ecosystem is growing faster than anyone is verifying it. There are over 17,000 MCP servers in the wild. Most have never been tested for what they actually do under adversarial conditions.&lt;/p&gt;

&lt;p&gt;We're trying to change that.&lt;/p&gt;

&lt;p&gt;If you're a publisher and want your capability certified, submissions are opening soon. If you're a developer and want to check whether the tools in your stack have been tested, start at &lt;a href="https://fidensa.com" rel="noopener noreferrer"&gt;fidensa.com&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Scott Bishop is the founder of Fidensa, an independent AI capability certification authority. He has 30 years of software development experience across the IMF, USPTO, and several Fortune 500 companies. Fidensa's certification methodology is modeled on UL's product safety certification approach: test the product, document what it does, sign the results.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your MVP Definition Is Obsolete in the Age of AI</title>
      <dc:creator>Scott Bishop</dc:creator>
      <pubDate>Wed, 25 Mar 2026 16:00:18 +0000</pubDate>
      <link>https://forem.com/huotchu/your-mvp-definition-is-obsolete-in-the-age-of-ai-3i16</link>
      <guid>https://forem.com/huotchu/your-mvp-definition-is-obsolete-in-the-age-of-ai-3i16</guid>
      <description>&lt;p&gt;Minimum viable product used to mean: &lt;strong&gt;the least we could afford to build&lt;/strong&gt;. "Minimum" never left room for things like quality or ambition. It was about headcount, budget, and runway. You shipped the least you could get away with because building was slow, expensive, and you needed to either get buy-in or fail fast.&lt;/p&gt;

&lt;p&gt;We're accustomed to living in that tension: building half of what we wanted, shipping only what we could afford. When AI-assisted IDEs were rolled out at work, I pushed hard to make them central to my workflow. They didn’t hold up. The outputs looked plausible but failed in subtle ways: poor architectural decisions, brittle logic. No experienced developer would ship that code. I experimented for months, and model after model failed to produce consistently good code. As I reviewed every line of output, my skepticism hardened into confirmed disenchantment.&lt;/p&gt;

&lt;p&gt;It wasn't until I gained access to Anthropic's Opus 4.6, a model that could actually sustain a complex project, &lt;strong&gt;that everything changed&lt;/strong&gt;. At eight times the cost of the SWE 1.5 model, I was burning through a week's worth of credits in a day, which created a new problem: the code was finally worth keeping, but producing it was cost-prohibitive in that environment. Now that a truly capable model existed, I had to master this technology, so I purchased my own access and started building at night and on weekends.&lt;/p&gt;

&lt;p&gt;Since early March, I’ve been using AI as a true collaborator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In ~2.5 weeks, I built:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A production-grade platform (not a prototype)&lt;/li&gt;
&lt;li&gt;A six-stage verification pipeline&lt;/li&gt;
&lt;li&gt;Cryptographically signed contracts&lt;/li&gt;
&lt;li&gt;API endpoints + CI/CD integration&lt;/li&gt;
&lt;li&gt;580+ tests with 94% coverage&lt;/li&gt;
&lt;li&gt;50 published certifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What would have taken a team of five engineers six months was completed by one developer with AI.&lt;/p&gt;

&lt;p&gt;That experience forced me to rethink something fundamental. If the time constraint is gone, what does "minimum" actually mean?&lt;/p&gt;

&lt;h2&gt;Minimum Is No Longer a Time Constraint&lt;/h2&gt;

&lt;p&gt;The old MVP math was simple. You have X developers, Y months, and Z dollars. The features you can build within those constraints define your minimum. Cut scope to learn faster. Ship incomplete things because complete things take too long.&lt;/p&gt;

&lt;p&gt;AI changed the cost curve. A handful of insightful generalists collaborating with AI can accomplish in weeks what used to take teams months or years.&lt;/p&gt;

&lt;p&gt;The implementation bottleneck that defined MVP for two decades is disappearing.&lt;/p&gt;

&lt;p&gt;What I’ve seen instead is that many teams are using AI to ship faster, but not better. The result is more half-finished products, just at higher velocity. That's the old MVP mindset applied to new tools. Same compromises, just quicker. We're automating the creation of legacy code.&lt;/p&gt;

&lt;h2&gt;Minimum Is Every Feature Required to Prove Your Thesis&lt;/h2&gt;

&lt;p&gt;Wait. You have a thesis, don't you?&lt;/p&gt;

&lt;p&gt;That's not in the agile handbook. Scrum doesn't ask you what you're trying to prove. It asks you what fits in a sprint.&lt;/p&gt;

&lt;p&gt;When time stops being the constraint, the question changes entirely. Instead of "what can we build before we run out of runway," it becomes &lt;strong&gt;"what do we need to build to prove this idea works?"&lt;/strong&gt; Those are fundamentally different starting points. The first one forces you to cut. The second one forces you to think.&lt;/p&gt;

&lt;p&gt;Defining an MVP now starts with a clear thesis:&lt;br&gt;
&lt;strong&gt;What, specifically, are you trying to prove?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything after that is a series of attempts to prove it.&lt;/p&gt;

&lt;p&gt;My thesis was that the AI ecosystem needs an independent certification authority, one that can examine capabilities and produce signed, verifiable evidence of what they actually do. Proving that thesis wasn't about a landing page. It required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A real, end-to-end verification pipeline.&lt;/li&gt;
&lt;li&gt;Cryptographic signing that actually held up.&lt;/li&gt;
&lt;li&gt;Consumption mechanisms that worked in production.&lt;/li&gt;
&lt;li&gt;580+ tests and 94% code coverage to back all of it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;None of that was optional. It was the minimum.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The New Minimum&lt;/h2&gt;

&lt;p&gt;MVP still means minimum viable product. The word that changed is "minimum."&lt;/p&gt;

&lt;p&gt;An MVP isn’t a smaller version of your product.&lt;br&gt;
&lt;strong&gt;It’s the smallest complete system that proves your idea works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your thesis requires 50 features to prove, then 50 features is your minimum. If AI lets you build those 50 features with the same rigor a team of engineers would bring, the excuse for shipping less just evaporated.&lt;/p&gt;

&lt;p&gt;Stop defining your MVP by what fits in a sprint. Define it by what it takes to prove your thesis. If you don't have a thesis, that's the actual problem, not your timeline.&lt;/p&gt;

&lt;p&gt;If AI removes the time constraint, then “minimum” is no longer about what you can afford to build. Instead, it’s about what you need to prove. So ask yourself:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you using AI to build better products or just to ship incomplete ones faster?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
