<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: GauntletCI</title>
    <description>The latest articles on Forem by GauntletCI (@gauntletci).</description>
    <link>https://forem.com/gauntletci</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3918796%2F6df556b4-e474-4313-a65a-09116f121db2.png</url>
      <title>Forem: GauntletCI</title>
      <link>https://forem.com/gauntletci</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gauntletci"/>
    <language>en</language>
    <item>
      <title>The Asymmetry of Change: Why Your Tests Are Looking the Wrong Way</title>
      <dc:creator>GauntletCI</dc:creator>
      <pubDate>Sat, 09 May 2026 01:27:41 +0000</pubDate>
      <link>https://forem.com/gauntletci/the-asymmetry-of-change-why-your-tests-are-looking-the-wrong-way-2k2f</link>
      <guid>https://forem.com/gauntletci/the-asymmetry-of-change-why-your-tests-are-looking-the-wrong-way-2k2f</guid>
      <description>&lt;h2&gt;
  
  
  The Asymmetry of Change: Why Your Tests Are Looking the Wrong Way
&lt;/h2&gt;

&lt;p&gt;A passing build is often treated as a certificate of correctness. In reality, it's a narrow contract.&lt;/p&gt;

&lt;p&gt;It doesn't prove your code is right. It proves that the assertions you wrote in the past, against behaviors you anticipated back then, still hold true today.&lt;/p&gt;

&lt;p&gt;When you open a pull request, your unit tests ask: &lt;em&gt;"Does the system still behave the way it used to?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The question you actually need to answer is different: &lt;strong&gt;"Is the new behavior I just introduced safe?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those aren't the same thing. And that gap is exactly where production incidents live.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wrong Question
&lt;/h2&gt;

&lt;p&gt;Here's the problem: tests are a snapshot of past understanding.&lt;/p&gt;

&lt;p&gt;Your code changed. Your tests didn't. And somehow the build is still green.&lt;/p&gt;

&lt;p&gt;A guard clause disappears. No test explicitly covered it because the guard &lt;em&gt;was&lt;/em&gt; the coverage. A condition gets narrowed. An exception handler gets swapped. A state transition loses a validation step.&lt;/p&gt;

&lt;p&gt;The test suite sees none of this, because it was never asked to care about these things. It was asked about something else. Something that still works fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evidence
&lt;/h2&gt;

&lt;p&gt;This isn't theoretical. Multiple independent studies have found the same pattern across the mainstream programming ecosystems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Co-Evolution Studies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 2025 study analyzed 526 repositories across JavaScript, TypeScript, Java, Python, PHP, and C#. Finding: asynchronous evolution of tests and code is pervasive. [1] Earlier work on 975 Java projects reached the same conclusion: production code frequently changes without test updates. [2] This has been documented since at least 2010. [3]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chromium CI Study:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Researchers analyzed 1.5 million test executions across 14,000 commits. Result: even with 99.2% precision, modern flakiness detection still caused 76.2% of real regression faults to be missed. [4] Not because tests were missing. Because the tests that existed were being silenced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Example - Django 6.0:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A refactor in the &lt;code&gt;querystring&lt;/code&gt; template tag introduced a loop that worked fine for standard dictionaries but silently broke &lt;code&gt;QueryDict&lt;/code&gt; instances. Existing tests passed. The bug shipped. It was caught only by a targeted rendered-output test that nobody thought to run regularly. [5]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In an analysis of 598 pull requests across 57 open-source .NET repositories, 71% of PRs submitted without test file modifications contained at least one behavioral risk indicator. [6] That's not an outlier. That's the norm.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Time Machine Problem
&lt;/h2&gt;

&lt;p&gt;Every diff is a time machine moving in one direction.&lt;/p&gt;

&lt;p&gt;The assertions stay where they were written. The code underneath moves forward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: implicit contract&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// After: contract broken, tests don't notice&lt;/span&gt;
&lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guard was always there. Because it was always there, nobody wrote a test for the null case. It was implicit in the structure. The contract was protected by accident.&lt;/p&gt;

&lt;p&gt;Remove that guard, and the test suite stays green. It's not "broken." It just never knew the guard mattered.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;Implicit Contract&lt;/strong&gt; problem. And it's everywhere.&lt;/p&gt;
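&lt;p&gt;The way out is to promote the implicit contract to an explicit one. A minimal sketch, reusing the shape of the hypothetical &lt;code&gt;Process&lt;/code&gt; snippet above (none of these names come from a real codebase): a test that pins the null case, so deleting the guard turns the build red instead of staying green.&lt;/p&gt;

```csharp
using System;

// Mirrors the guard-clause snippet above; all names here are illustrative.
string lastProcessed = null;

void ProcessUser(User user)
{
    if (user == null) return;   // the implicit contract, now pinned below
    lastProcessed = user.Name;
}

// Pinning tests: delete the guard and these throw instead of staying green.
ProcessUser(null);
if (lastProcessed != null)
    throw new Exception("Null guard was removed: a null user was processed.");

ProcessUser(new User("Ada"));
if (lastProcessed != "Ada")
    throw new Exception("A valid user was not processed.");

Console.WriteLine("Guard contract holds.");

record User(string Name);
```

&lt;p&gt;The point isn't the three lines of test code. It's that the null behavior now lives in an assertion someone will see fail, instead of in a structure someone will quietly refactor away.&lt;/p&gt;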




&lt;h2&gt;
  
  
  Why Code Review Isn't Enough
&lt;/h2&gt;

&lt;p&gt;We rely on code review to catch these slips.&lt;/p&gt;

&lt;p&gt;But human reviewers have a context window too. On a Tuesday afternoon, looking at a 400-line diff, they might see a refactor and miss that a crucial exception handler got swapped or a validation step disappeared.&lt;/p&gt;

&lt;p&gt;We are asking humans to perform high-stakes pattern matching against a moving target. It's a process designed for fatigue.&lt;/p&gt;

&lt;p&gt;Plus: reviewers didn't write the original code. They don't carry the full behavioral contract in their head. The removed guard clause looks like cleanup. The narrowed condition looks like a legitimate business rule change.&lt;/p&gt;

&lt;p&gt;Code review is essential. But it's not a safety net. It's a second pair of eyes that also gets tired.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deterministic Answer
&lt;/h2&gt;

&lt;p&gt;Here's what actually works: catch these patterns &lt;em&gt;before&lt;/em&gt; anyone else sees the code.&lt;/p&gt;

&lt;p&gt;Not with an LLM that sometimes forgets what you told it thirty messages ago. Not with probabilities. With deterministic rules that fire the same way every single time.&lt;/p&gt;

&lt;p&gt;A Roslyn-powered engine that scans your diff and flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed guard clauses or defensive conditions&lt;/li&gt;
&lt;li&gt;Narrowed catch blocks (&lt;code&gt;catch(Exception)&lt;/code&gt; → &lt;code&gt;catch(ArgumentException)&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Validation steps removed from state transitions&lt;/li&gt;
&lt;li&gt;Thread-blocking patterns introduced in async code (e.g., new &lt;code&gt;Thread.Sleep()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Behavioral changes that touch no test files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a pattern that has caused real production incidents. Each can slip past a green test suite.&lt;/p&gt;
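&lt;p&gt;The narrowed-catch item is worth seeing in miniature. A hypothetical sketch (the &lt;code&gt;LoadSetting&lt;/code&gt; helper is invented for illustration, not a real rule fixture) of how narrowing &lt;code&gt;catch (Exception)&lt;/code&gt; to &lt;code&gt;catch (ArgumentException)&lt;/code&gt; silently changes the failure contract:&lt;/p&gt;

```csharp
using System;

// Illustrative only: why narrowing a catch is a behavioral change.
string LoadSetting(Exception failure)
{
    try
    {
        throw failure;  // stand-in for a call that can fail in more than one way
    }
    // Was catch (Exception): every failure fell back to the default.
    // Narrowed to ArgumentException: other failures now escape to the caller.
    catch (ArgumentException)
    {
        return "default";
    }
}

Console.WriteLine(LoadSetting(new ArgumentException("bad key")));  // "default", as before

try
{
    LoadSetting(new System.IO.IOException("disk error"));
}
catch (System.IO.IOException)
{
    Console.WriteLine("IOException escaped the narrowed catch.");
}
```

&lt;p&gt;Any test that only exercised the &lt;code&gt;ArgumentException&lt;/code&gt; path stays green through that change.&lt;/p&gt;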

&lt;p&gt;The output is a checklist, not a verdict. You still decide what's actually a risk and what isn't. But you decide with full information, at the moment of change, when the logic is still fresh in your head.&lt;/p&gt;




&lt;h2&gt;
  
  
  Moving the "Uh-Oh" Moment
&lt;/h2&gt;

&lt;p&gt;The most expensive place to have an &lt;strong&gt;"uh-oh"&lt;/strong&gt; moment is in a post-mortem.&lt;/p&gt;

&lt;p&gt;The second most expensive is a failed staging build.&lt;/p&gt;

&lt;p&gt;The goal is to move that realization to your local terminal. The millisecond you hit save. Before you even think about committing.&lt;/p&gt;

&lt;p&gt;When you catch unvalidated behavioral changes while the code is still in front of you, you don't just keep the build green. You ensure the build is actually correct.&lt;/p&gt;

&lt;p&gt;You stop the time machine before it leaves the station.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;If this problem feels familiar, you've already felt the cost of it.&lt;/p&gt;

&lt;p&gt;The question isn't whether these gaps exist. The evidence is clear: they're everywhere. The question is whether you want to keep finding them in production, or find them at the diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn more:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/EricCogen/GauntletCI" rel="noopener noreferrer"&gt;GauntletCI on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gauntletci.com/the-asymmetry-of-change" rel="noopener noreferrer"&gt;The full article on GauntletCI.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gauntletci.com/behavioral-change-risk-formal-framework" rel="noopener noreferrer"&gt;Behavioral Change Risk: A Formal Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a id="ref1"&gt;&lt;/a&gt;[1] Miranda, J. et al. (2025). Test Co-Evolution in Software Projects: A Large-Scale Empirical Study. &lt;em&gt;Journal of Software: Evolution and Process.&lt;/em&gt; DOI: 10.1002/smr.70035&lt;/p&gt;

&lt;p&gt;&lt;a id="ref2"&gt;&lt;/a&gt;[2] Sun, W. et al. (2021). Understanding and Facilitating the Co-Evolution of Production and Test Code. &lt;em&gt;IEEE International Conference on Software Engineering (ICSE).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id="ref3"&gt;&lt;/a&gt;[3] Gergely, T. et al. (2010). Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining. &lt;em&gt;Empirical Software Engineering.&lt;/em&gt; DOI: 10.1007/s10664-010-9143-7&lt;/p&gt;

&lt;p&gt;&lt;a id="ref4"&gt;&lt;/a&gt;[4] Haben, G., Habchi, S., Papadakis, M., Cordy, M., &amp;amp; Le Traon, Y. (2023). The Importance of Discerning Flaky from Fault-triggering Test Failures: A Case Study on the Chromium CI. &lt;em&gt;arXiv:2302.10594.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id="ref5"&gt;&lt;/a&gt;[5] Moreau, M. (2026). How a Single Test Revealed a Bug in Django 6.0. &lt;em&gt;Lincoln Loop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id="ref6"&gt;&lt;/a&gt;[6] Cogen, E. (2025). GauntletCI Corpus Analysis. 598 pull requests across 57 open-source .NET repositories.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Eric I. Cogen builds software for production. Twenty years in .NET, twenty years of shipping bugs that tests never caught. GauntletCI is the pre-commit gate he wishes he'd had all along.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>dotnet</category>
      <category>csharp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Post-Mortem: How a "Performance" PR Introduced 28 New Regressions</title>
      <dc:creator>GauntletCI</dc:creator>
      <pubDate>Fri, 08 May 2026 01:53:14 +0000</pubDate>
      <link>https://forem.com/gauntletci/post-mortem-how-a-performance-pr-introduced-28-new-regressions-4pgf</link>
      <guid>https://forem.com/gauntletci/post-mortem-how-a-performance-pr-introduced-28-new-regressions-4pgf</guid>
      <description>&lt;h2&gt;
  
  
  Analyzing Jellyfin PR #16062 with GauntletCI
&lt;/h2&gt;

&lt;p&gt;Jellyfin PR #16062 is titled &lt;strong&gt;"Query Performance Improvements."&lt;/strong&gt; It was a massive architectural shift: 126 files, 27,810 lines of code. It was reviewed, approved, and merged on May 3, 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By May 7, the community was already reporting 90-second query hangs (Issue #16279).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ran GauntletCI, a deterministic, rules-based Behavioral Change Risk (BCR) detector, against the merged diff. It took exactly &lt;strong&gt;660 ms&lt;/strong&gt; to surface why the "Performance Improvement" was causing performance degradation.&lt;/p&gt;




&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxr5r1a94okw5z6rt4v3.png" alt="The GauntletCI dashboard for Jellyfin PR #16062: 129 findings across 27,000+ lines, fully analyzed in 660 ms" width="800" height="721"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gap Between Intent and Reality
&lt;/h3&gt;

&lt;p&gt;In a 27,000-line diff, human review is a suggestion, not a safeguard. The maintainers intended to fix N+1 query patterns. They succeeded in some areas, but the sheer scale of the change made it impossible to see what was being introduced simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Performance Traps (GCI0044)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Findings:&lt;/strong&gt; 28 LINQ-in-loop patterns.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Reality:&lt;/strong&gt; While the PR closed older performance issues, it introduced 28 new ones. Specifically, 9 findings in &lt;code&gt;BaseItemRepository.TranslateQuery.cs&lt;/code&gt; map directly to the filtering logic that users are now reporting as "unbearably slow." &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Verification:&lt;/strong&gt; &lt;a href="https://github.com/jellyfin/jellyfin/issues/16279" rel="noopener noreferrer"&gt;Issue #16279&lt;/a&gt; ("Filters query taking 90s each time") isn't a mystery. It's a structural regression that was visible in the diff 1546 ms after it was written.&lt;/p&gt;
&lt;/blockquote&gt;
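&lt;p&gt;For readers who don't live in C# query code, the flagged shape looks like this. A minimal sketch with invented data, not Jellyfin's actual filtering logic:&lt;/p&gt;

```csharp
using System;
using System.Linq;

// Invented data; illustrates the LINQ-in-loop shape GCI0044 flags.
var items = Enumerable.Range(0, 1000)
    .Select(i => new { Id = i, Name = "item" + i })
    .ToArray();
var wantedIds = Enumerable.Range(0, 500).ToArray();

// Flagged shape: FirstOrDefault rescans all 1,000 items on every iteration, O(n * m).
foreach (var id in wantedIds)
{
    var hit = items.FirstOrDefault(x => x.Id == id);
}

// Hoisted shape: build the lookup once, then each probe is O(1).
var byId = items.ToDictionary(x => x.Id);
foreach (var id in wantedIds)
{
    byId.TryGetValue(id, out var hit);
}

Console.WriteLine(byId.Count);   // prints 1000
```

&lt;p&gt;On a 1,000-item array the difference is invisible. At the scale of a real media library, it is exactly the kind of cost users feel as a hanging filter query.&lt;/p&gt;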

&lt;h3&gt;
  
  
  2. The Deadlock Time-Bombs (GCI0016)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Findings:&lt;/strong&gt; 5 Block-level async violations (&lt;code&gt;.Wait()&lt;/code&gt; and &lt;code&gt;.GetAwaiter().GetResult()&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Risk:&lt;/strong&gt; These are "Heisenbugs." They often pass both locally and in CI because they require specific concurrency timing to hang a thread pool. &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Status:&lt;/strong&gt; These are currently sitting in the master branch. They haven't "exploded" yet, but the pattern is a well-documented deadlock vector in ASP.NET Core.&lt;/li&gt;
&lt;/ul&gt;
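&lt;p&gt;The flagged pattern, in its simplest form (a hypothetical sketch, not Jellyfin's code):&lt;/p&gt;

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of the sync-over-async shape GCI0016 flags.
async Task DoWorkAsync()
{
    await Task.Delay(10);
}

// Flagged: both lines park a thread until the task finishes. In a console app
// they merely waste a thread; under ASP.NET Core load the same shape can
// starve the thread pool or deadlock.
DoWorkAsync().GetAwaiter().GetResult();
DoWorkAsync().Wait();

// Safe: stay asynchronous end to end.
await DoWorkAsync();

Console.WriteLine("completed without blocking");
```

&lt;p&gt;The console version completes every time, which is precisely why the pattern survives review: it only misbehaves under the concurrency pressure it never sees in CI.&lt;/p&gt;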

&lt;h3&gt;
  
  
  3. The Structural Decay (GCI0038 &amp;amp; GCI0043)
&lt;/h3&gt;

&lt;p&gt;Beyond the crashes, the scan found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;45 Dependency Injection Violations:&lt;/strong&gt; Service locator anti-patterns that create hidden coupling.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;15 Type Safety Gaps:&lt;/strong&gt; &lt;code&gt;as&lt;/code&gt; casts without null checks that lead to context-less &lt;code&gt;NullReferenceException&lt;/code&gt;s.&lt;/li&gt;
&lt;/ul&gt;
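&lt;p&gt;The &lt;code&gt;as&lt;/code&gt;-cast gap in miniature (names invented for illustration):&lt;/p&gt;

```csharp
using System;

// Illustrative sketch of the type-safety shape flagged above.
object payload = 42;

var text = payload as string;        // silently null: payload is an int
try
{
    Console.WriteLine(text.Length);  // NullReferenceException, far from the cast
}
catch (NullReferenceException)
{
    Console.WriteLine("NRE with no context about the failed cast.");
}

// Safer shapes: branch explicitly, so the failure carries context.
if (payload is string s)
    Console.WriteLine(s.Length);
else
    Console.WriteLine("payload is not a string: " + payload.GetType().Name);
```

&lt;p&gt;The stack trace from the first shape points at whatever line happened to dereference the null, not at the cast that produced it.&lt;/p&gt;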




&lt;h3&gt;
  
  
  Execution Profile
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Findings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Block-Level (Merge Stoppers)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scan Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;660 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLMs Used&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why This Happened
&lt;/h3&gt;

&lt;p&gt;The Jellyfin team is talented. The problem isn't the people; it's the &lt;strong&gt;Scale of Change vs. the Speed of Human Cognition.&lt;/strong&gt; Reviewers check for intent; GauntletCI checks for structural risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try It Yourself
&lt;/h3&gt;

&lt;p&gt;You don't need an LLM to find these. You need a &lt;a href="https://github.com/EricCogen/GauntletCI" rel="noopener noreferrer"&gt;pessimistic verifier&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet tool &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; GauntletCI
gauntletci analyze &lt;span class="nt"&gt;--staged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
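&lt;p&gt;To make the gate automatic rather than optional, the same command can sit in a Git pre-commit hook. A minimal sketch; it only uses the documented &lt;code&gt;gauntletci analyze --staged&lt;/code&gt; invocation above, everything else is plain Git plumbing:&lt;/p&gt;

```shell
# Run GauntletCI on the staged diff before every commit.
# Assumes the tool installed in the step above is on PATH.
mkdir -p .git/hooks
printf '#!/bin/sh\nexec gauntletci analyze --staged\n' > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```

&lt;p&gt;Because the hook runs &lt;code&gt;exec&lt;/code&gt;, the commit is blocked whenever the analyzer exits nonzero, which moves the "uh-oh" moment to the terminal, before the diff ever leaves the machine.&lt;/p&gt;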



</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>performance</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
