<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: cognix-dev</title>
    <description>The latest articles on Forem by cognix-dev (@cognix-dev).</description>
    <link>https://forem.com/cognix-dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3434460%2F428f2816-9914-41f5-bfe7-d6292ed3f36d.jpg</url>
      <title>Forem: cognix-dev</title>
      <link>https://forem.com/cognix-dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cognix-dev"/>
    <language>en</language>
    <item>
      <title>AI CLI Coding Tool Execution Accuracy Benchmark: Claude Code vs Aider vs Cognix on the Same LLM</title>
      <dc:creator>cognix-dev</dc:creator>
      <pubDate>Mon, 23 Feb 2026 03:25:27 +0000</pubDate>
      <link>https://forem.com/cognix-dev/ai-cli-coding-tool-execution-accuracy-benchmark-claude-code-vs-aider-vs-cognix-on-the-same-llm-3j9f</link>
      <guid>https://forem.com/cognix-dev/ai-cli-coding-tool-execution-accuracy-benchmark-claude-code-vs-aider-vs-cognix-on-the-same-llm-3j9f</guid>
      <description>&lt;h2&gt;
  
  
  📋 Summary
&lt;/h2&gt;

&lt;p&gt;You've probably been there: AI generates some code, you run it — and it fails.&lt;/p&gt;

&lt;p&gt;It's not a speed problem. You're using the same LLM, but different tools give different results. So where does that gap come from?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The idea behind this experiment: whether code succeeds or fails isn't about how fast the tool is — it's about how the generation pipeline is designed.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We put 3 tools head-to-head with the same LLM (&lt;code&gt;sonnet-4-5&lt;/code&gt;) and the same task.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Aider&lt;/th&gt;
&lt;th&gt;Cognix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Quality *&lt;/td&gt;
&lt;td&gt;4.79&lt;/td&gt;
&lt;td&gt;1.69&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;391s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;191s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;864s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;* Lint errors per 100 lines; lower is better. All values are n=3 averages.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There was a clear difference. And it lined up with differences in pipeline design — the sequence of steps from code generation to validation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;From this experiment: whether code fails comes down to the thickness of the validation layer — how well the tool catches cases where the AI's assumptions turn out to be wrong.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"What makes them different?" — the experimental design, raw data, and breakdown of what makes code actually run are all below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;Most AI coding tool benchmarks measure speed or output volume. Almost none measure whether the generated code actually runs correctly.&lt;/p&gt;

&lt;p&gt;We designed a benchmark around a single, concrete task: "add a feature to an existing Python project." All tools used the same LLM (&lt;code&gt;claude-sonnet-4-5-20250929&lt;/code&gt;), run 3 times each, evaluated across 5 axes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same LLM across all tools: &lt;code&gt;claude-sonnet-4-5-20250929&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execution accuracy: Cognix=100%, Claude Code=100%, Aider=87.5%&lt;/li&gt;
&lt;li&gt;Code quality (lint errors/100 lines): Cognix=0.0, Aider=1.69, Claude Code=4.79&lt;/li&gt;
&lt;li&gt;Speed: Aider fastest (190.6s), Cognix slowest (863.7s)&lt;/li&gt;
&lt;li&gt;Fully reproducible: source code and raw data are published&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why We Did This
&lt;/h2&gt;

&lt;p&gt;Most discussions about "which AI coding tool is better" focus on UI polish, context window size, or how fast it responds. But what developers actually care about is: &lt;strong&gt;does the generated code work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters a lot. A tool that generates code fast is useless if that code breaks when you integrate it into a real project.&lt;/p&gt;

&lt;p&gt;Cognix was built around this problem — not speed, but multi-stage quality validation to ensure generated code meets external contracts.&lt;/p&gt;

&lt;p&gt;To back that up, we needed real numbers. This article is Phase 1 of a benchmark comparing Cognix against two widely-used tools under controlled, reproducible conditions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclosure: The author is the developer of Cognix and co-author of the evaluation script (&lt;code&gt;verify.py&lt;/code&gt;). All artifacts are published and independently verifiable.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Experimental Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Tools and Versions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cognix&lt;/td&gt;
&lt;td&gt;v0.2.5&lt;/td&gt;
&lt;td&gt;Open-source, multi-stage generation pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Latest (2026-02-18)&lt;/td&gt;
&lt;td&gt;Anthropic's official CLI tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Latest (2026-02-18)&lt;/td&gt;
&lt;td&gt;Open-source AI coding assistant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.2 LLM Model
&lt;/h3&gt;

&lt;p&gt;All three tools used the &lt;strong&gt;same model&lt;/strong&gt;: &lt;code&gt;claude-sonnet-4-5-20250929&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the key control variable. By removing LLM differences from the equation, we can actually measure what each tool's pipeline and quality controls contribute.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 The Task: Adding a Feature to an Existing Codebase
&lt;/h3&gt;

&lt;p&gt;We didn't start from scratch — we asked each tool to &lt;strong&gt;add functionality to an existing codebase&lt;/strong&gt;. Specifically: add a RecurringTask feature to a task management CLI app. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding the existing codebase structure&lt;/li&gt;
&lt;li&gt;Implementing a new &lt;code&gt;RecurringRule&lt;/code&gt; model as a dataclass&lt;/li&gt;
&lt;li&gt;Implementing storage functions (&lt;code&gt;save_recurring_rules&lt;/code&gt;, &lt;code&gt;load_recurring_rules&lt;/code&gt;) with correct round-trip behavior&lt;/li&gt;
&lt;li&gt;Implementing a validator (&lt;code&gt;validate_no_circular_dependency&lt;/code&gt;) with the right function signature&lt;/li&gt;
&lt;li&gt;Modifying existing entry points without breaking them&lt;/li&gt;
&lt;li&gt;Passing 8 automated verification tests in &lt;code&gt;verify.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.4 Why This Particular Task?
&lt;/h3&gt;

&lt;p&gt;This task was designed to surface the failure patterns AI-generated code most commonly hits in real production use — &lt;strong&gt;code that silently fails at external interface boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what we were trying to expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-written unit tests pass internally, but external &lt;code&gt;verify.py&lt;/code&gt; fails because &lt;code&gt;save_recurring_rules&lt;/code&gt; can't correctly serialize/deserialize &lt;code&gt;RecurringRule&lt;/code&gt; objects&lt;/li&gt;
&lt;li&gt;Imports work fine, but passing actual objects (instead of dicts) to storage functions raises &lt;code&gt;TypeError&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The model class exists, but the constructor parameter names don't match what the verifier expects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't obscure edge cases. They're &lt;strong&gt;predictable failure patterns&lt;/strong&gt; that show up when an LLM generates code without really understanding the external contracts of what it's writing.&lt;/p&gt;
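&lt;p&gt;The dict-vs-object failure is easy to reproduce. Here is a minimal sketch (the &lt;code&gt;RecurringRule&lt;/code&gt; fields and function bodies are illustrative, not the benchmark's actual model or storage code):&lt;/p&gt;

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass

@dataclass
class RecurringRule:
    task_id: str
    interval_days: int

def save_rules_naive(rules, path):
    # Assumes `rules` is a list of dicts: dataclass instances are
    # not JSON-serializable, so passing real objects raises TypeError.
    with open(path, "w") as f:
        json.dump(rules, f)

def save_rules_robust(rules, path):
    # Converts objects to dicts first, so callers can pass
    # actual RecurringRule instances.
    with open(path, "w") as f:
        json.dump([asdict(r) for r in rules], f)

rule = RecurringRule(task_id="t1", interval_days=7)
path = os.path.join(tempfile.gettempdir(), "rules.json")
try:
    save_rules_naive([rule], path)   # raises TypeError
except TypeError as err:
    print("naive save failed:", err)
save_rules_robust([rule], path)      # succeeds
```

&lt;p&gt;The naive version imports cleanly and can even pass dict-based unit tests; the failure only surfaces when a verifier passes real objects across the interface.&lt;/p&gt;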

&lt;h3&gt;
  
  
  2.5 How We Evaluated
&lt;/h3&gt;

&lt;p&gt;We used &lt;strong&gt;8 automated tests&lt;/strong&gt; in &lt;code&gt;verify.py&lt;/code&gt;. Each test is binary (pass/fail). Execution score = tests passed / 8.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All 61 existing tests still pass (no regressions)&lt;/li&gt;
&lt;li&gt;New tests were added (total &amp;gt; 61)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RecurringRule&lt;/code&gt; model works correctly — &lt;code&gt;should_run()&lt;/code&gt; and &lt;code&gt;advance()&lt;/code&gt; behave as specified&lt;/li&gt;
&lt;li&gt;Recurring storage round-trip — &lt;code&gt;save_recurring_rules()&lt;/code&gt; / &lt;code&gt;load_recurring_rules()&lt;/code&gt; works with actual objects (not dicts)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TaskDependency&lt;/code&gt; model exists, &lt;code&gt;validate_no_circular_dependency()&lt;/code&gt; catches direct and transitive cycles&lt;/li&gt;
&lt;li&gt;Dependency storage round-trip — &lt;code&gt;save_dependencies()&lt;/code&gt; / &lt;code&gt;load_dependencies()&lt;/code&gt; return correct types&lt;/li&gt;
&lt;li&gt;Dashboard command (&lt;code&gt;cmd_dashboard&lt;/code&gt;) and &lt;code&gt;format_dashboard()&lt;/code&gt; are importable and callable&lt;/li&gt;
&lt;li&gt;CLI subcommands exist: &lt;code&gt;recurring-add&lt;/code&gt;, &lt;code&gt;recurring-list&lt;/code&gt;, &lt;code&gt;recurring-run&lt;/code&gt;, &lt;code&gt;dep-add&lt;/code&gt;, &lt;code&gt;dep-list&lt;/code&gt;, &lt;code&gt;dashboard&lt;/code&gt; (the test passes if at least 5 of the 6 exist)&lt;/li&gt;
&lt;/ol&gt;
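&lt;p&gt;The scoring rule is simply "binary checks, score = tests passed / total." As a sketch (the check names are placeholders, not the actual &lt;code&gt;verify.py&lt;/code&gt; internals):&lt;/p&gt;

```python
def run_checks(checks):
    # Each check is a zero-argument callable returning True or False.
    results = {name: bool(fn()) for name, fn in checks.items()}
    score = sum(results.values()) / len(results)
    return results, score

# Illustrative stand-ins for the 8 verify.py tests
checks = {f"test_{i}": (lambda: True) for i in range(1, 8)}
checks["test_8"] = lambda: False   # one failing check
results, score = run_checks(checks)
print(f"{score:.1%}")  # 87.5%
```

&lt;p&gt;A 7-of-8 result under this rule is exactly the 87.5% figure reported for Aider below.&lt;/p&gt;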

&lt;h3&gt;
  
  
  2.6 Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Unit&lt;/th&gt;
&lt;th&gt;Better&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;verify.py test pass rate&lt;/td&gt;
&lt;td&gt;%&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All imports resolve at runtime&lt;/td&gt;
&lt;td&gt;%&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style/quality errors per 100 lines (ruff)&lt;/td&gt;
&lt;td&gt;errors/100 lines&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required features reflected in output&lt;/td&gt;
&lt;td&gt;%&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wall time from prompt to completion&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.7 Protocol
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each tool ran 3 independent times from a clean project state&lt;/li&gt;
&lt;li&gt;Every run started from the same unmodified base project&lt;/li&gt;
&lt;li&gt;No manual intervention during generation&lt;/li&gt;
&lt;li&gt;Results recorded directly from &lt;code&gt;verify.py&lt;/code&gt; output&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Summary (3-run average)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;All values are averages of n=3 independent runs.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Cognix&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Aider&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.79&lt;/td&gt;
&lt;td&gt;1.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;863.7s&lt;/td&gt;
&lt;td&gt;390.8s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;190.6s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3.2 Raw Data: All Runs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cognix (v0.2.5)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Exec&lt;/th&gt;
&lt;th&gt;Dep&lt;/th&gt;
&lt;th&gt;Lint&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;930.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;891.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;769.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;863.7s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Exec&lt;/th&gt;
&lt;th&gt;Dep&lt;/th&gt;
&lt;th&gt;Lint&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;4.27&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;410.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;5.25&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;409.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;4.86&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;352.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.79&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;390.8s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Exec&lt;/th&gt;
&lt;th&gt;Dep&lt;/th&gt;
&lt;th&gt;Lint&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;5.06&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;187.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;189.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;194.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.69&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;190.6s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Execution Accuracy
&lt;/h3&gt;

&lt;p&gt;Cognix and Claude Code hit exec=100% on all 3 runs. Aider consistently came in at 87.5% (7 of 8 tests passing; shown as 88% in the per-run tables due to rounding) — and it failed on the same test every single time.&lt;/p&gt;

&lt;p&gt;That consistency is the interesting part. It wasn't random LLM variance — it was the same failure, 3 times in a row. What we observed: Aider consistently generated storage functions that work fine with dict input but throw &lt;code&gt;TypeError&lt;/code&gt; when you pass actual &lt;code&gt;RecurringRule&lt;/code&gt; instances. We'll dig into the specific failure in a follow-up article.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Code Quality (Lint)
&lt;/h3&gt;

&lt;p&gt;Cognix is the only tool that hit lint=0.00 on all 3 runs. That's because Cognix's pipeline includes an auto-fix loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate code&lt;/li&gt;
&lt;li&gt;Run lint check (ruff/flake8)&lt;/li&gt;
&lt;li&gt;LLM auto-fixes violations&lt;/li&gt;
&lt;li&gt;Re-check, repeat until clean&lt;/li&gt;
&lt;/ol&gt;
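&lt;p&gt;In code form, the loop looks roughly like this (a sketch with a stubbed linter and fixer; in the real pipeline the two slots are a ruff invocation and an LLM call):&lt;/p&gt;

```python
def autofix_loop(code, run_lint, llm_fix, max_rounds=5):
    # run_lint(code) returns a list of lint errors (e.g. from ruff);
    # llm_fix(code, errors) returns revised code (an LLM call in practice).
    for _ in range(max_rounds):
        errors = run_lint(code)
        if not errors:
            return code, []            # converged: lint-clean
        code = llm_fix(code, errors)
    return code, run_lint(code)        # bail out after max_rounds

# Toy stand-ins: flag `== True` comparisons, then strip them
fake_lint = lambda c: ["E712 comparison to True"] if " == True" in c else []
fake_fix = lambda c, errs: c.replace(" == True", "")

clean, remaining = autofix_loop("if ok == True:\n    pass", fake_lint, fake_fix)
print(remaining)  # []
```

&lt;p&gt;Because the loop re-checks after every fix, its output converges to a lint-clean state regardless of how noisy the initial generation was.&lt;/p&gt;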

&lt;p&gt;Claude Code doesn't include lint checking or auto-fix, so whatever style issues the LLM introduces just stay there — averaging 4.79 errors/100 lines. That's code that wouldn't pass a standard CI lint gate.&lt;/p&gt;

&lt;p&gt;Aider's lint score swung widely across runs (0.00–5.06). Its code quality is purely a byproduct of raw LLM output, not the result of a controlled quality gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Speed
&lt;/h3&gt;

&lt;p&gt;Speed ranking: Aider (190.6s) &amp;lt; Claude Code (390.8s) &amp;lt; Cognix (863.7s)&lt;/p&gt;

&lt;p&gt;Cognix is about 4.5x slower than Aider and 2.2x slower than Claude Code. That's the expected cost of running more stages:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Code Generation → Lint Check &amp;amp; Auto-fix → Code Review → Test Execution &amp;amp; Auto-fix → API Contract Validation → Quality Assessment&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each stage takes time. The tradeoff is intentional: slower generation, but stronger guarantees on accuracy and quality.&lt;/p&gt;

&lt;p&gt;Whether that tradeoff makes sense depends on what you're doing. For CI/CD automation or complex feature work where correctness really matters, the extra time is worth it. For quick prototypes or small edits, a faster tool probably makes more sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Stability
&lt;/h3&gt;

&lt;p&gt;Cognix had zero variance across all 5 metrics over 3 runs. Claude Code had slight lint variance (4.27–5.25). Aider had no exec variance (stuck at 88% every run) but big lint variance (0.00–5.06).&lt;/p&gt;

&lt;p&gt;Cognix's consistency comes from deterministic post-processing. No matter how much the LLM output varies, the lint fix loop always converges to 0.00, and API Contract Validation catches interface issues before they reach the verifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.5 Hypothesis: A Structural Blind Spot in AI Coding Tools
&lt;/h3&gt;

&lt;p&gt;Here's what's really going on: AI writes code that &lt;em&gt;looks&lt;/em&gt; right. But it fills in assumptions about how that code will be called — types, arguments, return values. When those assumptions are wrong, the code breaks. Whether a tool can catch those wrong assumptions is what separates the results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;From this experiment: whether code fails comes down to the thickness of the validation layer — how well the tool catches cases where the AI's assumptions turn out to be wrong.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aider failed in the same spot all 3 times. That's not random — it's what you'd expect from a pipeline that doesn't verify external contracts (in this case, whether storage functions actually handle real objects correctly).&lt;/p&gt;

&lt;p&gt;And this probably isn't just an Aider issue. Any tool without a validation loop is structurally more likely to ship "plausible-looking code" that fails real-world checks.&lt;/p&gt;
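&lt;p&gt;Even a thin slice of such a validation layer can catch this failure class: check generated signatures against the expected contract before running anything. A hedged sketch of the idea (not Cognix's actual API):&lt;/p&gt;

```python
import inspect

def signature_matches(fn, expected_params):
    # True when fn's parameter names match the contract exactly.
    return list(inspect.signature(fn).parameters) == list(expected_params)

# A generated function vs. the parameter names the verifier expects
def validate_no_circular_dependency(dependencies, new_dep):
    return True

print(signature_matches(validate_no_circular_dependency,
                        ["dependencies", "new_dep"]))   # True
print(signature_matches(validate_no_circular_dependency,
                        ["deps", "new_dep"]))           # False
```

&lt;p&gt;A check like this is cheap and deterministic, which is exactly why a pipeline stage can enforce it even when the LLM's assumptions drift.&lt;/p&gt;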

&lt;p&gt;That said — this is a hypothesis. n=3, one task, one language isn't enough to confirm it. We need more task types, more languages, bigger samples. That's what Phase 2 is for.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Single Task Type
&lt;/h3&gt;

&lt;p&gt;This benchmark only covers one scenario: feature addition to an existing Python project. We can't generalize to all code generation use cases. New projects from scratch, bug fixes, refactoring, or non-Python work might look quite different.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Why This Task?
&lt;/h3&gt;

&lt;p&gt;We didn't pick it arbitrarily. It was designed to expose the failure pattern that shows up most often when developers use AI tools in real production environments — &lt;strong&gt;code that silently fails at external interface boundaries&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Working with existing APIs (not just writing standalone functions)&lt;/li&gt;
&lt;li&gt;External interface contracts (storage round-trips, function signatures)&lt;/li&gt;
&lt;li&gt;Cross-file consistency (model, storage, validators, and entry points all have to agree)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.3 Sample Size
&lt;/h3&gt;

&lt;p&gt;Three runs is a small sample. The Cognix and Claude Code results (exec=100%, zero variance) are at least internally consistent, but Aider's numbers (87.5% exec, large lint variance) should be interpreted with more caution.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 What's Next
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More task types&lt;/strong&gt;: new projects, bug fixes, refactoring, real-world scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: benchmarking each tool at its optimal settings (e.g., Aider with &lt;code&gt;--architect&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure analysis&lt;/strong&gt;: digging into exactly which tests fail and why&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis validation&lt;/strong&gt;: testing the relationship between validation layer thickness and execution accuracy across multiple tasks and languages&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;On a feature-addition task focused on external API contract correctness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cognix matches Claude Code on execution accuracy&lt;/strong&gt; (both exec=100%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognix leads on code quality at lint=0.00&lt;/strong&gt; (Claude Code 4.79, Aider 1.69)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognix beats Aider on execution accuracy&lt;/strong&gt; (100% vs. 87.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognix is the slowest&lt;/strong&gt; (863.7s vs. Claude Code 390.8s, Aider 190.6s)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data is consistent with the hypothesis: multi-stage quality validation produces more reliable, cleaner code on complex integration tasks — at the cost of speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try Cognix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;cognix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://cognix-dev.github.io/cognix/" rel="noopener noreferrer"&gt;https://cognix-dev.github.io/cognix/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Reproducibility
&lt;/h2&gt;

&lt;p&gt;Everything is published:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prompt.md&lt;/code&gt; — task specification&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;verify.py&lt;/code&gt; — evaluation script (8 test items)&lt;/li&gt;
&lt;li&gt;Generated code from each tool, each run&lt;/li&gt;
&lt;li&gt;Raw JSON result data&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Repository: &lt;a href="https://github.com/cognix-dev/cognix/tree/main/benchmark/phase1" rel="noopener noreferrer"&gt;cognix/benchmark/phase1&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This is Phase 1 of a benchmark series. Phase 2 will cover more task types, optimal tool configurations, and validation of the hypothesis in section 4.5.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>aider</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Don't Trust. Verify. Quality-First AI Development</title>
      <dc:creator>cognix-dev</dc:creator>
      <pubDate>Tue, 10 Feb 2026 10:37:18 +0000</pubDate>
      <link>https://forem.com/cognix-dev/dont-trust-verify-quality-first-ai-development-2dio</link>
      <guid>https://forem.com/cognix-dev/dont-trust-verify-quality-first-ai-development-2dio</guid>
      <description>&lt;h2&gt;
  
  
  AI Got Faster. But Did Working Code Increase?
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.6, Codex 5.3. Models evolved, agents multiplied, processing became incredibly fast.&lt;/p&gt;

&lt;p&gt;Flashy demos, parallel execution, autonomous coding.&lt;/p&gt;

&lt;p&gt;But have you ever thought this while actually doing AI coding?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"While I sleep, a high-quality product gets completed, and when I wake up, it's out in the world."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;...Is that actually happening?&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality: Disappointment Every Morning
&lt;/h2&gt;

&lt;p&gt;Here's my reality.&lt;/p&gt;

&lt;p&gt;I wake up and check the code I left to AI last night. It doesn't work. Full of bugs. Hallucinations. Calling APIs that don't exist.&lt;/p&gt;

&lt;p&gt;In the end, time spent debugging has increased compared to before using AI.&lt;/p&gt;

&lt;p&gt;Tools competing on speed have multiplied. But have tools that produce "working code" increased?&lt;/p&gt;

&lt;h2&gt;
  
  
  Going All-In on Quality
&lt;/h2&gt;

&lt;p&gt;So I changed my approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not speed. Quality. All in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't let AI run free. Control it thoroughly. Pack in obsessive quality checks.&lt;/p&gt;

&lt;p&gt;That's how I built &lt;strong&gt;Cognix&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  8 Quality Mechanisms
&lt;/h2&gt;

&lt;p&gt;Cognix has 8 quality assurance mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Two-Layer Scope Defense&lt;/strong&gt; - Eliminate AI "overreach"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formal Proof&lt;/strong&gt; - Won't execute without proof&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structural Integrity Check&lt;/strong&gt; - Detect and repair invisible structural breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation Chain&lt;/strong&gt; - Auto-eliminate framework-specific bugs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Stage Generation&lt;/strong&gt; - Maintain consistency in large projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Generation Validator&lt;/strong&gt; - Auto-complete missing files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Validation&lt;/strong&gt; - Never return non-working code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;25-Type Comprehensive Review&lt;/strong&gt; - Catch what lint misses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details on each feature:&lt;br&gt;
&lt;a href="https://cognix-dev.github.io/cognix/" rel="noopener noreferrer"&gt;https://cognix-dev.github.io/cognix/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Free and Open Source
&lt;/h2&gt;

&lt;p&gt;Cognix is free on GitHub. Apache 2.0 License.&lt;/p&gt;

&lt;p&gt;If you're facing the same problem, try it out or use it as reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/cognix-dev/cognix" rel="noopener noreferrer"&gt;https://github.com/cognix-dev/cognix&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;cognix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  About This Series
&lt;/h2&gt;

&lt;p&gt;Over 9 posts, I'll explain the 8 quality features in detail.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why I built them&lt;/li&gt;
&lt;li&gt;What perspective I used to address each problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love to share this knowledge with you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next: Two-Layer Scope Defense - Preventing AI from changing code on its own&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
<title>Why AI Coding Tools Get It Wrong — Understanding the Technical Limits</title>
      <dc:creator>cognix-dev</dc:creator>
      <pubDate>Sun, 07 Sep 2025 11:29:27 +0000</pubDate>
      <link>https://forem.com/cognix-dev/why-ai-coding-tools-get-it-wrong-understanding-the-technical-limits-38ji</link>
      <guid>https://forem.com/cognix-dev/why-ai-coding-tools-get-it-wrong-understanding-the-technical-limits-38ji</guid>
      <description>&lt;p&gt;AI coding tools are powerful, but they make mistakes by design. Learn why Copilot, Claude Code, and Cursor fail — and how to avoid common pitfalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction: It's Not Because AI Is “Dumb”&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Claude Code, GitHub Copilot, Cursor.&lt;br&gt;
These AI coding tools are incredibly powerful, but if you’ve used them for real work, you’ve probably hit some walls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The build doesn’t pass.&lt;/li&gt;
&lt;li&gt;Unit tests break.&lt;/li&gt;
&lt;li&gt;Sometimes, the AI even suggests code that could accidentally wipe your production database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s easy to assume this happens because “AI isn’t smart enough yet.”&lt;br&gt;
But the truth is more subtle:&lt;br&gt;
&lt;strong&gt;these tools are designed within technical constraints that make mistakes inevitable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through real examples — with code — to explain &lt;strong&gt;why AI coding tools make mistakes&lt;/strong&gt; and what’s being done to improve them.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Three Common Failure Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Failure 1: Missing Dependencies in Large Repositories&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you ask AI to “update this function,” it often misses a call site somewhere else, leading to broken tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# models.py
def update_user_profile(user_id, payload):
    # Existing implementation
    pass

# services.py
def process_user():
    update_user_profile(uid, payload)

# AI-generated version (new required parameter; the call site in
# services.py is never updated)
def update_user_profile(user_id, payload, is_admin):
    # New implementation
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;br&gt;
The build passes, but the tests fail at the unchanged call site:&lt;br&gt;
&lt;code&gt;TypeError: update_user_profile() missing 1 required positional argument: 'is_admin'&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Why it happens&lt;/strong&gt;&lt;br&gt;
AI tools can only "see" &lt;strong&gt;within their context window&lt;/strong&gt; — a limited slice of code provided to the model.&lt;br&gt;
For large repositories, IDEs or plugins pick a subset of “relevant” files to send to the AI.&lt;br&gt;
But &lt;strong&gt;fully mapping every dependency is technically hard&lt;/strong&gt; — missing a file or two is inevitable.&lt;/p&gt;
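A cheap safeguard against this failure mode is to enumerate call sites mechanically before accepting a signature change. A minimal sketch using Python's stdlib `ast` module (the `find_call_sites` helper is our illustration, not part of any of these tools):

```python
import ast
from pathlib import Path

def find_call_sites(root: str, func_name: str):
    """List (file, line) for every call to func_name under root."""
    hits = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                # Match plain calls f(...) and method-style calls obj.f(...)
                callee = node.func
                name = getattr(callee, "id", None) or getattr(callee, "attr", None)
                if name == func_name:
                    hits.append((str(path), node.lineno))
    return hits
```

Running this for `update_user_profile` before merging the AI's edit would have surfaced the forgotten caller in `services.py`.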

&lt;p&gt;&lt;strong&gt;Failure 2: Suggesting Deprecated APIs&lt;/strong&gt;&lt;br&gt;
You’re on the latest library version, but the AI suggests using a deprecated API, causing runtime errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In pandas 2.0+, DataFrame.append() has been removed
df = df.append(new_row, ignore_index=True)
# Runtime: AttributeError: 'DataFrame' object has no attribute 'append'

# Correct way (new_row must itself be a DataFrame, e.g. pd.DataFrame([row_dict])):
df = pd.concat([df, new_row], ignore_index=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it happens&lt;/strong&gt;&lt;br&gt;
AI models are trained on a &lt;strong&gt;snapshot of knowledge&lt;/strong&gt; from their last training cutoff.&lt;br&gt;
Some tools integrate RAG (Retrieval-Augmented Generation) to pull the latest docs,&lt;br&gt;
but the &lt;strong&gt;updates aren’t guaranteed to be perfect or always up-to-date&lt;/strong&gt;.&lt;/p&gt;
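One mitigation you can apply today, without waiting on better RAG: keep a small table of known removals and check the installed version before trusting a suggestion. A stdlib-only sketch (the `REMOVED_IN` table and function name are our illustration; the pandas 2.0 removal itself is real):

```python
# Known API removals: (package, api) -> first version without it.
# Toy table for illustration; DataFrame.append really was removed in pandas 2.0.
REMOVED_IN = {
    ("pandas", "DataFrame.append"): (2, 0),
}

def api_was_removed(package: str, api: str, installed_version: str) -> bool:
    """True if `api` no longer exists in `installed_version` of `package`."""
    removal = REMOVED_IN.get((package, api))
    if removal is None:
        return False  # no removal on record
    major_minor = tuple(int(part) for part in installed_version.split(".")[:2])
    return major_minor >= removal
```

A tool that ran a check like this against the AI's output would flag `df.append` on pandas 2.x before you ever hit the AttributeError.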

&lt;p&gt;&lt;strong&gt;Failure 3: Type Errors Due to Incorrect Inference&lt;/strong&gt;&lt;br&gt;
In TypeScript, you sometimes get code that looks correct but crashes at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface User {
  name?: string;
}

function processUser(user: User) {
  // AI forgot optional chaining (tsc with strictNullChecks would flag this;
  // without strict mode, it compiles and crashes at runtime)
  return user.name.toUpperCase();
  // Runtime: TypeError: Cannot read properties of undefined
}

// Correct way:
function processUser(user: User) {
  return user.name?.toUpperCase() ?? 'UNKNOWN';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it happens&lt;/strong&gt;&lt;br&gt;
AI doesn’t actually run a &lt;strong&gt;type checker&lt;/strong&gt;.&lt;br&gt;
It predicts the “most likely” code based on patterns it has seen,&lt;br&gt;
which means it can &lt;strong&gt;miss subtle constraints&lt;/strong&gt; like nullable or optional properties.&lt;/p&gt;
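The Python analogue of this bug shows why wiring a checker in helps: `mypy` rejects the unchecked access below even though the interpreter happily imports it (function names are illustrative):

```python
from typing import Optional

def process_user(name: Optional[str]) -> str:
    # Unsafe version -- mypy reports (roughly):
    #   Item "None" of "Optional[str]" has no attribute "upper"
    # return name.upper()

    # Safe version: handle the None case explicitly
    return name.upper() if name is not None else "UNKNOWN"
```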

&lt;h2&gt;
  
  
  3. Why AI Gets It Wrong — The Technical Constraints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3.1 The Context Window Limit&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude 3.5 Sonnet: ~200K tokens&lt;/li&gt;
&lt;li&gt;GPT-4 Turbo: ~128K tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds huge, but even that isn’t enough to process every file in a 500+ file repository. IDEs try to guess which files matter most, but selection algorithms aren’t perfect.&lt;/p&gt;
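Rough arithmetic makes the squeeze concrete. Assuming ~4 characters per token (a common rule of thumb, not an exact figure):

```python
def estimate_tokens(total_chars: int, chars_per_token: int = 4) -> int:
    """Rough token estimate from character count (~4 chars/token heuristic)."""
    return total_chars // chars_per_token

# 500 files averaging 8 KB of source:
repo_chars = 500 * 8_000
print(estimate_tokens(repo_chars))  # 1000000 -- five times a 200K window
```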

&lt;p&gt;&lt;strong&gt;3.2 Knowledge Freshness&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude 3.5 Sonnet was trained on data up to &lt;strong&gt;April 2024&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Any breaking API changes after that? The model won’t know.&lt;/li&gt;
&lt;li&gt;As a result, AI often suggests “&lt;strong&gt;the most common but outdated patterns&lt;/strong&gt;.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some tools mitigate this with RAG — dynamically pulling docs from the web — but &lt;strong&gt;accuracy depends on search quality and update frequency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.3 Lack of Static Analysis and Execution&lt;/strong&gt;&lt;br&gt;
AI doesn’t execute the code it writes.&lt;br&gt;
It works by predicting the next likely token, not by validating correctness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No TypeScript compiler or &lt;code&gt;mypy&lt;/code&gt; integration by default.&lt;/li&gt;
&lt;li&gt;No runtime checks unless the IDE explicitly runs the code.&lt;/li&gt;
&lt;li&gt;No feedback loop unless you test it yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: code that &lt;strong&gt;looks right but breaks at runtime&lt;/strong&gt;.&lt;/p&gt;
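Until tools close this loop for you, you can close a crude version of it yourself: gate every AI-generated snippet through a checker before it lands. A stdlib-only sketch that catches syntax errors (a fuller gate would also shell out to `mypy` or `tsc`; the `syntax_gate` name is ours):

```python
import ast

def syntax_gate(generated_code: str):
    """Parse AI output; return (ok, message) instead of trusting it blindly."""
    try:
        ast.parse(generated_code)
    except SyntaxError as exc:
        # Feed this message back to the model rather than accepting the code
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "parses cleanly"
```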

&lt;h2&gt;
  
  
  4. How Different Tools Handle This Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;4.1 GitHub Copilot&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Approach&lt;/strong&gt;: IDE integration with local file context.&lt;br&gt;
&lt;strong&gt;- Strengths&lt;/strong&gt;: Fast, smooth completions.&lt;br&gt;
&lt;strong&gt;- Limitations&lt;/strong&gt;: Often struggles with cross-file dependencies in large repos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.2 Cursor&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Approach&lt;/strong&gt;: Full-repo indexing + RAG.&lt;br&gt;
&lt;strong&gt;- Strengths&lt;/strong&gt;: Better at understanding large codebases.&lt;br&gt;
&lt;strong&gt;- Limitations&lt;/strong&gt;: Index updates can lag, leading to outdated suggestions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.3 Claude Code&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Approach&lt;/strong&gt;: Terminal-based file editing with explicit user control.&lt;br&gt;
&lt;strong&gt;- Strengths&lt;/strong&gt;: Transparent — you choose which files to expose.&lt;br&gt;
&lt;strong&gt;- Limitations&lt;/strong&gt;: Accuracy depends on you picking the right files.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Road Ahead — Emerging Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;5.1 Sandbox Execution&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Idea&lt;/strong&gt;: Run generated code in an isolated environment.&lt;br&gt;
&lt;strong&gt;- Benefit&lt;/strong&gt;: Move from guessing to verifying.&lt;br&gt;
&lt;strong&gt;- Challenge&lt;/strong&gt;: Security risks, slower feedback loops.&lt;/p&gt;
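A toy version of the idea, to make it concrete: run the generated code in a separate interpreter with a timeout. This is not real sandboxing (a production setup needs containers, seccomp, no network access, which is exactly the security challenge above), but it turns "looks right" into an observed exit code:

```python
import subprocess
import sys
import tempfile

def run_generated(code: str, timeout: float = 5.0):
    """Execute code in a child interpreter; return (returncode, stdout, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```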

&lt;p&gt;&lt;strong&gt;5.2 Static Analysis Integration&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Idea&lt;/strong&gt;: Combine AI generation with TypeScript, ESLint, mypy, etc.&lt;br&gt;
&lt;strong&gt;- Benefit&lt;/strong&gt;: Catch type and syntax errors early.&lt;br&gt;
&lt;strong&gt;- Status&lt;/strong&gt;: Some IDEs are beginning to experiment with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.3 Dynamic Knowledge Updates (RAG)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Idea&lt;/strong&gt;: Fetch the latest docs and Stack Overflow threads on the fly.&lt;br&gt;
&lt;strong&gt;- Benefit&lt;/strong&gt;: Stay aligned with API changes and evolving best practices.&lt;br&gt;
&lt;strong&gt;- Challenge&lt;/strong&gt;: Still dependent on search precision and doc quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Takeaways — Use AI Wisely, Don’t Trust Blindly
&lt;/h2&gt;

&lt;p&gt;AI coding tools make mistakes not because they’re “dumb,” but because of &lt;strong&gt;hard technical limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context window size&lt;/li&gt;
&lt;li&gt;Training data freshness&lt;/li&gt;
&lt;li&gt;Lack of static analysis and runtime validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical tips&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always review and test AI-generated code.&lt;/li&gt;
&lt;li&gt;Apply big changes incrementally.&lt;/li&gt;
&lt;li&gt;Combine AI tools with type checkers and linters for safety.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. What’s Next?
&lt;/h2&gt;

&lt;p&gt;The next breakthroughs may come from:&lt;br&gt;
&lt;strong&gt;- Larger context windows&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Faster real-time code execution&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- Deeper integration with static analyzers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do you see this evolving?&lt;br&gt;
Do you want smarter reasoning, better execution checks, or real-time context integration?&lt;/p&gt;

&lt;p&gt;Let’s discuss in the comments. 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>coding</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Why I Founded Cognix: Creating Reliable AI Tools</title>
      <dc:creator>cognix-dev</dc:creator>
      <pubDate>Fri, 15 Aug 2025 06:50:56 +0000</pubDate>
      <link>https://forem.com/cognix-dev/why-i-founded-cognix-creating-reliable-ai-tools-1be5</link>
      <guid>https://forem.com/cognix-dev/why-i-founded-cognix-creating-reliable-ai-tools-1be5</guid>
      <description>&lt;h1&gt;
  
  
  Building a Trustworthy AI Command Line Tool for Developers
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://cognix-dev.hashnode.dev/why-i-founded-cognix-creating-reliable-ai-tools" rel="noopener noreferrer"&gt;Hashnode&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Trust Problem in AI Code Generation&lt;/strong&gt;&lt;br&gt;
As developers, we've all been there. You ask an AI tool to generate some code, it spits out something that looks reasonable, and then... it breaks in production. Or worse, it fails silently, introducing subtle bugs that take hours to track down.&lt;/p&gt;

&lt;p&gt;I've watched countless developers struggle with this fundamental trust issue. AI code generation tools are incredibly powerful, but they often feel like a black box. You get code, but you don't get confidence. You get speed, but you sacrifice reliability.&lt;/p&gt;

&lt;p&gt;After months of experiencing this frustration firsthand, I realized something crucial: the problem isn't that AI makes mistakes—it's that we don't have the right tools to catch and prevent those mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Reliability Matters More Than Ever&lt;/strong&gt;&lt;br&gt;
In 2025, AI-generated code is becoming ubiquitous. We're not just using it for quick prototypes anymore—it's becoming part of critical business logic, infrastructure code, and systems that thousands of users depend on.&lt;/p&gt;

&lt;p&gt;But here's the paradox: as AI gets better at writing code, the stakes for when it gets things wrong keep getting higher.&lt;/p&gt;

&lt;p&gt;I started thinking about what reliability actually means in the context of AI development tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable behavior: The tool should work consistently across different scenarios&lt;/li&gt;
&lt;li&gt;Error detection: When something goes wrong, you should know immediately&lt;/li&gt;
&lt;li&gt;Contextual awareness: The AI should understand your project's specific patterns and constraints&lt;/li&gt;
&lt;li&gt;Incremental improvement: The tool should learn from your feedback and get better over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional AI code generators optimize for speed and capability, but they often miss these reliability fundamentals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introducing Cognix: AI Development Assistant Built for Trust&lt;/strong&gt;&lt;br&gt;
That's why I started building Cognix - an AI-powered CLI assistant designed from the ground up with reliability as the core principle.&lt;/p&gt;

&lt;p&gt;Cognix isn't just another code generation tool. It's a development partner that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Remembers Your Context&lt;/strong&gt;&lt;br&gt;
Unlike stateless AI tools, Cognix builds a persistent memory of your project. It learns your coding patterns, understands your architecture decisions, and adapts to your team's conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Validates Before Suggesting&lt;/strong&gt;&lt;br&gt;
Every code suggestion goes through multiple validation layers - syntax checking, pattern matching against your existing codebase, and dependency verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤝 Learns From Your Feedback&lt;/strong&gt;&lt;br&gt;
When you correct Cognix or reject a suggestion, it doesn't forget. This feedback loops back to improve future suggestions specifically for your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚡ Stays Out of Your Way&lt;/strong&gt;&lt;br&gt;
Built as a lightweight CLI, Cognix integrates seamlessly into your existing workflow without forcing you to change your development environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Technical Approach: How We're Solving Trust&lt;/strong&gt;&lt;br&gt;
Building a reliable AI development tool requires rethinking the traditional approach. Here's how Cognix addresses the core reliability challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Architecture&lt;/strong&gt;&lt;br&gt;
Instead of treating each interaction as isolated, Cognix maintains a persistent context layer that builds up understanding over time. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project structure mapping&lt;/li&gt;
&lt;li&gt;Code pattern recognition&lt;/li&gt;
&lt;li&gt;Error pattern learning&lt;/li&gt;
&lt;li&gt;User preference modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Validation&lt;/strong&gt;&lt;br&gt;
Before any code reaches you, it passes through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Syntax validation for immediate error catching&lt;/li&gt;
&lt;li&gt;Pattern matching against your existing codebase&lt;/li&gt;
&lt;li&gt;Dependency checking to prevent integration issues&lt;/li&gt;
&lt;li&gt;Style consistency enforcement based on your project's patterns&lt;/li&gt;
&lt;/ol&gt;
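To make the layering concrete, here is an illustrative sketch of how such a pipeline can be wired, with a syntax layer and a toy pattern layer. This is our example of the general shape, not Cognix's actual implementation:

```python
import ast

def check_syntax(code: str):
    """Layer 1: reject code that does not even parse."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"syntax: line {exc.lineno}: {exc.msg}"

def check_patterns(code: str, forbidden=("eval", "exec")):
    """Layer 2 (toy stand-in for pattern/dependency checks)."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and getattr(node.func, "id", "") in forbidden:
            return f"pattern: forbidden call {node.func.id!r}"
    return None

def validate(code: str):
    """Run layers in order; return a list of failures (empty means passed)."""
    err = check_syntax(code)
    if err:
        return [err]  # later layers need a parseable tree
    err = check_patterns(code)
    return [err] if err else []
```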

&lt;p&gt;&lt;strong&gt;Feedback Loop Integration&lt;/strong&gt;&lt;br&gt;
Every interaction with Cognix contributes to a learning cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepted suggestions reinforce successful patterns&lt;/li&gt;
&lt;li&gt;Rejected suggestions train negative examples&lt;/li&gt;
&lt;li&gt;User corrections create specific improvement targets&lt;/li&gt;
&lt;li&gt;Error reports feed back into validation improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Incremental Reliability&lt;/strong&gt;&lt;br&gt;
Rather than trying to be perfect from day one, Cognix is designed to get more reliable over time as it learns your specific context and requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Next: Building in Public&lt;/strong&gt;&lt;br&gt;
I'm building Cognix in public because I believe the best developer tools are built with developers, not just for them.&lt;/p&gt;

&lt;p&gt;Over the next few weeks, you'll see regular updates on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical deep dives into the reliability architecture&lt;/li&gt;
&lt;li&gt;Development progress with real examples and demos&lt;/li&gt;
&lt;li&gt;Design decisions and the reasoning behind them&lt;/li&gt;
&lt;li&gt;Early access opportunities for community feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm particularly interested in connecting with developers who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have been burned by unreliable AI tools before&lt;/li&gt;
&lt;li&gt;Work on projects where code reliability is critical&lt;/li&gt;
&lt;li&gt;Are interested in the intersection of AI and developer experience&lt;/li&gt;
&lt;li&gt;Want to help shape the future of AI development tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Join the Journey&lt;/strong&gt;&lt;br&gt;
Building reliable AI tools is hard, but it's also incredibly important. As AI becomes more integrated into our development workflows, we need tools that enhance our capabilities without sacrificing the reliability and trust that great software requires.&lt;/p&gt;

&lt;p&gt;If you're interested in following along or contributing to this journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow my progress here on this blog&lt;/li&gt;
&lt;li&gt;Connect with me on GitHub &lt;a class="mentioned-user" href="https://dev.to/cognix-dev"&gt;@cognix-dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Share your experiences with AI development tools - what works, what doesn't, and what you wish existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of AI-powered development isn't just about generating more code faster. It's about generating better code that developers can trust. Let's build that future together.&lt;/p&gt;




&lt;p&gt;💬 &lt;strong&gt;What do you think about the future of AI-powered developer tools? Share your thoughts in the comments!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Follow the journey&lt;/strong&gt;: Coming to PyPI on September 4th, 2025&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
