Why I am building a tool to auto-execute LLM code suggestions.

Ajay Singh — Tue, 16 Dec 2025 16:12:36 +0000

The Problem
I've wasted too many hours this week on the "AI Debug Loop":
1. Ask ChatGPT for a fix.
2. Paste code. It crashes.
3. Ask Claude for a fix.
4. Paste code. It crashes differently.
5. Ask Gemini. It invents a library that doesn't exist.
We treat LLMs like oracles, but for code they are often just confident liars.
The Idea: A "Truth Engine" for Code
I got tired of being the manual tester for these models. So, I’m working on a script that automates the verification process.
Instead of asking one model, the tool:
1. The Council: It sends your bug/prompt to GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro simultaneously.
2. The Sandbox: It spins up an isolated Docker container for each solution.
3. The Execution: It actually runs the code to check for syntax and runtime errors.
4. The Verdict: It discards the solutions that fail and gives you the code that actually ran.
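
Here is a rough sketch of how those four steps could fit together in Python. This is not the actual tool: `query_council`, `run_candidate`, `docker_command`, and the `python:3.12-slim` image are all names I picked for illustration, and each model client is assumed to be a plain function that takes a prompt and returns code. By default the candidates run in a local subprocess so the sketch works without Docker; pass `use_docker=True` for the container path.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def query_council(prompt: str, clients: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Send the same prompt to every model at once and collect each reply."""
    with ThreadPoolExecutor(max_workers=max(1, len(clients))) as pool:
        futures = {name: pool.submit(ask, prompt) for name, ask in clients.items()}
        return {name: future.result() for name, future in futures.items()}


def docker_command(script: Path, image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` call: throwaway container, no network, read-only mount."""
    return [
        "docker", "run", "--rm",
        "--network", "none",
        "-v", f"{script.parent}:/work:ro",
        image, "python", f"/work/{script.name}",
    ]


def run_candidate(code: str, use_docker: bool = False, timeout: float = 30.0) -> dict:
    """Run one candidate and report whether it exited cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        cmd = docker_command(script) if use_docker else [sys.executable, str(script)]
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # Caveat: with Docker this stops the CLI process; the container
            # itself may need separate cleanup.
            return {"passed": False, "detail": "timeout"}
        stderr = proc.stderr.strip()
        # The last stderr line is usually the exception name and message.
        detail = stderr.splitlines()[-1] if stderr else ""
        return {"passed": proc.returncode == 0, "detail": detail}
```

A clean exit only proves the code ran without crashing, not that it fixed the bug, so a real version would probably run the user's tests inside the sandbox as well.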
What the output looks like
I'm building the CLI right now, but here is the concept:

Analyzing bug in 'auth_controller.py'...

┌─────────────┬─────────────┬─────────────┐
│ Model       │ Status      │ Result      │
├─────────────┼─────────────┼─────────────┤
│ GPT-4o      │ ✅ PASSED   │ Runtime OK  │
│ Claude 3.5  │ ✅ PASSED   │ Runtime OK  │
│ Gemini 1.5  │ ❌ FAILED   │ Syntax Err  │
└─────────────┴─────────────┴─────────────┘

[Recommended Fix]: Claude 3.5 (Fastest execution time)
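
The "[Recommended Fix]" line implies a tie-breaker when more than one model passes. One possible rule, sketched below, is to pick the passing candidate with the shortest run time. The `Result` class and `recommend` function are hypothetical names, and the timings are invented just to exercise the function; they are not benchmark numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    model: str
    passed: bool
    detail: str
    seconds: float  # wall-clock time of the sandboxed run


def recommend(results: list[Result]) -> Optional[Result]:
    """Return the fastest passing result, or None if every candidate failed."""
    passed = [r for r in results if r.passed]
    return min(passed, key=lambda r: r.seconds) if passed else None


# Made-up timings, only to exercise the function; not real measurements.
sample = [
    Result("GPT-4o", True, "Runtime OK", 0.42),
    Result("Claude 3.5", True, "Runtime OK", 0.31),
    Result("Gemini 1.5", False, "Syntax Err", 0.0),
]
best = recommend(sample)
print(f"[Recommended Fix]: {best.model}" if best else "No candidate passed")
```

Execution time is a weak signal on its own, so this is only a placeholder until there is something better to rank on, such as passing the user's own tests.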
Why do this?
Because I'd rather wait 30 seconds for a verified answer than spend 10 minutes debugging a hallucination.
Want to test it?
I’m currently running this workflow manually to benchmark how often the models disagree.
If you have a bug or a snippet that AI keeps messing up:
Drop it in the comments (or DM me).
I’ll run it through the "Council" and reply with the comparison results.
I’m trying to figure out if this is worth building into a full CLI tool or SaaS. Let me know what you think!

Forem: Ajay Singh
