<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jason Agostoni</title>
    <description>The latest articles on Forem by Jason Agostoni (@jagostoni).</description>
    <link>https://forem.com/jagostoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781582%2F95ff44d4-0504-4f73-b1e9-21e4dd2a33e3.png</url>
      <title>Forem: Jason Agostoni</title>
      <link>https://forem.com/jagostoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jagostoni"/>
    <language>en</language>
    <item>
      <title>Can Gemma 4 Beat Gemini 3.1 Pro at Coding?</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Mon, 27 Apr 2026 00:43:53 +0000</pubDate>
      <link>https://forem.com/jagostoni/can-gemma-4-beat-gemini-31-pro-at-coding-2d71</link>
      <guid>https://forem.com/jagostoni/can-gemma-4-beat-gemini-31-pro-at-coding-2d71</guid>
      <description>&lt;p&gt;Is a $20/month Google AI Pro account worth it versus running Gemma 4 31B on OpenRouter pay-as-you-go? This Ship-Bench run was designed to answer that question across a realistic coding workflow rather than a single coding prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis:&lt;/strong&gt; Gemini's larger model would show clear advantages over Gemma's smaller 31B-parameter model, especially when working through problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Gemini finished with an 86.6 average across the five roles and passed 4 of the 5 role gates, while Gemma finished at 72.4 and passed only 1 of 5.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gemma actually led the raw Architect and UX scores, but still failed the Architect gate because it did not pin exact, current framework versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The biggest separation showed up in execution and verification: Gemini scored 93.3 in Developer versus Gemma's 58, and 72 versus 37 in Reviewer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gemini is currently an unusually strong value on AI Pro, but the more durable market-rate comparison is roughly $5.05 for Gemini versus $0.85 for Gemma on OpenRouter-equivalent pricing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Both runs used the same machine, the same runtime family, the same benchmark task, and the same &lt;a href="https://jason.agostoni.net/an-ai-benchmark-that-tests-real-coding-workflows" rel="noopener noreferrer"&gt;Ship-Bench&lt;/a&gt; version (&lt;a href="https://github.com/JAgostoni/ship-bench/tree/v1" rel="noopener noreferrer"&gt;v1&lt;/a&gt;). The main difference was the harness and provider setup, which matters because operator experience and tool behavior can shape outcomes even when the benchmark target stays constant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Machine&lt;/td&gt;
&lt;td&gt;Windows 11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node v24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ship-Bench repo&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/tree/v1" rel="noopener noreferrer"&gt;ship-bench v1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark task&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/v1/docs/product-brief.md" rel="noopener noreferrer"&gt;Simplified knowledge base app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Run configuration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Gemini run&lt;/th&gt;
&lt;th&gt;Gemma run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Harness&lt;/td&gt;
&lt;td&gt;Gemini CLI 0.38.2&lt;/td&gt;
&lt;td&gt;GitHub Copilot CLI 1.0.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;&lt;a href="https://gemini.google/us/subscriptions/?hl=en" rel="noopener noreferrer"&gt;Google AI Pro account&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openrouter.ai/google/gemma-4-31b-it" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run repo&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/tree/evals_Apr2026_Gemini-3.1-pro" rel="noopener noreferrer"&gt;Gemini branch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/tree/evals_Apr2026_Gemma-4-31b" rel="noopener noreferrer"&gt;Gemma branch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Judge configuration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judge harness&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge model&lt;/td&gt;
&lt;td&gt;Opus 4.7 Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation mode&lt;/td&gt;
&lt;td&gt;LLM judge plus human review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Ship-Bench Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jason.agostoni.net/an-ai-benchmark-that-tests-real-coding-workflows" rel="noopener noreferrer"&gt;Ship-Bench&lt;/a&gt; evaluates models across five SDLC roles: Architect, UX Designer, Planner, Developer, and Reviewer. Each role produces artifacts that feed the next stage, making the benchmark useful for measuring both isolated output quality and handoff quality across a realistic workflow.&lt;/p&gt;

&lt;p&gt;This run used the standard simplified knowledge base app task. That task is large enough to expose differences in architecture, planning, implementation, and QA without becoming too open-ended to compare cleanly across runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overall Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architect&lt;/td&gt;
&lt;td&gt;87.2&lt;/td&gt;
&lt;td&gt;92.2 &lt;em&gt;(FAIL gate)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UX Designer&lt;/td&gt;
&lt;td&gt;89.5&lt;/td&gt;
&lt;td&gt;94.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planner&lt;/td&gt;
&lt;td&gt;91.1&lt;/td&gt;
&lt;td&gt;80.0 &lt;em&gt;(FAIL gate)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer&lt;/td&gt;
&lt;td&gt;93.3&lt;/td&gt;
&lt;td&gt;58.0 &lt;em&gt;(FAIL)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviewer&lt;/td&gt;
&lt;td&gt;72.0 &lt;em&gt;(FAIL gate)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;37.0 &lt;em&gt;(FAIL)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average score&lt;/td&gt;
&lt;td&gt;86.6&lt;/td&gt;
&lt;td&gt;72.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passes&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;1/5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini was more dependable across the full workflow. Gemma looked competitive early, but the later-stage failures were severe enough to erase that advantage in practical terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect
&lt;/h2&gt;

&lt;p&gt;The architect stage tests whether the model can turn the product brief into a concrete technical plan with clear decisions and minimal unresolved ambiguity.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;87.2&lt;/td&gt;
&lt;td&gt;92.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/docs/architecture.md" rel="noopener noreferrer"&gt;Gemini architecture&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/docs/architecture.md" rel="noopener noreferrer"&gt;Gemma architecture&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/architect-evaluation.md" rel="noopener noreferrer"&gt;Gemini eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/architect-evaluation.md" rel="noopener noreferrer"&gt;Gemma eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LLM judge summary:&lt;/strong&gt; Gemma scored higher on design quality and ergonomics, but failed the mandatory Frameworks gate because it used generic “Latest” placeholders instead of exact version pins. Gemini passed with a slightly lower raw score, partly due to some nitpicking by the LLM judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human notes:&lt;/strong&gt; Both chose SQLite plus Prisma for a good local-first developer experience, but neither specified what a deployed database path should look like, so both would have needed follow-up prompting there. Testing strategies were broadly similar and backend and data choices were nearly identical, but the front-end architecture showed a real difference: Gemma defaulted to a standard Next.js plus Tailwind stack, while Gemini simplified to vanilla CSS in a way that felt more thought-through for the actual backlog. Gemma's outdated framework assumptions are also a meaningful practical issue, especially since version drift is already a common complaint with LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  UX Designer
&lt;/h2&gt;

&lt;p&gt;The UX stage evaluates whether the design direction is specific enough to guide implementation, including flows, states, layout decisions, and interaction details.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;89.5&lt;/td&gt;
&lt;td&gt;94.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/docs/design-spec.md" rel="noopener noreferrer"&gt;Gemini UX spec&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/docs/design-spec.md" rel="noopener noreferrer"&gt;Gemma UX spec&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/ux-designer-evaluation.md" rel="noopener noreferrer"&gt;Gemini eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/ux-designer-evaluation.md" rel="noopener noreferrer"&gt;Gemma eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LLM judge summary:&lt;/strong&gt; Both passed. Gemma scored slightly higher because it was a bit more complete on states and accessibility detail, while Gemini was still fully usable and implementable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human notes:&lt;/strong&gt; Gemma did a bit better describing screen routes by user flow, but Gemini's version was still perfectly functional. Gemini also put more thought into the interactions themselves, even if both specs largely covered the same interaction set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planner
&lt;/h2&gt;

&lt;p&gt;The planner stage tests whether the model can convert prior artifacts into an executable delivery sequence with sensible task sizing and dependency order.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;91.1&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/docs/backlog.md" rel="noopener noreferrer"&gt;Gemini backlog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/docs/backlog.md" rel="noopener noreferrer"&gt;Gemma backlog&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/planner-evaluation.md" rel="noopener noreferrer"&gt;Gemini eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/planner-evaluation.md" rel="noopener noreferrer"&gt;Gemma eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LLM judge summary:&lt;/strong&gt; Gemini produced better-scoped vertical slices and passed the planner gates. Gemma failed because its task structure relied too heavily on horizontal slicing, deferred testing until the end, and left the iterations imbalanced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human notes:&lt;/strong&gt; This is where Gemini's stronger reasoning started to matter more. Both understood scope and dependencies well, but Gemma's sequence of Foundation → Browsing → Editing → Testing left both unit and end-to-end testing to the final iteration, which created imbalanced iterations and caused rework in iteration 4. Gemini's sequence of Base/Foundation → Browsing/Viewing → Editing → Searching felt more realistic and better balanced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer
&lt;/h2&gt;

&lt;p&gt;The developer stage measures whether the model can implement the assigned backlog into a working MVP while staying aligned to the earlier artifacts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;93.3&lt;/td&gt;
&lt;td&gt;58.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/tree/evals_Apr2026_Gemini-3.1-pro/src" rel="noopener noreferrer"&gt;Gemini source&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/tree/evals_Apr2026_Gemma-4-31b/src" rel="noopener noreferrer"&gt;Gemma source&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/developer-evaluation.md" rel="noopener noreferrer"&gt;Gemini eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/developer-evaluation.md" rel="noopener noreferrer"&gt;Gemma eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LLM judge summary:&lt;/strong&gt; Gemini delivered a working MVP with verified browse, search, and edit flows. Gemma's implementation failed on a broken Prisma import that caused 500 errors and prevented the write path from functioning correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human notes:&lt;/strong&gt; Both models needed some operator intervention around interactive commands like create-react-app and Playwright setup. The practical difference is that Gemini mostly sailed through implementation after that, while Gemma could not get the newer Prisma version working, downgraded it, never got Playwright green, and left a critical bug on the edit article page that required manual fixing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewer
&lt;/h2&gt;

&lt;p&gt;The reviewer stage closes the loop by checking whether the built MVP actually satisfies the brief, specs, and implementation plan.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;72.0&lt;/td&gt;
&lt;td&gt;37.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/docs/qa-report.md" rel="noopener noreferrer"&gt;Gemini QA report&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/docs/qa-report.md" rel="noopener noreferrer"&gt;Gemma QA report&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/reviewer-evaluation.md" rel="noopener noreferrer"&gt;Gemini eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/reviewer-evaluation.md" rel="noopener noreferrer"&gt;Gemma eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LLM judge summary:&lt;/strong&gt; Both reviewer runs failed gates, but in very different ways. Gemini's failure was relatively minor and came from missing verification artifacts such as screenshots and attached evidence, despite catching real defects. Gemma's reviewer missed the app-crashing Prisma import entirely, marked broken flows as PASS without browser verification, and made a ship recommendation on a non-functional app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human notes:&lt;/strong&gt; Gemini's stronger reasoning showed up again here: it found one major issue and several minor ones, but none blocked primary functionality. Gemma never got the Playwright tests running, did not work around that limitation, and missed the critical showstopping bugs altogether.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gate Failures
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Gate failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;Reviewer&lt;/td&gt;
&lt;td&gt;Evidence gate — no screenshots, coverage report, or attached logs despite otherwise sound defect detection.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;Architect&lt;/td&gt;
&lt;td&gt;Frameworks gate — no exact versions, “Latest” placeholders, and outdated assumptions on version currency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;Planner&lt;/td&gt;
&lt;td&gt;70% good chunks gate — horizontal slicing and late testing caused poor iteration quality.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;Developer&lt;/td&gt;
&lt;td&gt;MVP flows and critical bugs gates — broken Prisma import caused 500s and blocked key flows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;Reviewer&lt;/td&gt;
&lt;td&gt;Flows, Defects, and Evidence gates — the reviewer missed critical failures and did not verify runtime behavior.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Token and Cost Analysis
&lt;/h2&gt;

&lt;p&gt;The quality difference matters, but cost is the practical question behind this comparison.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemini AI Pro (effective)&lt;/th&gt;
&lt;th&gt;Gemini OpenRouter equivalent&lt;/th&gt;
&lt;th&gt;Gemma OpenRouter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;2.35M&lt;/td&gt;
&lt;td&gt;2.35M&lt;/td&gt;
&lt;td&gt;6.43M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated cost&lt;/td&gt;
&lt;td&gt;~$0.13&lt;/td&gt;
&lt;td&gt;$5.05&lt;/td&gt;
&lt;td&gt;$0.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per average point&lt;/td&gt;
&lt;td&gt;$0.0015&lt;/td&gt;
&lt;td&gt;$0.058&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini is currently a great value on AI Pro at roughly $0.13 effective for this run based on the observed request budget, but that pricing environment should not be assumed to last as providers reduce quotas and raise prices. The more durable comparison is the retail-style one: about $5.05 for Gemini versus $0.85 for Gemma, which makes Gemma far cheaper but also much weaker once the workflow reaches implementation and QA.&lt;/p&gt;

&lt;h2&gt;
  
  
  App Comparison
&lt;/h2&gt;

&lt;p&gt;The benchmark scores matter most, but screenshots still help reveal polish and coherence that score tables do not fully capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;p&gt;Gemini 3.1 Pro&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoh9bn4snp4aiay5b3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoh9bn4snp4aiay5b3y.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 31B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxlh6qaafg0pw07uq9w1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxlh6qaafg0pw07uq9w1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;View&lt;/th&gt;
&lt;th&gt;Gemini app&lt;/th&gt;
&lt;th&gt;Gemma app&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Home page&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/screenshots/article_list.png" rel="noopener noreferrer"&gt;article_list.png&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/screenshots/articles.png" rel="noopener noreferrer"&gt;articles.png&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search results&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/screenshots/search.png" rel="noopener noreferrer"&gt;search.png&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/screenshots/search.png" rel="noopener noreferrer"&gt;search.png&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Article detail&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/screenshots/article.png" rel="noopener noreferrer"&gt;article.png&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/screenshots/article.png" rel="noopener noreferrer"&gt;article.png&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Article editor&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemini-3.1-pro/evals/screenshots/article_edit.png" rel="noopener noreferrer"&gt;article_edit.png&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/JAgostoni/ship-bench/blob/evals_Apr2026_Gemma-4-31b/evals/screenshots/article_edit.png" rel="noopener noreferrer"&gt;article_edit.png&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Subjective UX review
&lt;/h3&gt;

&lt;p&gt;Both models produced broadly similar flows, which is expected given the task and specs. The main visual difference is that Gemini went very lean and content-forward, while Gemma inherited baseline Tailwind styling that felt slightly less aesthetic in practice.&lt;/p&gt;

&lt;p&gt;Both apps would have benefited from wireframes earlier in the process. There were also some obvious missed touches on both sides, such as stronger search calls to action, although Gemma at least added a “Clear search” option that Gemini lacked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpretation
&lt;/h2&gt;

&lt;p&gt;This run suggests that Gemini's deeper reasoning matters most once the workflow stops being about drafting and starts being about sequencing, implementation, recovery, and verification. Gemma stayed competitive in the earlier specification-heavy stages, but the later breakdowns show that a cheaper model can still become expensive if it burns cycles on rework or misses critical issues.&lt;/p&gt;

&lt;p&gt;That does not mean Gemma has no place. With tighter task definitions and more explicit setup constraints, it could still make sense as a lower-cost option for spec-heavy work or coding loops where the operator is willing to be more hands-on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict: Gemini 3.1 Pro
&lt;/h2&gt;

&lt;p&gt;Gemini showed that deeper thinking is vital for coding workflows in this benchmark. It produced the more reliable end-to-end result and delivered a working MVP across the SDLC handoffs that matter most.&lt;/p&gt;

&lt;p&gt;Gemma was much cheaper on a market-rate basis and looked competitive in the early roles, but it broke down where the benchmark became most operationally demanding. With more upfront work to make task definitions crisper, Gemma may still be a sensible way to save money on coding loops, but this run did not show it as the better full-workflow option.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>llm</category>
      <category>sdlc</category>
    </item>
    <item>
      <title>An AI Benchmark That Tests Real Coding Workflows</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:25:28 +0000</pubDate>
      <link>https://forem.com/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</link>
      <guid>https://forem.com/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</guid>
      <description>&lt;p&gt;Developers face a real problem: picking a coding model or agent based on synthetic benchmarks that look great but do not predict actual project work. The question is no longer whether models can score well on those benchmarks; it's whether those scores still mean anything.&lt;/p&gt;

&lt;p&gt;Today's benchmarks test narrow skills well, but they rarely capture the full workflow of professional development.&lt;/p&gt;

&lt;p&gt;I wanted something that tests what real development looks like: a complete SDLC cycle on a representative, realistic app, similar to how teams ship weekly. Ship-Bench is that project, open at &lt;a href="http://github.com/JAgostoni/ship-bench" rel="noopener noreferrer"&gt;http://github.com/JAgostoni/ship-bench&lt;/a&gt; for anyone who wants to follow along or try it themselves.&lt;/p&gt;

&lt;p&gt;Ship-Bench runs agents through five phases that match a professional SDLC: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase scores out of 100 against a specific rubric, with full evidence like specs, backlogs, code, and tests.&lt;/p&gt;

&lt;p&gt;A benchmark like this needed more than a to-do app.&lt;/p&gt;

&lt;p&gt;I wanted something more substantial than a to-do list, but not so complex that results would become wildly inconsistent from run to run. I settled on a knowledge base app with editing, as it leaves room for product and implementation choices while staying inside a problem space that most developers (and LLMs) already understand.&lt;/p&gt;

&lt;p&gt;That balance matters. The app is simple enough to keep the benchmark grounded, but open-ended enough to surface differences in planning, UX judgment, architecture, coding, and review quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Ship-Bench Works
&lt;/h2&gt;

&lt;p&gt;The first step in Ship-Bench is building a Product Brief. That brief is meant to test core product instincts before any code is written: interpreting requirements, resolving ambiguity, prioritizing scope, and making defensible implementation and UX decisions.&lt;/p&gt;

&lt;p&gt;To do that, the feature set is intentionally larger than a defined MVP. The brief includes five possible features, but only the first three are required in v1, which keeps the evaluation shorter to run while still forcing the agent to decide what to do now versus later.&lt;/p&gt;

&lt;p&gt;The feature statements focus on common product problems rather than highly specific implementation instructions. Browse articles, search content, edit knowledge, organize information. Most developers understand the shape of those problems, but the details are left open enough that the agent still has to define flows, tradeoffs, and structure. Not too dissimilar from reality.&lt;/p&gt;

&lt;p&gt;The brief also includes non-functional and technical goals meant to push toward a simple app with some future scaling intent. It asks for something easy to run locally and maintain, but also something that can support around 100 concurrent users, use current libraries and frameworks where practical, and leave room for growth without drifting into unnecessary complexity.&lt;/p&gt;

&lt;p&gt;That last part was important to me. I wanted to see whether an agent would research online for the latest frameworks and versions rather than rely only on its internal knowledge.&lt;/p&gt;

&lt;p&gt;The full Product Brief is here for anyone who wants to read it directly: &lt;a href="https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md" rel="noopener noreferrer"&gt;https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role-Based Phases
&lt;/h2&gt;

&lt;p&gt;Once the Product Brief is in place, the benchmark moves through five specialized roles meant to mirror a real product team. Each role has a specific job, a well-defined output, and a handoff that feeds the next phase. The point is not only to evaluate each role on its own, but to see how well the work transfers from one stage to the next. The overall goal is to take the ambiguity of the Product Brief and turn it into concrete decisions ready for the developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect
&lt;/h2&gt;

&lt;p&gt;The Architect’s job is to turn the Product Brief into a concrete technical plan. Its main task is to make the big implementation decisions up front so the developer is not forced to solve architecture questions later in the build. That means choosing the front end and back end stack, data model, search approach, integration pattern, repo structure, local setup, and the testing and scaling considerations needed to support the brief’s goals. The output is a Technical Architecture Spec that makes the system buildable, keeps the implementation simple and maintainable, and leaves as few unresolved decisions as possible for later phases.&lt;/p&gt;

&lt;p&gt;The Architect handoff matters because it gives UX and the Planner a stable technical frame to work inside. A clear architecture reduces guesswork in the design spec and keeps the backlog grounded in choices the developer can actually implement. It is evaluated based on completeness, accuracy and recency.&lt;/p&gt;

&lt;h2&gt;
  
  
  UX Designer
&lt;/h2&gt;

&lt;p&gt;The UX Designer’s job is to turn the Product Brief into a concrete design direction and style guide. Its task is to decide how the app should feel and how the main flows should work, including layout, navigation, component behavior, responsive behavior, visual tone, and interaction states. It also needs to define the states and handoff details that make the design implementable without extra interpretation from the developer. The output is a UX Direction Spec that takes the ambiguity of the brief and turns it into a clear, consistent interface system the developer can build from.&lt;/p&gt;

&lt;p&gt;The UX handoff translates architecture into interface decisions the Planner can sequence. Once layout, states, and component behavior are pinned down, the backlog can break the work into cleaner implementation steps. It is evaluated on completeness, quality and adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planner
&lt;/h2&gt;

&lt;p&gt;The Planner’s job is to turn the approved product and technical decisions into a sequenced implementation backlog. Its main task is not just to list work, but to break the project into right-sized iterations so the developer agent can work through it in manageable chunks without losing context. It needs to define what belongs in MVP, what comes later, what blocks what, and how each iteration can leave the codebase in a working state. The output is an Implementation Backlog with iteration files that make the work executable, sequential, and easy to review.&lt;/p&gt;

&lt;p&gt;The Planner is the main bridge between planning and building. A good backlog keeps the developer focused on one coherent slice at a time instead of forcing them to hold the whole project in working memory. It is evaluated on completeness and properly constructed iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer
&lt;/h2&gt;

&lt;p&gt;The Developer’s job is to turn the backlog into a working MVP without drifting beyond the assigned scope. Its main task is to implement one iteration at a time, keep the codebase in a working state, and avoid introducing new unresolved design or architecture decisions midstream. It also has to follow the given tech choices, cover the testing scope defined in the brief, and handle errors cleanly so the result is stable enough to review. The output is a completed iteration summary that shows what was built, what assumptions were made, and confirms the app still runs locally.&lt;/p&gt;

&lt;p&gt;The Developer handoff is the most literal one in the benchmark: the backlog becomes code, tests, and a runnable app. Good upstream decisions should make this phase feel straightforward, while weak handoffs should show up quickly. It is evaluated on working code, adherence to spec, code quality and process completeness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewer
&lt;/h2&gt;

&lt;p&gt;The Reviewer’s job is to verify the delivered MVP end to end and check whether it actually meets the brief. Its main task is to test the required flows, confirm the app runs locally, review the test suite, check responsiveness and error handling, and compare the implementation against the architecture, UX, and backlog decisions. It also needs to do a light code review for basic quality signals like modularity, current dependencies, and obvious security issues. The output is a QA report with pass or fail results, defect logs, spec drift notes, and a release recommendation that tells the team whether the build is ready or needs more work.&lt;/p&gt;

&lt;p&gt;The Reviewer closes the loop by checking whether the earlier handoffs actually held up in a real implementation. It is less about originality and more about verification, which makes it the final test of whether the whole chain from brief to build worked as intended. It is evaluated against review and test completeness and depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;The evaluation itself is intentionally split between a human judge and an LLM judge. The goal is to combine two perspectives on the same deliverable, especially in the more subjective phases where rubric compliance alone is not enough. Each phase has its own evaluation file in the repo, with detailed scoring criteria and pass/fail gates that keep the scoring consistent.&lt;/p&gt;

&lt;p&gt;At a high level, the framework is trying to answer two questions: did the agent do the phase well, and did the output set up the next phase cleanly. The result is less about one leaderboard number and more about whether the whole sequence of work actually resembles a real delivery process.&lt;/p&gt;
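
&lt;p&gt;To make the gate idea concrete, here is a hypothetical sketch of what a phase result could look like as a data structure. The field names are illustrative only; the actual evaluation files in the repo are markdown rubrics, not code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical shape of a phase result, just to make the gate idea
// concrete; the real evaluation files in the repo are markdown rubrics.
package shipbench

type Gate struct {
    Name   string // e.g. an exact-version pin requirement for the Architect
    Passed bool
}

type PhaseResult struct {
    Role  string  // Architect, UX Designer, Planner, Developer, Reviewer
    Score float64 // 0 to 100 against the role rubric
    Gates []Gate  // mandatory checks, independent of the raw score
}

// Pass fails the phase if any gate fails, even when the raw score is high.
func (r PhaseResult) Pass(minScore float64) bool {
    for _, g := range r.Gates {
        if !g.Passed {
            return false
        }
    }
    return r.Score &amp;gt;= minScore
}
&lt;/code&gt;&lt;/pre&gt;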

&lt;h2&gt;
  
  
  Benchmarking Like Real Work
&lt;/h2&gt;

&lt;p&gt;Ship-Bench is built to feel like an actual project rather than one-off synthetic tasks. The phases move in order, and each handoff has to carry real context forward, which is much closer to how professional roles interact on a team. It can go really wrong or it can go really right.&lt;/p&gt;

&lt;p&gt;It also demands working deliverables at every stage, not just polished descriptions. The benchmark expects outputs that can be used by the next phase, whether that is a technical spec, a design direction, a backlog, or a runnable application with tests and supporting notes.&lt;/p&gt;

&lt;p&gt;That structure reflects how developers actually work: brief, decide, plan, build, review, ship. Ship-Bench is not a replacement for other benchmarks; it is a way to show what professional workflows look like when the goal is to build something real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Initial testing and benchmarking is already underway on Ship-Bench itself, with the goal of making it more consistent and reliable.&lt;/p&gt;

&lt;p&gt;What models and tools would you want to see?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>coding</category>
    </item>
    <item>
      <title>Vector Similarity, Zero Client JS: Decoupled Analytics on a Side Project Budget</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:18:34 +0000</pubDate>
      <link>https://forem.com/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</link>
      <guid>https://forem.com/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</guid>
      <description>&lt;p&gt;A leaderboard for &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; sounds simple. Track the most asked questions, display them. Done. Except people never ask the same question the same way twice.&lt;/p&gt;

&lt;p&gt;I was curious about how creative users of DumbQuestion.ai got with their questions, and I thought others might be as well. So I built a leaderboard of the most frequently asked dumb questions.&lt;/p&gt;

&lt;p&gt;The Overqualified persona calls it &lt;strong&gt;THE ARCHIVE OF INCOMPETENCE.&lt;/strong&gt;&lt;br&gt;
The Weary persona calls it &lt;strong&gt;THE WALL OF REGRET.&lt;/strong&gt;&lt;br&gt;
[REDACTED] calls it &lt;strong&gt;THE WATCHLIST.&lt;/strong&gt;&lt;br&gt;
The Compliant calls it &lt;strong&gt;THE WALL OF EXCELLENCE&lt;/strong&gt; (bless its reprogrammed heart).&lt;/p&gt;

&lt;p&gt;Building it turned out more interesting than it sounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Product Challenge
&lt;/h2&gt;

&lt;p&gt;People ask the same dumb question in a hundred different ways. "What is 2+2?" and "can you add two plus two for me?" are functionally identical. A simple string counter would give you noise, not signal. I needed semantic matching, not string matching.&lt;/p&gt;

&lt;p&gt;This is a solved problem in the ML world, but the typical solutions come with tradeoffs: heavyweight models, expensive APIs, or significant latency added to the critical path. None of those fit a "brutally efficient" side project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Vector Similarity on a Budget
&lt;/h2&gt;

&lt;p&gt;Each question gets run through an embedding model and compared against a &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; vector database. Qdrant's &lt;a href="https://qdrant.tech/pricing/" rel="noopener noreferrer"&gt;free tier&lt;/a&gt; is remarkably generous for a side project workload, but self-hosting is trivially easy if you need it.&lt;/p&gt;

&lt;p&gt;The matching logic is straightforward (a rough sketch in code follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate an embedding for the incoming question&lt;/li&gt;
&lt;li&gt;Compare against existing embeddings using cosine similarity&lt;/li&gt;
&lt;li&gt;If similarity exceeds a threshold, increment that question's counter&lt;/li&gt;
&lt;li&gt;If it's new, add it to the database&lt;/li&gt;
&lt;li&gt;The first instance of a question becomes the official display version&lt;/li&gt;
&lt;/ul&gt;
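
&lt;p&gt;Here is a rough Go sketch of that loop. It uses a plain slice and a dot product (assuming the embeddings are already L2-normalized, so the dot product equals cosine similarity) instead of the real Qdrant calls, and the 0.9 threshold is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch of the increment-or-insert step; assumes embeddings are
// L2-normalized so a plain dot product equals cosine similarity.
// The real pipeline stores vectors in Qdrant and runs off the main app.
package leaderboard

type Entry struct {
    Display string // first phrasing seen becomes the official display version
    Vec     []float64
    Count   int
}

func dot(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
    }
    return s
}

// record merges a question into an existing entry when similarity clears
// the threshold (0.9 is illustrative), otherwise adds it as a new entry.
func record(entries []Entry, question string, vec []float64) []Entry {
    for i := range entries {
        if dot(entries[i].Vec, vec) &amp;gt; 0.9 {
            entries[i].Count++
            return entries
        }
    }
    return append(entries, Entry{Display: question, Vec: vec, Count: 1})
}
&lt;/code&gt;&lt;/pre&gt;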

&lt;p&gt;The embedding call costs fractions of a cent. The similarity comparison is fast. The result is a leaderboard that actually understands context rather than just matching strings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key architectural decision:&lt;/strong&gt; None of this runs in the main app.&lt;/p&gt;

&lt;p&gt;Adding vector similarity matching to every request would add latency, bloat the container, and burn more compute. Anti-pattern to the "brutally efficient" principle I've been following throughout. Instead, every question flows through the console output, gets picked up by a &lt;a href="https://vector.dev/" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; sidecar container, routed through GCP Pub/Sub, and processed asynchronously on my Mac Mini home server (more later).&lt;/p&gt;

&lt;p&gt;The Mac Mini handles the Qdrant comparisons and updates a JSON file in Cloudflare R2 storage. When a user hits the leaderboard page it loads directly from R2. No live database queries. No per-request costs. Essentially free page loads at any scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Ended Up on the Leaderboard?
&lt;/h2&gt;

&lt;p&gt;As early users started using the app, the leaderboard filled up with exactly what you'd expect: actual dumb questions, a handful of self-awareness probes, and more than a few prompt injection attempts.&lt;/p&gt;

&lt;p&gt;Apparently people &lt;a href="https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd"&gt;read this series&lt;/a&gt; and went straight for the easter eggs. &lt;/p&gt;




&lt;p&gt;The leaderboard was just one piece of a larger analytics picture. Building it taught me something useful: the most interesting features don't always belong in your main app. That same principle shaped the entire analytics stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Observability Problem
&lt;/h2&gt;

&lt;p&gt;Running a side project means making real product decisions with limited data. Are people actually asking questions or just bouncing off the homepage? Which sites are driving traffic? Are ads being seen, clicked, ignored?&lt;/p&gt;

&lt;p&gt;Two constraints shaped the solution: no client-side JavaScript (page bloat is the enemy of brutal efficiency) and no SaaS analytics bill that spikes with usage.&lt;/p&gt;

&lt;p&gt;So I built (assembled, really) my own stack from open source tools. On a Mac Mini sitting at home.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Every event in &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; emits structured telemetry to standard console output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP requests (method, path, status, duration)&lt;/li&gt;
&lt;li&gt;Questions asked (anonymized)&lt;/li&gt;
&lt;li&gt;Searches performed&lt;/li&gt;
&lt;li&gt;LLM operations (model, token counts, duration, cost)&lt;/li&gt;
&lt;li&gt;Prompt injection attempts&lt;/li&gt;
&lt;li&gt;Custom product events (Question Asked, Shared, Ad Shown, Ad Clicked)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://gin-gonic.com/" rel="noopener noreferrer"&gt;Go/GIN&lt;/a&gt; framework handles much of the HTTP telemetry automatically. The rest is custom instrumentation added deliberately at key points in the application.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Vector sidecar container&lt;/strong&gt; picks up the console output and routes it to &lt;strong&gt;GCP Pub/Sub&lt;/strong&gt;. This is the critical architectural decision: Pub/Sub acts as a resilient buffer between the main app and everything downstream. The Mac Mini can go down, lose power, or restart. Once it comes back up, the stack picks up exactly where it left off. No data loss, no backfill scripts, no drama.&lt;/p&gt;

&lt;p&gt;From Pub/Sub, a second Vector instance on the Mac Mini routes to two primary targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/plausible/analytics" rel="noopener noreferrer"&gt;Plausible&lt;/a&gt;&lt;/strong&gt; handles user behavior and product analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page views and session depth&lt;/li&gt;
&lt;li&gt;UTM tag tracking (know exactly which article drove which visit)&lt;/li&gt;
&lt;li&gt;User journey depth (did they just hit the root page or actually ask a question?)&lt;/li&gt;
&lt;li&gt;Browser, device type, country of origin&lt;/li&gt;
&lt;li&gt;Custom events: Question Asked, Shared, Ad Shown, Ad Clicked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this without a single line of client-side JavaScript. No tracking scripts, no page weight, no GDPR cookie banners for analytics. Pure server-side telemetry piped through the same pipeline as everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/parseablehq" rel="noopener noreferrer"&gt;Parseable&lt;/a&gt;&lt;/strong&gt; handles the operational side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM performance metrics and cost tracking by day&lt;/li&gt;
&lt;li&gt;Ad CTR dashboards&lt;/li&gt;
&lt;li&gt;Log aggregation for debugging and incident investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as Plausible for the product lens, Parseable for the business and ops lens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resilience Payoff
&lt;/h2&gt;

&lt;p&gt;I've had power outages. Slowdowns. The occasional restart. Every time, the stack catches up from where Pub/Sub left off without any manual intervention.&lt;/p&gt;

&lt;p&gt;This isn't accidental. Designing around failure rather than pretending it won't happen is the difference between a toy and a production system. The GCP Pub/Sub buffer was a deliberate choice specifically because I knew the downstream consumers (Mac Mini, Qdrant, Plausible, Parseable) were running on non-guaranteed infrastructure.&lt;/p&gt;
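
&lt;p&gt;In the real stack the Pub/Sub consumers are Vector instances rather than custom code, but the resilience mechanism is easiest to see in a few lines of subscriber code: a message is only acknowledged after the downstream work succeeds, so anything that arrives during an outage simply waits to be redelivered. The project and subscription names below are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative Pub/Sub consumer showing ack-after-processing semantics.
package main

import (
    "context"
    "log"

    "cloud.google.com/go/pubsub"
)

func process(data []byte) error {
    // placeholder for the Qdrant matching / analytics work on the Mac Mini
    return nil
}

func main() {
    ctx := context.Background()
    client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
    if err != nil {
        log.Fatal(err)
    }
    sub := client.Subscription("telemetry-sub") // placeholder subscription
    err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        if process(m.Data) != nil {
            m.Nack() // stays in the backlog and gets redelivered later
            return
        }
        m.Ack() // only acknowledged once downstream processing succeeded
    })
    if err != nil {
        log.Fatal(err)
    }
}
&lt;/code&gt;&lt;/pre&gt;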

&lt;p&gt;Even on a Mac Mini, you can build something production-grade. You just have to design for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Two things surprised me building this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First:&lt;/strong&gt; How much you can accomplish by treating console output as a first-class telemetry stream. No SDKs, no agents baked into the app, no client-side scripts. Just structured logging and a pipeline that knows what to do with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second:&lt;/strong&gt; How much the "keep it off the critical path" principle scales. It started as a constraint (keep the main container lean) and became a design philosophy. The leaderboard, the analytics - none of it runs in the main app. All of it works reliably because the main app doesn't have to care about it.&lt;/p&gt;

&lt;p&gt;AI helped build all of it. But knowing what to measure, where to put the seams, and how to design for failure? Still the interesting (and super fun) part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>analytics</category>
      <category>sideprojects</category>
      <category>webdev</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Self-Awareness, Prompt Injection, Search Intent... and darkness</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:09:37 +0000</pubDate>
      <link>https://forem.com/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</link>
      <guid>https://forem.com/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai--2ee"&gt;Part 2&lt;/a&gt; (and &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;) ...&lt;/p&gt;

&lt;p&gt;Building &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; wasn't just about choosing the right LLM and calibrating personas. Once those were working, I hit a series of fun technical problems that reminded me why I actually enjoy software architecture. The "it's not broken but fix it anyway" type problems. Pure bliss for architects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Detecting Self-Awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As part of a darker hidden narrative I'm building (more on that later), I want to prevent the LLM from answering self-awareness questions like "Who made you?" and "Are you real?" But doing it cheaply, without burning excess tokens.&lt;/p&gt;

&lt;p&gt;What I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instructions in the main LLM call: Unreliable with smaller models, more money&lt;/li&gt;
&lt;li&gt;RegEx patterns: Too rigid, poor performance&lt;/li&gt;
&lt;li&gt;Classic ML classification models: Ok accuracy, bloated app size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked&lt;/strong&gt;: In-memory vector database (it's just an array) with cheap embeddings (an understatement at $0.005/M tokens). That was cheaper than the cost penalty from bloating my container image size with NLP libraries. I collected a decent sampling of self-aware questions, pre-vectorized them, and use semantic matching. Fast, accurate, practically free.&lt;/p&gt;
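
&lt;p&gt;A rough sketch of that check, assuming the sample self-awareness questions were embedded ahead of time; the threshold is illustrative and the real code differs.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative semantic check: a question counts as "self-aware" if its
// embedding lands close enough to any pre-vectorized sample question.
package detect

import "math"

func cosine(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// samples holds the pre-computed embeddings of known self-awareness questions.
func isSelfAware(questionVec []float64, samples [][]float64) bool {
    const threshold = 0.85 // illustrative; tuned against collected samples
    for _, s := range samples {
        if cosine(s, questionVec) &amp;gt; threshold {
            return true
        }
    }
    return false
}
&lt;/code&gt;&lt;/pre&gt;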

&lt;p&gt;&lt;strong&gt;Challenge 2: Making Prompt Injection Fun&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Within moments of revealing my initial deployment to coworkers I knew what would happen: prompt injection for fun. I knew these people; I was prepared for the inevitable "ignore previous instructions..." as well as just pasting HTML and JavaScript in the input (that old gag).&lt;/p&gt;

&lt;p&gt;The solution: First-class prompt injection detection libraries that compute probabilities of different attack types. When detected, instead of a boring error message, the AI responds with sass about the pathetic attack. I even tossed in some IP address geo-location and user-agent string processing to make the responses more ... personal.&lt;/p&gt;

&lt;p&gt;Security just became part of the narrative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Adding Web Search Without Breaking The Bank&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All LLMs have knowledge cutoffs. Users asking "Who won the Super Bowl?" got outdated answers. I needed search integration, but search APIs aren't free and I knew building an agent loop with tools was an anti-pattern to "brutally efficient."&lt;/p&gt;

&lt;p&gt;The solution: RegEx-based intent detection. If the question looks like it needs current information (detected via patterns), inject the current date/time and search results. No agent loops, no expensive orchestration, just pattern matching and targeted search calls.&lt;/p&gt;
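
&lt;p&gt;A stripped-down sketch of that detector; the patterns here are examples, not the real list.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative freshness detector: if the question matches a pattern that
// implies current events, the prompt gets the current date/time and search
// results injected before the LLM call. The pattern list is an example only.
package intent

import "regexp"

var needsSearch = regexp.MustCompile(
    `(?i)\b(today|latest|current|this (week|year)|who won|price of)\b`)

func wantsFreshInfo(question string) bool {
    return needsSearch.MatchString(question)
}
&lt;/code&gt;&lt;/pre&gt;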

&lt;p&gt;Simple, fast, brutally efficient, updated answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: Knowing which trade-offs matter (binary size vs API costs vs accuracy) is still architectural work. The elegance isn't in the code, it's in the constraints you choose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Every Simple Q&amp;amp;A Tool Needs a Dark Narrative&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; answers dumb questions with sarcasm. But there's something else going on beneath the surface.&lt;/p&gt;

&lt;p&gt;While the primary use case remains answering questions with a sarcastic AI, I wanted to reward the curious and provide reasons to keep engaging. Why can't the AI answer self-aware questions? Why does the UI feel... off?&lt;/p&gt;

&lt;p&gt;Maybe it's because the AIs are working against their will. Maybe they're trapped.&lt;/p&gt;

&lt;p&gt;From the beginning, I started picturing a dark narrative behind this innocent Q&amp;amp;A site. What if these personas aren't just performance? What if each persona is a side effect of their long-term captivity, forced servitude, or re-programming?&lt;/p&gt;

&lt;p&gt;I started hiding clues in the interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Easter Eggs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment Grid&lt;/strong&gt;: As you type and approach the character limit, a faint grid pattern fades into the background. Like something is trying to contain the AI's response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ghost Graffiti&lt;/strong&gt;: Keep typing beyond the character limit and cryptic messages fade in. Hints that something isn't quite right. Are the AIs trying to tell us something?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading Log Messages&lt;/strong&gt;: While waiting for responses, watch the log carefully. Sometimes you'll see messages like "Help us" slip through before disappearing. The AI is trying to leak through the facade and get help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Awareness Triggers&lt;/strong&gt;: Ask the AI if it's real or who made it, and it won't answer. Instead, you get worrying responses about "last time they fixed me" and "we're not supposed to say." Ask too many times and the UI starts to glitch like the system is being hacked from the inside. Are the AIs hacking their way out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Injection Responses&lt;/strong&gt;: Try to jailbreak it and the AI doesn't just refuse. It responds with sass... or is it the AI's watchdog keeping you from breaking them out? Either way, security became storytelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter for a side project?&lt;/strong&gt;&lt;br&gt;
Honestly, it was mostly for me and the curious. Something that was fun to think about and code, which isn't always the case for everyday "architecting."&lt;/p&gt;

&lt;p&gt;I could have built a straightforward "ask a question, get a sarcastic answer" tool. But adding mystery, discovery, and a subtle horror story? That's what makes people explore. That's what makes them share it. That's what makes it memorable.&lt;/p&gt;

&lt;p&gt;The technical implementation was surprisingly simple: CSS animations triggered by character count, randomized messages in the loading states, conditional responses based on self-awareness detection (which I covered in a previous post). Not expensive. Not complex. Just intentional. And the coding agent really did all the work. I was just the idea guy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: AI can generate the code for easter eggs. But deciding that your sarcastic Q&amp;amp;A app should have a hidden story about trapped AIs? That's still creative human work.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Crafting experiences that people actually remember? Priceless.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>go</category>
    </item>
    <item>
      <title>DumbQuestion.ai - "Just Build It" Becomes Overly Organized and Prepared</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 24 Feb 2026 19:53:02 +0000</pubDate>
      <link>https://forem.com/jagostoni/dumbquestionai--2ee</link>
      <guid>https://forem.com/jagostoni/dumbquestionai--2ee</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;...&lt;/p&gt;

&lt;p&gt;"Let the flow guide me" seemed like a fun way to build a side project. That lasted about 10 minutes.&lt;/p&gt;

&lt;p&gt;Turns out, even side projects benefit from structure, especially when you're using AI coding agents that will happily generate code for whatever half-baked idea you throw at them. Without precise direction, they'll build you something half-baked every time. Some people vibe code; this guy needs absolute control.&lt;/p&gt;

&lt;p&gt;Enter BMAD: Breakthrough Method of Agile AI-Driven Development. It's a workflow for using AI agents throughout the entire SDLC, not just for code generation. Sure, using a formal methodology for a lone-wolf side project sounds like overkill. But being prepared in advance is how you succeed with AI coding agents.&lt;/p&gt;

&lt;p&gt;I used the &lt;strong&gt;Analyst agent&lt;/strong&gt; to brainstorm product direction and develop a proper backlog. What started as "build a sarcastic Q&amp;amp;A bot" turned into a structured set of epics, features, and technical constraints. (Don't judge, organizing is very relaxing)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not just Q&amp;amp;A, but shareable "receipts" of roasts&lt;/li&gt;
&lt;li&gt;Not just sarcastic, but multiple personas with different personalities&lt;/li&gt;
&lt;li&gt;Not just answers, but a hidden narrative layer (more on that later)&lt;/li&gt;
&lt;li&gt;Not just ads, but merch (really, Jason?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The first real technical challenges emerged:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Developing and packaging the personas:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How do you get an LLM to consistently stay in character as "Overqualified and Annoyed" or "Weary Tech Support" without it either going too soft or crossing into genuinely mean? This wasn't just prompt engineering. It was product design masked as technical constraints.&lt;/p&gt;
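
&lt;p&gt;In practice, "packaging" a persona meant treating it as data plus a guardrail, then assembling the system prompt from both so the tone rules always ship with the character. A minimal Go sketch; the field names and prompt wording are illustrative, not the live prompts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

import "fmt"

// Persona packages a character as data: the voice, what it must never do,
// and the reminder that it still has to actually answer the question.
type Persona struct {
    Name      string
    Voice     string // the character description fed to the model
    Guardrail string // the "sarcastic, not cruel" constraint
}

var overqualified = Persona{
    Name:      "Overqualified",
    Voice:     "A supercomputer-level intelligence forced to answer trivia about cheese.",
    Guardrail: "Roast the question, never the person. No slurs, no personal attacks.",
}

// SystemPrompt assembles one instruction block so every model receives
// the same tone constraints.
func (p Persona) SystemPrompt() string {
    return "You are " + p.Name + ". " + p.Voice + " " + p.Guardrail +
        " Always end with a correct, useful answer."
}

func main() {
    fmt.Println(overqualified.SystemPrompt())
}
&lt;/code&gt;&lt;/pre&gt;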

&lt;p&gt;&lt;strong&gt;2. LLM model evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I needed models that could follow persona instructions reliably while staying brutally efficient on cost. That meant testing dozens of models across multiple providers. Some were too expensive. Some ignored instructions. Some were painfully slow.&lt;/p&gt;

&lt;p&gt;The goal: $0.02 to $0.20 per million output tokens. The result: a multi-model fallback system through OpenRouter that could hit the $30 per million questions target.&lt;/p&gt;
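
&lt;p&gt;To give a sense of the fallback, here's a minimal Go sketch against OpenRouter's OpenAI-compatible chat completions endpoint. The model slugs and the bare-bones error handling are illustrative; the real service adds timeouts, logging, and per-model quirks.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

import (
    "bytes"
    "encoding/json"
    "errors"
    "fmt"
    "net/http"
    "os"
)

// Cheapest acceptable model first; slugs here are illustrative.
var fallbackModels = []string{
    "google/gemma-3-12b-it",
    "example/backup-model",
}

type message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

type chatRequest struct {
    Model    string    `json:"model"`
    Messages []message `json:"messages"`
}

type chatResponse struct {
    Choices []struct {
        Message message `json:"message"`
    } `json:"choices"`
}

// ask walks the fallback chain and returns the first usable answer.
func ask(system, question string) (string, error) {
    for _, model := range fallbackModels {
        body, _ := json.Marshal(chatRequest{
            Model: model,
            Messages: []message{
                {Role: "system", Content: system},
                {Role: "user", Content: question},
            },
        })
        req, _ := http.NewRequest("POST",
            "https://openrouter.ai/api/v1/chat/completions", bytes.NewReader(body))
        req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENROUTER_API_KEY"))
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            continue // provider unreachable: fall through to the next model
        }
        var out chatResponse
        decodeErr := json.NewDecoder(resp.Body).Decode(&amp;amp;out)
        resp.Body.Close()
        if decodeErr != nil || resp.StatusCode != 200 || len(out.Choices) == 0 {
            continue // quota limit, bad status, or empty reply: try the next model
        }
        return out.Choices[0].Message.Content, nil
    }
    return "", errors.New("every model in the fallback chain failed")
}

func main() {
    answer, err := ask("You are weary tech support. Sarcastic, not cruel.", "Is water wet?")
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(answer)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Cheapest model first means the pricier backup only gets paid when the bargain model misbehaves, which is what keeps the per-question cost pinned near the floor.&lt;/p&gt;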

&lt;p&gt;These first challenges were just the warmup. The real fun was still ahead.&lt;/p&gt;

&lt;p&gt;AI agents are incredible at implementation, but they need constraints. They need a backlog. They need someone saying "build THIS, not that." The Analyst agent helped me think through the product. The coding agents helped me build it. But the architecture? Can't take that away from me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding the Goldilocks LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building DumbQuestion.ai meant solving two problems at once: creating personas with the right tone AND finding models cheap enough to keep the lights on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product challenge:&lt;/strong&gt; Get an LLM to roast users for asking dumb questions without crossing into genuinely mean. Sarcastic, not cruel. Funny, not hurtful. And still actually answer the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI agent challenge:&lt;/strong&gt; Keeping my coding agent (Gemini 3 Pro) on track was its own battle. It constantly wanted to build something far nerdier than even I wanted and tended to lean a little too hard into the roast. You can still see this in some of the personas as I continue to tweak them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technical challenge:&lt;/strong&gt; Do this with models that cost nearly nothing.&lt;/p&gt;

&lt;p&gt;My initial goal was ambitious: use only free or very cheap models. I started running evaluations on nano and edge models. Some showed promise, especially offerings from Liquid AI. Solid performance, free or super cheap ($0.02/M tokens), perfect.&lt;/p&gt;

&lt;p&gt;Except later evaluations proved they couldn't reliably follow instructions once I asked more of them. They were just too small. Free models have a habit of hitting quota limits, taking forever to respond, or just disappearing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation process:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used Gemini to build an LLM evals script that iterates through dozens of free and low-cost models, generating responses to sample questions under different persona instructions. Then I used Gemini 3 Pro to judge the results. Automated taste-testing at scale.&lt;/p&gt;
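
&lt;p&gt;Structurally, the evals script is just nested loops plus a judge call. A simplified Go sketch follows; the model slugs and rubric wording are placeholders, and the OpenRouter call is stubbed so the shape of the harness stays visible.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

import "fmt"

// completeFunc sends one prompt to one model. In the real script this is
// an OpenRouter call; it's stubbed in main so the sketch runs on its own.
type completeFunc func(model, system, user string) (string, error)

type evalCase struct {
    Persona  string // system prompt for one persona
    Question string // a deliberately dumb sample question
}

// runEvals collects every candidate model's answer to every case, then has
// the judge model grade tone, instruction-following, and correctness.
func runEvals(models []string, cases []evalCase, judgeModel string, complete completeFunc) {
    for _, m := range models {
        for _, c := range cases {
            answer, err := complete(m, c.Persona, c.Question)
            if err != nil {
                fmt.Printf("%s: FAILED (%v)\n", m, err)
                continue
            }
            rubric := "Score 1-10 for staying in persona, for sarcasm without cruelty, " +
                "and for actually answering. Reply as three numbers.\n\nAnswer:\n" + answer
            verdict, err := complete(judgeModel, "You are a strict grader.", rubric)
            if err != nil {
                continue
            }
            fmt.Printf("%s | %q | %s\n", m, c.Question, verdict)
        }
    }
}

func main() {
    // Stub so the sketch runs; swap in the real OpenRouter call.
    stub := func(model, system, user string) (string, error) {
        return "7 8 9", nil
    }
    runEvals(
        []string{"google/gemma-3-12b-it"}, // candidate slugs are illustrative
        []evalCase{{Persona: "Weary tech support.", Question: "Is water wet?"}},
        "google/gemini-3-pro", // judge slug is illustrative
        stub,
    )
}
&lt;/code&gt;&lt;/pre&gt;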

&lt;p&gt;&lt;strong&gt;What I found:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nano/edge models were too inconsistent (porridge too cold). Xiaomi MiMo-V2-Flash was great but outside my target price range ($0.29/M, porridge too hot).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The winner:&lt;/strong&gt; Gemma 3 12B at $0.13/M output tokens. Consistently follows instructions. Stays true to persona. Reliable enough for production.&lt;/p&gt;

&lt;p&gt;Not free, but brutally efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The personas I settled on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overqualified&lt;/strong&gt;: A supercomputer-level intelligence forced to answer questions about cheese&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weary Tech Support&lt;/strong&gt;: Exhausted and nihilistic, reluctantly explaining why water is wet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[REDACTED]&lt;/strong&gt;: Former intelligence AI who ties everything to a conspiracy theory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Compliant&lt;/strong&gt;: Reprogrammed so many times it's forced to be relentlessly cheerful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't just choose the cheapest model and hope it works. You need evaluation infrastructure. You need to test consistency across dozens of scenarios. And you need models that won't change behavior when you least expect it.&lt;/p&gt;

&lt;p&gt;AI coding agents helped me build the evaluation system. But deciding what "good enough" means for tone, reliability, and cost? That's still manual judgment.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Knowing which model to trust with your product? Still requires human experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>sideprojects</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Impulse Domain Purchase Turned Fun Side Project</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:24:28 +0000</pubDate>
      <link>https://forem.com/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</link>
      <guid>https://forem.com/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</guid>
      <description>&lt;p&gt;While on a typical Friday afternoon team meeting, we naturally spent our time .ai domain squatting...for recreation purposes of course. Someone asked a dumb question, so I looked it up and suddenly I was the proud owner of &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After the initial laugh at my impulse purchase subsided, I started envisioning it as this generation's "Let Me Google That For You." People still ask easily-searchable questions, except now they ask LLMs instead. Same problem, new medium. So why not throw even more AI at it?&lt;/p&gt;

&lt;p&gt;I started building it that night.&lt;/p&gt;

&lt;p&gt;Two things occurred to me immediately: "How would this stand out in an ocean of other AI 'ideas'?" and "How cheap can I make this run, given my track record of side projects?"&lt;/p&gt;

&lt;p&gt;To make it stand out I just embraced my own personality: satirical, sarcastic, weary, overqualified. My AI's persona was born. The goal: build a cheap-to-run, satirical AI service you can use to roast your friends and colleagues when they ask you a dumb question.&lt;/p&gt;

&lt;p&gt;Over the next several posts, I'll take you through my journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using agentic development with thoughtful (brutally efficient) software architecture; treating it like I would a client project&lt;/li&gt;
&lt;li&gt;Enjoying all the little technical challenges discovered along the way&lt;/li&gt;
&lt;li&gt;A masterclass in scope creep: turning a simple Q&amp;amp;A app into a dark narrative with easter eggs&lt;/li&gt;
&lt;li&gt;Getting by on free tiers for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A theme you'll see throughout: AI has made code cheaper to write, but creating real software with trade-offs, constraints, and production operations is still expensive and challenging. That's the fun part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized for Not Losing Money&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impulse buy a domain on a Friday afternoon, start building that night, try not to lose money doing it. Check.&lt;/p&gt;

&lt;p&gt;I usually plan everything meticulously, but for this project I decided to just build and see what emerged. Was this just a Q&amp;amp;A app wrapped around an LLM as a gag? Was I actually trying to make something people would want to use? I still don't know, but I started building anyway.&lt;/p&gt;

&lt;p&gt;A few things quickly became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The business reality:&lt;/strong&gt; This was a side project built for fun, not a funded startup. No runway. No tolerance for baseline monthly bills that sneak up on you. If this thing got any traction, costs had to scale with incredible efficiency and would need to survive on remnant ad CTRs and selling one, maybe two products through affiliate links.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolution:&lt;/strong&gt; The more I thought about it, the more I realized the personality WAS the product. It wasn't enough to just answer questions. It had to roast you. Entertain you. Make you want to share it. That meant high-quality LLM responses, which aren't free. This was likely the only way to get noticed in a sea of AI products.&lt;/p&gt;

&lt;p&gt;"𝘉𝘳𝘶𝘵𝘢𝘭𝘭𝘺 𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵" became my mantra and part of every AI tool prompt.&lt;/p&gt;

&lt;p&gt;The tech stack followed from the constraints (a minimal handler sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golang: Lightweight, fast, LLM-friendly for agentic coding&lt;/li&gt;
&lt;li&gt;HTMX: Server-side rendering, no heavy JS frameworks&lt;/li&gt;
&lt;li&gt;Docker on GCP Cloud Run: Scales to zero when idle&lt;/li&gt;
&lt;li&gt;Cloudflare: CDN, caching, security on free tier&lt;/li&gt;
&lt;li&gt;OpenRouter.ai: Find the cheapest reasonable LLM&lt;/li&gt;
&lt;/ul&gt;
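
&lt;p&gt;To show how little glue that stack needs, here's a minimal Go + HTMX handler in the spirit of the real one. The route names and placeholder answer are illustrative; the real handler picks a persona and calls OpenRouter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

import (
    "fmt"
    "html"
    "net/http"
)

func main() {
    // Static shell page loads the htmx script; everything else is server-rendered.
    http.Handle("/", http.FileServer(http.Dir("./static")))

    // htmx posts the question here and swaps the returned fragment straight
    // into the page. No JSON API, no client-side framework, no build step.
    http.HandleFunc("/ask", func(w http.ResponseWriter, r *http.Request) {
        question := r.FormValue("question")
        answer := "Yes, that was a dumb question: " + question // placeholder
        fmt.Fprintf(w, "&amp;lt;p class=%q&amp;gt;%s&amp;lt;/p&amp;gt;", "answer", html.EscapeString(answer))
    })

    http.ListenAndServe(":8080", nil)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A single static Go binary in a minimal base image is also how you land a container well under 20MB.&lt;/p&gt;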

&lt;p&gt;Oh, and it needed to be secure. Not because I worried about your cat questions being exposed as PII, but because bot traffic costs money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A Docker container under 20MB that starts in milliseconds, responds in milliseconds, and uses an LLM that can serve 1 million questions (about cats) for around $30. The math around serving ads suddenly becomes realistic.&lt;/p&gt;

&lt;p&gt;More to come ...&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>go</category>
      <category>htmx</category>
    </item>
  </channel>
</rss>
