Forem: Benjian Dai

I ran my idea-validation product through its own validator. The verdict was PIVOT.

Benjian Dai — Sat, 23 May 2026 06:57:18 +0000

Last week I ran a user's idea through Pro Validate, the AI validator I built into MonetScope. It came back PIVOT, 65% confidence. Not PROCEED, not PAUSE. PIVOT.

I sat with that for about ten seconds. Then a more uncomfortable question showed up: if my validator says PIVOT to her idea, what would it say to my own product?

So I ran MonetScope through MonetScope.

Verdict: PIVOT, 68% confidence.

The result was both reassuring and brutal. Reassuring because it proved the tool isn't built to flatter. Brutal because the three reasons it gave me were exactly the parts I'd been quietly avoiding.

Why I Did This

Most "AI idea validator" tools have a credibility problem. You feed them an idea, you get back something that sounds smart, and you have no idea whether the answer would be different if you'd typed something completely different. The same input doesn't always give the same output. The output doesn't disclose its evidence. It just confidently says yes.

That's why I built Pro Validate the way I did. Every signal in the report links back to actual Reddit, Hacker News, or X posts that inform it. But there's a deeper test I hadn't run on myself: would the tool tell ME no?

That question is the only thing that matters. Because if your validator gives PROCEED to every input that sounds plausible (including your own product description), it's not a validator. It's a mirror.

So I wrote the most honest version of MonetScope's pitch I could, pasted it into the form, and hit Validate.

Idea: A web platform that mines Reddit, Hacker News, and X for validated user pain points, scores them across 11 dimensions, and gives founders a PROCEED / PIVOT / PAUSE verdict on ideas they submit. 12,000+ pre-validated opportunities in the database.

Target user: Indie founders and SaaS builders looking to either find their next idea or validate one they already have.

Monetization: Subscription. Free tier limited, paid tier unlocks the AI verdict, deep analysis reports, and opportunity monitoring.

Before clicking submit, I wrote down my prediction: PIVOT, somewhere between 60-70% confidence. The space had at least 5 competitors I could name from memory. The product name ("MonetScope") isn't self-explanatory. Willingness-to-pay (WTP) signals from indie founders are notoriously weak.

I expected pushback. I got more than pushback.

PIVOT, 68% confidence. My prediction landed inside the range I'd written down. But that was the only comforting part of the report.

The Three Critiques

Critique 1: "15 highly similar matched opportunities (top 82.8% similarity)"

My first reaction: "fifteen direct competitors? I knew of five."

Then I read it more carefully. These weren't 15 existing competitors. They were 15 entries in my own database. Independent clusters of pain signals from Reddit, HN, and X, all describing the same shape: "founders need a way to extract validated startup ideas from forum signals."

In other words: my own product had surfaced 15 separate groups of people on the internet asking, in slightly different words, for what MonetScope does.

That changes the read of the data entirely. 15 highly similar matches isn't a saturation signal. It's a demand signal. The market is real and is being articulated by independent voices.

What it doesn't change is the second critique.

Critique 2: "Zero direct WTP mentions"

This one was harder to sit with.

Across 34 evidence quotes from matched opportunities, zero of them contained phrases like "I'd pay for" or "shut up and take my money." Pain is everywhere. Willingness-to-pay is invisible. And the existing direct competitors in this space (Product Hunt, Indie Hackers, Starter Story, ChatGPT) are all free.

The pricing band Pro Validate assigned me ("Free to $20") puts MonetScope in direct head-to-head with established free incumbents. That's not a winning position. That's a position where users compare you to free and shrug.

Reading this, I realized I'd been silently treating "monthly subscription" as the obvious answer. The data was telling me to stop treating it as obvious and actually validate whether the founder ICP would pay, at what point, and for what specific output.

Critique 3: "Without sharp differentiation"

This is the one that should have been easiest to argue with. I have an 11-dimension scoring model. Evidence trails linking to source posts. A B2B API. A pre-validated database. There's plenty of differentiation, technically.

But "technically differentiated" and "differentiation that lands in the buyer's head" are different things.

And here's where the case study stops being about Pro Validate and starts being about a stranger pattern.

In the same week I ran this self-test, two other independent signals arrived saying the same thing in different words:

Signal 1 (Pro Validate): "Many free/alternative tools make paid conversion challenging without sharp differentiation."

Signal 2 (a positioning consultant who cold-emailed me out of nowhere):

"On the first screen, there are several trust-building claims at once. AI-curated, real pain, validated commercial potential, 11-dimension scoring. But I think one concrete opportunity example with a crisp 'why trust this score' explanation would do more work than the stack of abstractions."

Signal 3 (an actual user who'd signed up and was confused):

"The wording around 'opportunities' and the overall presentation gave me the impression that the platform could also help founders connect with potential buyers, partners, or commercialization opportunities for their projects."

Three independent paths. My own tool. A stranger consultant. A real user. Different audiences, different language, same diagnosis: the differentiation isn't sharp enough, and the positioning leaks in ways I hadn't seen.

When independent sources start saying the same thing through different channels, it's not feedback anymore. It's the diagnosis.

The Playbook

Pro Validate doesn't just give a verdict. It gives a Validation Playbook with specific actions ranked by priority.

The four P0 items it surfaced for MonetScope:

Run 50 test validations on real user-submitted ideas and measure verdict usefulness (1 week)
Interview 15 recent users about willingness to pay for deeper analysis (2 days)
Audit top 5 competitors' free tiers to map exact feature gaps vs your paid offering (3 days)
Track conversion rate from free to paid within the first 30 users (ongoing)

Notice what's missing from this list: "ship more features." The verdict isn't telling me to build. It's telling me to talk to people, audit incumbents, and measure what's already happening before adding anything new.

That's what a PIVOT verdict actually means. Not "kill it." Not "rebuild it." It means: validate adoption friction and WTP before building any more.

What This Teaches Me About Idea Validators

Three things I think matter, beyond MonetScope specifically.

The honesty test: If a validator gives PROCEED 95% to every product description that sounds plausible, the tool is broken. A PIVOT or PAUSE verdict, calibrated against actual evidence, is the result that proves the validator is doing real work. The day I get a PROCEED verdict on something I know is a bad idea, I have to retire the tool.

The depth test: A vague "needs work" verdict is useless. The reason Pro Validate's PIVOT was actionable is that it gave me four specific P0 actions, each with a hypothesis to test and an estimated effort. That's the difference between a horoscope and a diagnosis.

The blind-spot test: My own product surfaced a problem I'd been avoiding. A stranger consultant independently pointed at the same problem. An actual user, in a completely different context, pointed at the same problem through a totally different angle (the word "opportunities" being read as "commercialization opportunities"). External signals stack. They override the founder's ego eventually.

What I'm Doing About It

Two things in motion, neither of them "ship more features."

First: I'm running the WTP interviews this week. 15 of them, focused on the founders who've already touched the paid tier. The verdict was right that indie founder WTP is the question I haven't actually answered. I've been defending the current price. Time to find out what the actual ladder should be.

Second: I'm auditing the landing page copy this weekend. When a real user reads "opportunities" as "commercialization opportunities for my project," that's not a language nitpick. It's a positioning leak. The word is doing the wrong work in the buyer's head.

I'll write another post in 4 weeks with what came back.

If you want to try Pro Validate on your own idea (and see whether it tells you PIVOT or PROCEED), it's at monetscope.com/validate/pro. Honest disclosure: the AI verdict feature is paid. The basic idea validator is free.

The most useful thing I can promise is that it won't tell you what you want to hear unless the data actually says you should hear it.

I shipped an AI pipeline in a month that reads Reddit, HN, and X for startup ideas. The hardest part wasn't the AI.

Benjian Dai — Tue, 28 Apr 2026 13:30:23 +0000

For the last month I've been building MonetScope — a pipeline that crawls Reddit, Hacker News, and X, reads what real people are complaining about, and surfaces the complaints as scored startup opportunities.

Going in, I assumed the hard part would be the AI layer. You know the story: prompt engineering, structured output, temperature tuning. That's where the demos happen and where most of the blog posts get written.

It wasn't. The LLM layer landed roughly on schedule. What took real engineering time were four other things — each one taught me something I'll carry into the next pipeline I build. Plus one deeply dumb cross-language serialization bug that nearly corrupted my data for a week without me noticing.

This is a write-up of those things.

The pipeline, very roughly

Before we dig in, the mental model. I'm keeping this deliberately abstract because the interesting part is the categories, not my specific libraries.

   [Reddit]    [Hacker News]    [X]
       \            |            /
        \           |           /
         +---> crawler layer <---+
                    |
             message queue
                    |
          deterministic filters
                    |
           multi-stage LLM layer
                    |
            grounding + storage
                    |
                 product UI

Two runtimes: one for crawlers (good at "get data out of places"), one for orchestration plus API (good at "stay up under load"). A message queue between them. Boring, intentionally.

What's on that diagram today is three public platforms. What's on my whiteboard is more — additional communities, niche forums, and eventually a user-submitted channel where a founder can drop in their own support tickets, a competitor's review stream, or a CSV dump from a private Slack. Each new source is just another input box on the diagram, which is the whole point of having the diagram at this level of abstraction. The cost of adding source N+1 is not in the pipeline; it's in the per-platform quality heuristics, and that's a problem I enjoy having.

Now — the four things that took more time than the AI. Starting with the most embarrassing.

1. The scraping tool was wrong for two of three platforms

Every scraping tutorial you'll find online opens with the same heavyweight toolkit — headless browser, stealth plugins, proxy rotation, the works. I started there too.

The heavyweight toolkit turned out to be unnecessary for one platform, wrong for another, and a losing fight on the third, where paying for access was the sensible move.

Platform A: I had a headless browser dutifully rendering pages for three weeks before I realized the same data was available through a much thinner path that didn't require rendering a single pixel. When I rewrote that spider it went from "needs a beefy worker" to "runs on a potato." I want those three weeks back.
Platform B: There was a developer-focused API available the entire time. Not just scrape-able, but officially supported. I was reimplementing in the browser layer what the API already offered. Pure hubris: I had assumed "if the tutorial uses a browser for it, that must be the right tool."
Platform C: No shortcut. Actively hostile to scraping, continuous cat-and-mouse, and my time is worth more than the subscription. Paid for access. Never looked back.

The generalizable lesson: don't start with a tool and hunt for problems to solve with it. Start with how the source actually serves data to its own frontend. If it serves structured data, there's usually a path that isn't a browser. If it only renders via JS, a headless browser is the right tool. If it's hostile, pay or skip.

The heavyweight-browser default isn't wrong — it's a good fallback when nothing else works. The failure mode is treating it as the starting point.
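That decision order can be written down as a small function. This is a minimal Python sketch, not the pipeline's actual code: SourceProfile, FetchStrategy, and choose_strategy are hypothetical names, and the three boolean fields stand in for whatever you learn by inspecting how a site feeds its own frontend.

```python
from dataclasses import dataclass
from enum import Enum


class FetchStrategy(Enum):
    OFFICIAL_API = "official_api"
    STRUCTURED_ENDPOINT = "structured_endpoint"
    HEADLESS_BROWSER = "headless_browser"
    PAY_OR_SKIP = "pay_or_skip"


@dataclass(frozen=True)
class SourceProfile:
    # Hypothetical fields: observations recorded by hand after looking
    # at how the source serves data to its own frontend.
    has_official_api: bool
    serves_structured_data: bool
    hostile_to_scraping: bool


def choose_strategy(source: SourceProfile) -> FetchStrategy:
    """Pick the thinnest access path that works; the browser comes last."""
    if source.has_official_api:
        return FetchStrategy.OFFICIAL_API
    if source.hostile_to_scraping:
        return FetchStrategy.PAY_OR_SKIP
    if source.serves_structured_data:
        return FetchStrategy.STRUCTURED_ENDPOINT
    # Nothing thinner exists: the page only renders via JS.
    return FetchStrategy.HEADLESS_BROWSER
```

The point of the ordering is that the headless browser is reached only after every cheaper option has been ruled out.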

2. The cheapest filter is the one that runs before the LLM

Naive version of the pipeline: crawled post arrives → LLM processes it → structured output goes to the database.

Two problems at once.

It's expensive. Every post is tokens, and token cost on a content-scale pipeline dominates the bill within a week.

The output is worse. "Why is no one building X?" posts waste tokens producing confident-sounding opportunity cards that don't survive human review. A model asked to extract an opportunity from a substance-free rant will dutifully hallucinate one.

The fix is philosophically simple: put a deterministic filter in front of the LLM that rejects content the LLM would have rejected anyway, but for free.

What the filter checks is deterministic, but the thresholds vary per platform: what counts as a substantive post in one community is a throwaway in another, and the values drift as community norms change. I'm not going to publish the current values; they're part of the product. But the interesting thing about tuning them isn't the numbers. It's that I ended up building a small tuning harness that was more work than picking any individual threshold.

Ordering matters too. The deterministic filter runs before the stateful dedup layer, not after. A post the filter rejects is never recorded as seen, so posts rejected today can be reconsidered tomorrow if they accumulate engagement, which they sometimes do, especially on HN.

Generalizable rule: before every LLM call, ask "can I cheaply reject this input first?" The answer is usually yes, and the win compounds: less cost per document, better signal on the documents that make it through, fewer false positives to clean up downstream.
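A minimal Python sketch of a filter placed ahead of both the dedup layer and the LLM. Everything here is an assumption for illustration: the Post fields, the function names, and especially the threshold numbers, which are made up and are not the unpublished production values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: str
    platform: str
    text: str
    score: int      # upvotes / points, whatever the platform exposes
    comments: int


# Hypothetical thresholds, invented for this example only.
THRESHOLDS = {
    "reddit": {"min_chars": 200, "min_score": 5, "min_comments": 3},
    "hn": {"min_chars": 100, "min_score": 10, "min_comments": 5},
}
DEFAULT = {"min_chars": 150, "min_score": 5, "min_comments": 2}


def passes_prefilter(post: Post) -> bool:
    """Cheap deterministic check that runs before any LLM call."""
    t = THRESHOLDS.get(post.platform, DEFAULT)
    return (
        len(post.text.strip()) >= t["min_chars"]
        and post.score >= t["min_score"]
        and post.comments >= t["min_comments"]
    )


def select_for_llm(posts: list[Post], seen: set[str]) -> list[Post]:
    """Filter first, dedup second.

    Rejected posts are never added to `seen`, so a later crawl can
    pick them up again once they have gathered engagement.
    """
    selected = []
    for post in posts:
        if not passes_prefilter(post):
            continue  # rejected for free, and not marked as seen
        if post.post_id in seen:
            continue  # already sent to the LLM on an earlier run
        seen.add(post.post_id)
        selected.append(post)
    return selected
```

Calling select_for_llm on the same post twice, once with low engagement and once after it has grown, lets the second crawl through; calling it a third time returns nothing because the post is now in seen.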

3. Shipping an "AI-grounded" feature without lying to users

This is the section I most want a reader to take away, so I'll be careful about the level I pitch it at.

The product makes a specific promise: every claim it generates is backed by a quote from an identifiable, real user. You see "users complain that X breaks every Tuesday" and you can click through to the exact comment where someone said exactly that.

If that promise leaks — if even a small percentage of the quotes are paraphrased, massaged, or invented — the product has no reason to exist. "We summarize Reddit with AI" is a commodity. "We show you the literal thing the person said" is not.

LLMs in their default configuration will break that promise. Paraphrasing is what they're good at. Making up a plausible-sounding quote is easier for them than surfacing the specific boring one that matters. This isn't a flaw of any particular model; it's a property of optimized-for-fluency generation.

The approach I landed on, at the pattern level:

Don't ask the LLM to find sources. Do extraction of candidate source material deterministically, before the generation step. The LLM sees a curated candidate pool, not the raw corpus.
Constrain generation to operate on those candidates. The LLM's job is to synthesize and structure. It is not the layer that decides what's citable.
Mechanically verify every output claim against a specific source record. If it doesn't match a real record, it doesn't ship. This is the step most pipelines skip, and it's the one that determines whether users still trust the output six months in.
Fail closed. If verification can't find the source for a generated claim, drop the claim. If dropping claims leaves the output empty, drop the whole output. Empty is fine. Phantom is not.

I won't walk through the matching algorithm, what the candidate pool looks like in this codebase, or how I decide "drop claim vs drop whole output" — those are product-specific tuning and they're where the moat actually lives.

The pattern itself is freely available, and I wish I'd seen it articulated when I started: for any LLM feature where hallucination is a correctness bug rather than a style flaw, the pattern is pre-extract → constrain → verify → fail closed. Half-measures ship, but they don't keep trust.
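To make the four steps concrete, here is a minimal Python sketch under loud assumptions. The post deliberately does not disclose the matching algorithm or the candidate pool, so this swaps in the simplest stand-ins: keyword matching for pre-extraction and normalized exact-substring matching for verification. All names (SourceRecord, Claim, pre_extract, verify_claims, ground) are hypothetical, and the LLM call itself is left out; the claims list represents whatever the model returned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    record_id: str
    text: str


@dataclass(frozen=True)
class Claim:
    summary: str
    quote: str        # must appear verbatim in the cited record
    record_id: str


def normalize(s: str) -> str:
    """Collapse whitespace and case so trivial formatting differences still match."""
    return " ".join(s.split()).casefold()


def pre_extract(records: list[SourceRecord], keywords: list[str]) -> dict[str, SourceRecord]:
    """Step 1: build the candidate pool deterministically (keyword match is a stand-in)."""
    keys = [k.casefold() for k in keywords]
    return {r.record_id: r for r in records if any(k in r.text.casefold() for k in keys)}


# Step 2 (constrain) happens in the prompt: the model is shown only the
# pool and must cite a record_id for every quote it returns.


def verify_claims(claims: list[Claim], pool: dict[str, SourceRecord]) -> list[Claim]:
    """Step 3: keep a claim only if its quote is found in the cited record."""
    verified = []
    for claim in claims:
        record = pool.get(claim.record_id)
        if record is None:
            continue  # cites something outside the candidate pool
        quote = normalize(claim.quote)
        if quote and quote in normalize(record.text):
            verified.append(claim)
    return verified


def ground(claims: list[Claim], pool: dict[str, SourceRecord]) -> Optional[list[Claim]]:
    """Step 4: fail closed. No verified claims means no output at all."""
    verified = verify_claims(claims, pool)
    return verified or None
```

A paraphrased quote and a quote citing an unknown record are both dropped silently; if nothing survives, ground returns None and the caller ships nothing.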

4. One score is almost never enough

The hardest non-infrastructure problem turned out to be teaching the pipeline the difference between "people are angry" and "people will pay."

The naive move is one score per item — how good is this opportunity, 0-10. This doesn't work. It tries to answer two reader questions at once ("is there real pain here?" and "would anyone actually buy a solution?") and those are orthogonal.

A thread full of "someone should build X" is high on the first and near-zero on the second. A thread where one person has duct-taped three tools together and is actively shopping for a replacement is moderate on the first and very high on the second. A single composite score collapses those into noise.

The fix isn't clever math. It's the recognition that any time a single number is answering more than one question, it ends up answering none of them well. Split the signal into the questions you actually want answered, score those separately, and compose them deliberately rather than averaging them. A lot of weak signal does not beat a little strong signal, even when the means come out similar.

One non-obvious finding I'll share because it's directional-only: the "obvious" communities (the big generalist ones) produce noisier signal than niche ones. Volume is a weak proxy for signal strength. Source diversity turned out to matter as much as source volume — an opportunity drawn from three different niche communities beats one drawn from thirty posts in one big general community, even though the raw evidence count is an order of magnitude lower.

Dev-applicable version: when a ranking algorithm isn't working, check whether you're averaging two signals that are answering different questions. You'll be surprised how often the answer is yes.
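A small Python illustration of why averaging hides the difference. The two-field Signals type, the gated function, and the floor value are all hypothetical; the real model has more dimensions and its composition rule is not described in the post. The numbers are chosen so the two averages tie.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Signals:
    pain: float         # "is there real pain here?", 0-10
    willingness: float  # "would anyone actually buy a solution?", 0-10


def averaged(s: Signals) -> float:
    """The naive composite: one number answering two questions."""
    return mean([s.pain, s.willingness])


def gated(s: Signals, floor: float = 3.0) -> float:
    """Hypothetical composition: both signals must clear a floor,
    then the weaker of the two sets the rank."""
    weakest = min(s.pain, s.willingness)
    return weakest if weakest >= floor else 0.0


rant = Signals(pain=9.0, willingness=1.0)     # "someone should build X"
shopper = Signals(pain=4.0, willingness=6.0)  # actively looking to buy

print(averaged(rant), averaged(shopper))  # 5.0 5.0
print(gated(rant), gated(shopper))        # 0.0 4.0
```

Under the average the two threads are indistinguishable; once each question is scored on its own and composed with a rule, the rant drops out and the shopper stays.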

The dumb bug that almost invalidated everything

This one isn't a product moat — it's a general cross-language gotcha — so I'll share it in detail because the lesson is broadly useful to anyone building a polyglot system.

Two services in two languages, both writing to the same database column. The column stores vectors as text, like [0.123,0.456,...]. The database happily accepts whatever either language produces.

The trap: each language's default float-to-string produces slightly different output. Different widths. Different rounding at edge cases. To the human eye they look identical. To byte-wise comparison they aren't. Parsed back into vectors, they're close but not identical, so cosine similarity shifts slightly.

Nothing broke. No exception. No type error. No failing test. What happened was that semantic similarity rankings drifted depending on which service had last written the record. Results were good, then mysteriously slightly bad, then good again. I chased this for most of a week before I realized what I was looking at.

The fix, in pseudocode:

// One canonical formatter, shared definition across both runtimes.
// Verified by golden-string test in each language.
format_vector(floats):
    return "[" + join(",", each f -> to_string_exact(f)) + "]"

Here to_string_exact is explicitly pinned to the widest-precision, culture-invariant format available in each language, not whatever the default toString() happens to do. The test is a literal string equality check against a hand-written golden output, run from both sides.

The broader lesson: "both sides are using the default" is a dangerous sentence in a polyglot system. Default serialization isn't a contract. If two runtimes are going to share a serialized format, write the format exactly once as a pure function, and verify its output byte-for-byte from both languages. Repeat for every format that crosses the runtime boundary — JSON casing was the other one that bit me, in passing.
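For one side of such a system, the canonical formatter and its golden-string test might look like this in Python. The post does not name the two languages or the exact format, so the choice of 17 significant digits via the ".17g" format spec is an assumption: it is enough to round-trip an IEEE 754 double and can be reproduced with a printf-style "%.17g" in many other runtimes.

```python
import math


def format_vector(floats) -> str:
    """Serialize a vector as text with one pinned float format.

    Assumption: ".17g" (17 significant digits) is the agreed format.
    It does not depend on locale and round-trips any finite double.
    """
    parts = []
    for f in floats:
        f = float(f)
        if not math.isfinite(f):
            # NaN and infinity have no agreed spelling across runtimes.
            raise ValueError(f"non-finite component: {f!r}")
        parts.append(format(f, ".17g"))
    return "[" + ",".join(parts) + "]"


# Hand-written golden pair. The same literal string would be checked
# by an equivalent test in the other runtime.
GOLDEN_INPUT = [0.5, -0.25, 0.1, 2.0]
GOLDEN_OUTPUT = "[0.5,-0.25,0.10000000000000001,2]"

assert format_vector(GOLDEN_INPUT) == GOLDEN_OUTPUT
```

The 0.1 entry is the useful one: its 17-digit form exposes the stored double, which is exactly the kind of detail two default formatters tend to disagree on.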

What I'd do differently

Nothing kills a launch post faster than "everything went great." So:

I picked the wrong primary data store early, for the wrong reason — "flexibility." Future me didn't want flexibility. Future me wanted fewer moving parts. Moved to a boring relational DB with a JSON column type and things got better immediately. Lost about three weeks.
I wrote my own rate-limit layer before realizing a standard caching-server primitive plus ten lines of script would have done the same job in an afternoon. Lost a week on that one.
I underestimated observability. The liveness vs. readiness healthcheck split only happened after the third production incident. It should have been there on day one — you don't need it until you need it, and then you need it immediately.
The grounding / verification layer shipped too late. Weeks of early data had to be re-processed once I added it. It should have been part of the first LLM call, not the twelfth.

If there's a theme: my worst decisions were the ones where I picked the more flexible option so future-me would have more options. Future-me didn't want options. Future-me wanted something that worked.

Closing

That's four things that turned out harder than the AI, plus the serialization bug for flavor.

The product all this plumbing is in service of is at monetscope.com — free 14-day trial, no card. If you'd rather see what it produces before committing, this week's top 10 opportunities are at monetscope.com/this-week (just email, no card). The output is what the pipeline exists for, but to be honest the feedback I most want right now is on the engineering choices above, not the landing page.

If you've shipped an LLM-scored content pipeline yourself, I'd genuinely like to hear: how do you version and regression-test your prompts? That's the layer I feel weakest on, and I haven't found a tool I love. Current setup is git commit plus a hand-maintained regression set, and it's starting to creak as the prompt surface grows.

Also: if you maintain a community, newsletter, or data source you'd be interested in seeing indexed — that's on my roadmap, and I'm actively looking for source-expansion partnerships. DM me.

Thanks for reading this far.