<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: OnChainAIIntel</title>
    <description>The latest articles on Forem by OnChainAIIntel (@onchainaiintel).</description>
    <link>https://forem.com/onchainaiintel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3860854%2F81f61c1a-a40c-46f4-a8b1-3cda482a1707.png</url>
      <title>Forem: OnChainAIIntel</title>
      <link>https://forem.com/onchainaiintel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/onchainaiintel"/>
    <language>en</language>
    <item>
      <title>I scored 500 AI prompts across 8 quality dimensions — here's what broke</title>
      <dc:creator>OnChainAIIntel</dc:creator>
      <pubDate>Sat, 18 Apr 2026 18:59:04 +0000</pubDate>
      <link>https://forem.com/onchainaiintel/i-scored-500-ai-prompts-across-8-quality-dimensions-heres-what-broke-23kn</link>
      <guid>https://forem.com/onchainaiintel/i-scored-500-ai-prompts-across-8-quality-dimensions-heres-what-broke-23kn</guid>
      <description>&lt;p&gt;Most teams are getting 10 to 30% of what their LLM model can actually do.Not because the model is weak. Because the prompt is.&lt;/p&gt;

&lt;p&gt;I’ve spent the last two weeks scoring prompts. Real ones, from real builders, across real verticals, against an 8-dimension quality rubric. This weekend I ran another 500 through the scorer to pressure-test the pattern. Every dataset converges on the same range: the average production prompt scores 13 to 16 out of 80. That’s 17 to 20% of what the rubric says a well-formed prompt looks like.&lt;/p&gt;

&lt;p&gt;You’re paying for a Ferrari and driving it in first gear to the mailbox.&lt;/p&gt;

&lt;h2&gt;What I Measured&lt;/h2&gt;

&lt;p&gt;Every prompt was scored on 8 dimensions, each from 1 to 10, for a total out of 80:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clarity. Is the task unambiguous?&lt;/li&gt;
&lt;li&gt;Specificity. Concrete targets, numbers, scope?&lt;/li&gt;
&lt;li&gt;Context. Background, assumptions, domain?&lt;/li&gt;
&lt;li&gt;Constraints. Limits, rules, edge cases?&lt;/li&gt;
&lt;li&gt;Output format. What shape should the response take?&lt;/li&gt;
&lt;li&gt;Role definition. “Act as a ___”?&lt;/li&gt;
&lt;li&gt;Examples. Few-shot or reference cases?&lt;/li&gt;
&lt;li&gt;Chain-of-thought structure. Reasoning scaffolding?&lt;/li&gt;
&lt;/ul&gt;
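&lt;p&gt;As a sketch of how scoring against this rubric works mechanically: the dimension names below come from the list above, but the rater hook, clamping, and field names are my own illustration, not the actual PQS implementation.&lt;/p&gt;

```javascript
// Illustrative only: a rubric-shaped scorer over the article's 8 dimensions.
// The real PQS scoring logic is not public; this just shows the arithmetic.
const DIMENSIONS = [
  "clarity", "specificity", "context", "constraints",
  "outputFormat", "roleDefinition", "examples", "chainOfThought",
];

// Each dimension scores 1-10, so totals land between 8 and 80.
function scorePrompt(prompt, rate) {
  const perDimension = {};
  let total = 0;
  for (const dim of DIMENSIONS) {
    const raw = rate(dim, prompt);                  // caller-supplied rater
    const clamped = Math.min(10, Math.max(1, raw)); // keep within 1-10
    perDimension[dim] = clamped;
    total += clamped;
  }
  return { perDimension, total, outOf: DIMENSIONS.length * 10 };
}
```

A prompt rating 5 on every dimension totals 40/80, roughly triple the 13 to 16 average the datasets below actually show.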

&lt;p&gt;These aren’t arbitrary. They map to what the prompt engineering literature has known for five years: PEEM, RAGAS, G-Eval, MT-Bench, the Anthropic and OpenAI prompting guides. Everyone agrees these dimensions matter. Nobody’s checking whether their production prompts actually hit them.&lt;/p&gt;

&lt;h2&gt;The Data&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzcee793uwm0gt1mu8nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzcee793uwm0gt1mu8nr.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;500 software engineering prompts, all in the real-world format “Build X using Y.”&lt;/p&gt;

&lt;p&gt;Average score: 13.3 out of 80&lt;br&gt;
83% graded F. 17% graded D. Zero scored C or above.&lt;/p&gt;

&lt;p&gt;After rewriting against the rubric: average 68.5 out of 80. A B+.&lt;br&gt;
Average improvement: +55 points. 425% relative gain.&lt;/p&gt;

&lt;p&gt;For context, the organic dataset of 248 prompts, submitted by real users of the scoring tool across 7 verticals, showed the same pattern: 89% graded D or F, with an average before-score of 15.8/80.&lt;/p&gt;

&lt;p&gt;Software prompts were slightly worse than average. Not better. Engineers aren’t exempt from this. If anything, the “it’s a technical task so it must be rigorous” assumption is the trap.&lt;/p&gt;

&lt;h2&gt;What’s Actually Missing&lt;/h2&gt;

&lt;p&gt;Here’s the dimension breakdown. Look at how specific the failure pattern is:&lt;/p&gt;

&lt;p&gt;Examples scored 1.01 out of 10. Across 500 prompts that developers wrote to build production software, essentially zero included a reference case, a shape to follow, or a “here’s what good looks like.”&lt;/p&gt;

&lt;p&gt;This is the dimension every prompt engineering guide tells you matters most. The gap between what engineers know they should do and what they actually write is near-total.&lt;/p&gt;

&lt;p&gt;Constraints at 1.09. Role definition at 1.18. Clarity, the only dimension averaging above 2, sits at 3.19.&lt;/p&gt;

&lt;p&gt;Engineers are writing English sentences with tech keywords. The structural scaffolding that turns a wish into a spec is almost entirely absent.&lt;/p&gt;

&lt;h2&gt;What This Looks Like In Practice&lt;/h2&gt;

&lt;p&gt;A representative prompt from the dataset:&lt;/p&gt;

&lt;p&gt;“Build a real-time collaborative text editor using React for the frontend.”&lt;/p&gt;

&lt;p&gt;Scores 14/80.&lt;/p&gt;

&lt;p&gt;It sounds specific. It names a technology. It has a verb. But the model receiving it has to guess. Collaboration for how many users? What’s the sync strategy? Operational transform or CRDT? What’s the latency budget? What does “done” look like? No examples. No output format. No constraints.&lt;/p&gt;

&lt;p&gt;The rewritten version, same task, generated by the scorer to address the rubric gaps, scored 69/80. It defined the target user. It specified real-time sync requirements. It listed technical constraints including concurrent editors, conflict resolution strategy, and latency targets. It specified the response format. It included an example implementation signature.&lt;/p&gt;
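&lt;p&gt;The gap between the two versions can be sketched as a template that forces each rubric dimension to be filled in. The field names and sample values here are hypothetical; the full rewrite the scorer generated is not reproduced in this post.&lt;/p&gt;

```javascript
// Hypothetical spec-shaped prompt builder; field names are my own.
function toSpecPrompt({ role, task, context, constraints, outputFormat, example }) {
  return [
    `You are ${role}.`,
    `Task: ${task}`,
    `Context: ${context}`,
    `Constraints:\n${constraints.map((c) => `- ${c}`).join("\n")}`,
    `Output format: ${outputFormat}`,
    `Example of what good looks like:\n${example}`,
  ].join("\n\n");
}

// The bare 14/80 prompt, upgraded along the lines the rubric asks for.
const prompt = toSpecPrompt({
  role: "a senior frontend engineer",
  task: "Build a real-time collaborative text editor using React for the frontend.",
  context: "Small teams (2-10 concurrent editors) editing shared documents.",
  constraints: [
    "CRDT-based conflict resolution",
    "sub-150ms sync latency target",
  ],
  outputFormat: "Component tree first, then annotated source for each file.",
  example: "function useCollaborativeDoc(docId) { /* returns live doc state */ }",
});
```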

&lt;p&gt;Same end goal. Different input. 5x the quality score. The before version wastes tokens on clarification and produces generic output. The after version reads more like a spec than a prompt.&lt;/p&gt;

&lt;h2&gt;Why This Matters Now&lt;/h2&gt;

&lt;p&gt;The easy dismissal is “people should write better prompts.” That misses the systemic problem.&lt;/p&gt;

&lt;p&gt;The industry has spent years focused on output evals. Every eval platform measures what the model produced. Almost nobody measures what the model was given.&lt;/p&gt;

&lt;p&gt;That worked when prompts were single-shot, human-written, and reviewed before shipping. It stops working the moment prompts become infrastructure.&lt;/p&gt;

&lt;p&gt;In agentic workflows where one LLM call feeds the next, a 13/80 input becomes the input to the next call, which is already compromised before you add retrieval or structured tool calls. In the last week alone, three x402-native agent systems went live that share the same input surface of natural language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Daydreams Taskmarket. Agents bidding on work described in plain text.&lt;/li&gt;
&lt;li&gt;PeptAI. Autonomous peptide discovery running wet-lab orders.&lt;/li&gt;
&lt;li&gt;AlliGo. A credit bureau scoring agent behavior across endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the descriptions populating those systems score like the 500 in this dataset, the agent economy is routing compute and payments on structurally empty inputs.&lt;/p&gt;

&lt;p&gt;The output eval loop can’t catch this. By the time the output looks wrong, the compute bill is already on your card.&lt;/p&gt;

&lt;h2&gt;An Infrastructure Problem&lt;/h2&gt;

&lt;p&gt;Telling engineers to write better prompts is like telling them to write better SQL without giving them a linter. Telling teams to review prompts manually is like asking for code review without git blame.&lt;/p&gt;

&lt;p&gt;The answer is the same answer every other quality-assurance problem eventually reached. Measure it. Instrument it. Put the measurement in the continuous integration pipeline. Block the bad ones before they ship.&lt;/p&gt;
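&lt;p&gt;A CI gate along those lines can be small. This sketch assumes the free scoring endpoint accepts {"prompt", "vertical"} and returns score, out_of, and grade fields; the 40/80 threshold is an arbitrary example, not a PQS recommendation.&lt;/p&gt;

```javascript
// Sketch of a CI quality gate; the threshold is an example, not a recommendation.
function gate(result, minScore) {
  return result.score >= minScore;
}

// Assumed request/response shape for the free scoring endpoint.
async function checkPrompt(prompt, minScore = 40) {
  const res = await fetch("https://pqs.onchainintel.net/api/score/free", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, vertical: "software" }),
  });
  const result = await res.json();
  if (!gate(result, minScore)) {
    console.error(`Prompt scored ${result.score}/${result.out_of} (${result.grade})`);
    process.exitCode = 1; // fail the pipeline before the prompt ships
  }
}
```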

&lt;h2&gt;The Solution&lt;/h2&gt;

&lt;p&gt;It’s already here at &lt;a href="https://pqs.onchainintel.net" rel="noopener noreferrer"&gt;pqs.onchainintel.net&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Free tier: paste a prompt, get the 8-dimension breakdown plus a suggested rewrite. No signup required.&lt;/li&gt;
&lt;li&gt;Paid tiers: start at $19.99/mo for unlimited private-repo CLI usage; $99.99/mo for teams with GitHub PR checks, Slack alerts, and shared dashboards.&lt;/li&gt;
&lt;li&gt;x402 tier: the paid API charges $0.025 to $0.125 per scoring call in USDC on Base and Solana.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://pqs.onchainintel.net/pricing" rel="noopener noreferrer"&gt;Subscription available&lt;/a&gt; for all tiers via Stripe.&lt;/p&gt;

&lt;p&gt;You don’t have to adopt it. But you should at minimum run 10 of your production prompts through the free tier this week and look at the numbers.&lt;/p&gt;

&lt;p&gt;If you’re shipping anything that takes prompts from humans or other agents, the input layer is measurable. Start measuring. You’re leaving most of the model’s capability on the table, and you don’t have to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;If you want to score your own prompts, PQS is live at &lt;a href="https://pqs.onchainintel.net?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=500-prompts" rel="noopener noreferrer"&gt;pqs.onchainintel.net&lt;/a&gt; — free tier available, paid tiers for full 8-dimension scoring and batch runs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt; on npm: &lt;code&gt;npm install pqs-mcp-server&lt;/code&gt; — drop-in for Claude and other MCP-compatible agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Action&lt;/strong&gt;: &lt;a href="https://github.com/marketplace/actions/pqs-check" rel="noopener noreferrer"&gt;PQS Check on the Marketplace&lt;/a&gt; — score prompts in CI before they ship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API&lt;/strong&gt;: Direct x402 micropayments on Base or Bearer API key (subscription tiers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data from this teardown: 500 software prompts, average score before optimization 13.27/80, average after 68.47/80, 425% improvement. 416 graded F, 84 graded D, 0 at C+ or above.&lt;/p&gt;

&lt;p&gt;If you're in DevRel, developer advocacy, or DevEx working on AI pipelines — this is the input-quality data your builders need to see. Feel free to forward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What I'd love feedback on: which of the 8 dimensions surprised you most as the common failure mode? My hypothesis was clarity, but the data says &lt;code&gt;examples&lt;/code&gt; at 1.01 average and &lt;code&gt;constraints&lt;/code&gt; at 1.09 — the structural stuff almost nobody includes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Magic Prompt Formula is good. It's still missing a layer.</title>
      <dc:creator>OnChainAIIntel</dc:creator>
      <pubDate>Thu, 16 Apr 2026 05:52:56 +0000</pubDate>
      <link>https://forem.com/onchainaiintel/the-magic-prompt-formula-is-good-its-still-missing-a-layer-1oj4</link>
      <guid>https://forem.com/onchainaiintel/the-magic-prompt-formula-is-good-its-still-missing-a-layer-1oj4</guid>
      <description>&lt;p&gt;Everyone in AI circles eventually discovers the Magic Prompt Formula. If you haven't, here it is. It's the most widely shared structured approach to prompting, and it genuinely works.&lt;/p&gt;

&lt;p&gt;The formula has five parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Role (Who):&lt;/strong&gt; Assign a specific expert persona: "You are a senior content strategist" or "Act as a seasoned Python developer."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action (What):&lt;/strong&gt; Use a clear verb defining what the AI should do: "Draft," "Refactor," "Analyze," "Summarize."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context (Why):&lt;/strong&gt; Provide relevant background so the model doesn't give generic answers: your audience, your product, your use case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Examples (How):&lt;/strong&gt; Give one or two samples of the expected output style — this is few-shot prompting in practice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Constraints &amp;amp; Format (Boundaries):&lt;/strong&gt; Set limits and define structure: "Under 200 words," "No jargon," "Single block of copy," "Use bullet points."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of typing "write me a LinkedIn post," you write:&lt;/p&gt;

&lt;p&gt;You are a senior content strategist specializing in B2B SaaS. Write a LinkedIn post announcing [product]. My audience is AI developers and technical founders. Here's an example of the tone I want: [example]. Keep it under 200 words, no buzzword fluff, single block of copy.&lt;/p&gt;

&lt;p&gt;That's Role, Action, Context, Examples, and Constraints all in one prompt. You've gone from a generic instruction to a specialized brief. The output quality difference is real and immediate.&lt;/p&gt;

&lt;p&gt;So what's the problem?&lt;/p&gt;

&lt;p&gt;The Magic Prompt Formula covers about half of what actually determines prompt quality.&lt;/p&gt;

&lt;p&gt;I know this because I built a tool that measures it.&lt;/p&gt;

&lt;p&gt;PQS (Prompt Quality Score) scores prompts across 8 dimensions before you send them to a model. The five ingredients of the Magic Formula map cleanly onto five of those dimensions: clarity, specificity, context, examples, and constraints. A well-structured Magic Formula prompt typically scores in the 47–52 range out of 80. Solid. Grade A territory even.&lt;/p&gt;

&lt;p&gt;But there's a dimension the formula doesn't touch at all: chain-of-thought structure.&lt;/p&gt;

&lt;p&gt;CoT structure measures whether your prompt scaffolds the model's reasoning: numbered steps, analysis frameworks, structured output sequences, decision trees. It's the difference between asking an expert to answer a question and asking them to walk you through how they'd think about it.&lt;/p&gt;

&lt;p&gt;Magic Formula prompts score 3 or 4 out of 10 on CoT structure. Consistently. Across every vertical we've tested.&lt;/p&gt;
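&lt;p&gt;For intuition, here is a toy heuristic for the kinds of signals a CoT-structure check might look for: numbered steps, explicit step-by-step instructions, ordered sequence words. The actual PQS check is not public; this is illustration only.&lt;/p&gt;

```javascript
// Toy heuristic only; the actual PQS CoT-structure scoring is not public.
function hasCoTScaffolding(prompt) {
  const signals = [
    /\b\d+\.\s/,                                       // numbered steps: "1. ", "2. "
    /\bstep[- ]by[- ]step\b/i,                         // explicit reasoning instruction
    /\b(first|then|finally)\b.*\b(then|finally)\b/is,  // ordered sequence words
  ];
  return signals.some((re) => re.test(prompt));
}
```

A Magic Formula prompt usually fails all three checks; the PQS-optimized rewrites below pass the first one immediately.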

&lt;p&gt;I ran three Magic Formula prompts through PQS this week. One content prompt, one software prompt, one crypto analysis prompt.&lt;/p&gt;

&lt;p&gt;All three were well-constructed. All three scored Grade A before optimization.&lt;/p&gt;

&lt;p&gt;After PQS optimization, all three jumped by 63–71%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrfeig4v2pkc2jaiw8nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrfeig4v2pkc2jaiw8nb.png" alt=" " width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CoT dimension went from an average of 3.7 to 9.0 across all three. Every other dimension moved 1–4 points. CoT moved 5–6.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch1s7es71zykcui3iris.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch1s7es71zykcui3iris.png" alt=" " width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. The content prompt went from:&lt;/p&gt;

&lt;p&gt;"Keep it under 200 words, no buzzword fluff, end with a clear CTA, format as a single block of copy"&lt;/p&gt;

&lt;p&gt;To: "Structure your post: 1. Hook: open with a relatable developer pain point. 2. Problem: briefly explain the cost of discovering prompt issues too late. 3. Solution: introduce PQS as the pre-inference fix. 4. Benefit: one concrete outcome. 5. CTA: direct readers to pqs.onchainintel.net"&lt;/p&gt;

&lt;p&gt;Same constraints. Completely different reasoning scaffolding. The model doesn't just know what to write, it knows how to think through the writing.&lt;/p&gt;

&lt;p&gt;This is the AI 'input quality' problem in concrete form.&lt;/p&gt;

&lt;p&gt;The Magic Formula solves the WHO and the WHAT. It tells the model who it is and what you want. What it doesn't do is solve the HOW, the reasoning path the model should follow to get there.&lt;/p&gt;

&lt;p&gt;Most people find out their prompt was weak after the output disappoints them. By then you've already burned tokens, lost time, and often shipped something mediocre. PQS scores the prompt before inference. It catches the CoT gap, and every other gap, before you run it.&lt;/p&gt;

&lt;p&gt;If you're already using the Magic Formula, you're ahead of most people! Prompt Quality Score shows you exactly how much further ahead you could be.&lt;/p&gt;

&lt;p&gt;Score your next prompt before you send it → &lt;a href="https://pqs.onchainintel.net" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;About OnChainIntel — AI-powered on-chain wallet analysis. We decode the behavioral patterns, hidden biases, and implicit bets behind any wallet's transaction history. Try it free at onchainintel.net · Follow us on X: &lt;a class="mentioned-user" href="https://dev.to/onchainaiintel"&gt;@onchainaiintel&lt;/a&gt; · TikTok: &lt;a class="mentioned-user" href="https://dev.to/onchainintel"&gt;@onchainintel&lt;/a&gt; · YouTube: &lt;a class="mentioned-user" href="https://dev.to/onchainaiintel"&gt;@onchainaiintel&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Prompt Quality Score (PQS) Now Supports x402 Payments on Solana</title>
      <dc:creator>OnChainAIIntel</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:26:45 +0000</pubDate>
      <link>https://forem.com/onchainaiintel/prompt-quality-score-pqs-now-supports-x402-payments-on-solana-40nd</link>
      <guid>https://forem.com/onchainaiintel/prompt-quality-score-pqs-now-supports-x402-payments-on-solana-40nd</guid>
      <description>&lt;p&gt;Been heads down building. Quick update for anyone working in the agentic payments space.&lt;/p&gt;

&lt;p&gt;PQS (Prompt Quality Score), the pre-flight quality gate for AI agent workflows, now accepts x402 payments on both Base mainnet and Solana, in addition to the existing free tier and API key access.&lt;/p&gt;

&lt;h1&gt;What PQS does&lt;/h1&gt;

&lt;p&gt;Before your agent sends a prompt to an expensive LLM endpoint, PQS scores it across 8 dimensions: clarity, specificity, context, constraints, output format, role definition, examples, and chain-of-thought structure. Returns a score 0–80, grade A–F, percentile ranking, and top 3 fixes. &lt;/p&gt;

&lt;p&gt;Catch weak prompts before they cost you tokens or USDC.&lt;/p&gt;

&lt;h1&gt;
  
  
  Endpoint reference
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Free tier&lt;/strong&gt; — no auth required&lt;br&gt;
POST &lt;a href="https://pqs.onchainintel.net/api/score/free" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/api/score/free&lt;/a&gt;&lt;br&gt;
&lt;code&gt;{ "prompt": "your prompt here", "vertical": "software" }&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Score&lt;/strong&gt; + dimensional breakdown — $0.025 USDC via x402&lt;br&gt;
POST &lt;a href="https://pqs.onchainintel.net/api/score" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/api/score&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full analysis&lt;/strong&gt; + optimized rewrite — $0.125 USDC via x402&lt;br&gt;
POST &lt;a href="https://pqs.onchainintel.net/api/score/full" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/api/score/full&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch&lt;/strong&gt; up to 5 prompts — $0.25 USDC via x402&lt;br&gt;
POST &lt;a href="https://pqs.onchainintel.net/api/score/batch" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/api/score/batch&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A/B&lt;/strong&gt; compare two prompts — $1.25 USDC via x402&lt;br&gt;
POST &lt;a href="https://pqs.onchainintel.net/api/score/compare" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/api/score/compare&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
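&lt;p&gt;For agent code that needs to pick a tier at runtime, the reference above collapses into a small lookup table. Paths and prices are copied from the list; the helper itself is just a convenience sketch, not part of any official SDK.&lt;/p&gt;

```javascript
// Paths and USDC prices from the endpoint reference above.
const PQS_ENDPOINTS = {
  free:    { path: "/api/score/free",    price: 0 },
  score:   { path: "/api/score",         price: 0.025 },
  full:    { path: "/api/score/full",    price: 0.125 },
  batch:   { path: "/api/score/batch",   price: 0.25 },  // up to 5 prompts
  compare: { path: "/api/score/compare", price: 1.25 },
};

function pqsUrl(tier, base = "https://pqs.onchainintel.net") {
  const ep = PQS_ENDPOINTS[tier];
  if (!ep) throw new Error(`Unknown PQS tier: ${tier}`);
  return base + ep.path;
}
```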

&lt;h1&gt;
  
  
  x402 payment details
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Base mainnet:&lt;/strong&gt; CDP facilitator, USDC ERC-20, EIP-3009 TransferWithAuthorization&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solana:&lt;/strong&gt; x402-compatible, USDC SPL token&lt;br&gt;
No account required for x402 path — agent pays per call, no subscription&lt;/p&gt;

&lt;p&gt;An API key path is also available for subscription-based access.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP server&lt;/strong&gt;: &lt;code&gt;npx pqs-mcp-server&lt;/code&gt;&lt;br&gt;
Listed on Smithery (88/100), the Official MCP Registry, Glama, and mcp.so. Three tools: score_prompt (free), optimize_prompt ($0.025), compare_models ($1.25).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python SDK&lt;/strong&gt;: &lt;code&gt;pip install pqs-sdk&lt;/code&gt;&lt;br&gt;
Async support via AsyncPQSClient. Full x402 and API key auth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenAPI spec: &lt;a href="https://pqs.onchainintel.net/openapi.json" rel="noopener noreferrer"&gt;https://pqs.onchainintel.net/openapi.json&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this as infrastructure for the agent economy: score the input before the inference fires. Cheaper than one bad prompt.&lt;/p&gt;

&lt;p&gt;@kenburbary | pqs.onchainintel.net&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>web3</category>
    </item>
    <item>
      <title>I built an npm middleware that scores your LLM prompts before they hit your agent workflow</title>
      <dc:creator>OnChainAIIntel</dc:creator>
      <pubDate>Sat, 04 Apr 2026 12:14:05 +0000</pubDate>
      <link>https://forem.com/onchainaiintel/i-built-an-npm-middleware-that-scores-your-llm-prompts-before-they-hit-your-agent-workflow-53ci</link>
      <guid>https://forem.com/onchainaiintel/i-built-an-npm-middleware-that-scores-your-llm-prompts-before-they-hit-your-agent-workflow-53ci</guid>
      <description>&lt;p&gt;The problem with most LLM agent workflows is that nobody is checking the quality of the prompts going in.&lt;/p&gt;

&lt;p&gt;Garbage in, garbage out. At scale, with agents firing hundreds of prompts per day, the garbage compounds fast.&lt;/p&gt;

&lt;p&gt;I built &lt;code&gt;x402-pqs&lt;/code&gt; to fix this. It's an Express middleware that intercepts prompts before they hit any LLM endpoint, scores them for quality, and adds the score to the request headers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;x402-pqs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pqsMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x402-pqs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;express&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;pqsMiddleware&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// warn if prompt scores below 10/40&lt;/span&gt;
  &lt;span class="na"&gt;vertical&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// scoring context&lt;/span&gt;
  &lt;span class="na"&gt;onLowScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;warn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// warn | block | ignore&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/chat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Prompt score:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grade&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request gets these headers added automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;X-PQS-Score&lt;/code&gt; —&amp;gt; numeric score (0-40)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;X-PQS-Grade&lt;/code&gt; —&amp;gt; letter grade (A-F)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;X-PQS-Out-Of&lt;/code&gt; —&amp;gt; maximum score (40)&lt;/li&gt;
&lt;/ul&gt;
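&lt;p&gt;Downstream services can act on those headers without knowing anything else about the middleware. A sketch, with header names taken from the list above; the 10-point escalation cutoff mirrors the &lt;code&gt;threshold&lt;/code&gt; option in the usage example and is otherwise my own assumption.&lt;/p&gt;

```javascript
// Reads the middleware's headers downstream; Node lowercases header names.
function parsePqsHeaders(headers) {
  return {
    score: Number(headers["x-pqs-score"]),
    grade: headers["x-pqs-grade"],
    outOf: Number(headers["x-pqs-out-of"]),
  };
}

// Example policy: route low-scoring prompts to review instead of the LLM.
function shouldEscalate(headers, threshold = 10) {
  const { score } = parsePqsHeaders(headers);
  return Number.isFinite(score) && score < threshold;
}
```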

&lt;h2&gt;
  
  
  How the scoring works
&lt;/h2&gt;

&lt;p&gt;PQS scores prompts across 8 dimensions using 5 cited academic frameworks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt-side (4 dimensions):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specificity —&amp;gt; does the prompt define what it wants precisely?&lt;/li&gt;
&lt;li&gt;Context —&amp;gt; does it give the model enough to work with?&lt;/li&gt;
&lt;li&gt;Clarity —&amp;gt; are the directives unambiguous?&lt;/li&gt;
&lt;li&gt;Predictability —&amp;gt; would different runs produce consistent results?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Output-side (4 dimensions):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completeness, Relevancy, Reasoning depth, Faithfulness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source frameworks: PEEM (Dongguk University, 2026) · RAGAS · MT-Bench · G-Eval · ROUGE&lt;/p&gt;

&lt;h2&gt;
  
  
  Real example
&lt;/h2&gt;

&lt;p&gt;This prompt: &lt;code&gt;"who are the smartest wallets on solana right now"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Scored &lt;strong&gt;9/40&lt;/strong&gt; —&amp;gt; Grade D.&lt;/p&gt;

&lt;p&gt;The optimized version scored &lt;strong&gt;35/40&lt;/strong&gt; —&amp;gt; Grade A. &lt;/p&gt;

&lt;p&gt;A 26-point improvement.&lt;/p&gt;

&lt;p&gt;Same model. Same API. Completely different output quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The payment layer
&lt;/h2&gt;

&lt;p&gt;The scoring API uses &lt;a href="https://x402.org" rel="noopener noreferrer"&gt;x402&lt;/a&gt;, an HTTP-native micropayment protocol now governed by the Linux Foundation, with Coinbase, Cloudflare, AWS, Stripe, Google, Microsoft, Visa, and Mastercard as founding members.&lt;/p&gt;

&lt;p&gt;Agents can call and pay for scoring autonomously — no API keys, no subscriptions. Just a wallet and $0.001 USDC per score.&lt;/p&gt;

&lt;p&gt;There's also a &lt;strong&gt;free tier&lt;/strong&gt; with no payment required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://pqs.onchainintel.net/api/score/free &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"prompt": "your prompt here", "vertical": "general"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"out_of"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grade"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"D"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"upgrade"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get full dimension breakdown at /api/score for $0.001 USDC"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The data angle
&lt;/h2&gt;

&lt;p&gt;Every scored prompt pair goes into a corpus. At scale this becomes training data for a domain-specific prompt quality model. The thesis is similar to what Andrej Karpathy described recently about LLM knowledge bases: the data compounds in value over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;npm: &lt;a href="https://npmjs.com/package/x402-pqs" rel="noopener noreferrer"&gt;x402-pqs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/OnChainAIIntel/x402-pqs" rel="noopener noreferrer"&gt;OnChainAIIntel/x402-pqs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;API: &lt;a href="https://pqs.onchainintel.net" rel="noopener noreferrer"&gt;pqs.onchainintel.net&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Free endpoint: &lt;code&gt;POST https://pqs.onchainintel.net/api/score/free&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would love feedback from anyone building agent workflows. What scoring dimensions would you add?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>node</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
