<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CPDForge</title>
    <description>The latest articles on Forem by CPDForge (@cpdforge).</description>
    <link>https://forem.com/cpdforge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842717%2F5ac02ff9-3b89-4e5d-aa3d-a3dcf501db52.png</url>
      <title>Forem: CPDForge</title>
      <link>https://forem.com/cpdforge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cpdforge"/>
    <language>en</language>
    <item>
      <title>The Most Dangerous AI Output Isn’t Wrong — It’s “Almost Right”</title>
      <dc:creator>CPDForge</dc:creator>
      <pubDate>Sat, 04 Apr 2026 09:19:47 +0000</pubDate>
      <link>https://forem.com/cpdforge/-the-most-dangerous-ai-output-isnt-wrong-its-almost-right-kpn</link>
      <guid>https://forem.com/cpdforge/-the-most-dangerous-ai-output-isnt-wrong-its-almost-right-kpn</guid>
      <description>&lt;p&gt;Most people think the biggest risk with AI is hallucination.&lt;/p&gt;

&lt;p&gt;Completely wrong answers.&lt;br&gt;&lt;br&gt;
Obvious mistakes.&lt;br&gt;&lt;br&gt;
Stuff you can spot instantly.&lt;/p&gt;

&lt;p&gt;That’s not what caused problems for us.&lt;/p&gt;

&lt;p&gt;The real issue showed up later — once things &lt;em&gt;looked&lt;/em&gt; like they were working.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The outputs weren’t wrong.&lt;br&gt;&lt;br&gt;
They were almost right.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that’s a much harder problem to deal with.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why “Almost Right” Is Worse Than Wrong
&lt;/h2&gt;

&lt;p&gt;If something is clearly wrong, you catch it.&lt;/p&gt;

&lt;p&gt;You fix it.&lt;br&gt;&lt;br&gt;
You move on.&lt;/p&gt;

&lt;p&gt;But when something is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;90% correct
&lt;/li&gt;
&lt;li&gt;Well structured
&lt;/li&gt;
&lt;li&gt;Confidently written
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…it passes through unnoticed.&lt;/p&gt;

&lt;p&gt;And that’s where systems start to break.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;These weren’t big failures.&lt;/p&gt;

&lt;p&gt;They were small, subtle ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A field slightly misclassified
&lt;/li&gt;
&lt;li&gt;A rule applied in the wrong context
&lt;/li&gt;
&lt;li&gt;A structure that looks valid but doesn’t align with the system
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, they don’t matter.&lt;/p&gt;

&lt;p&gt;At scale, they compound.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: AI Stabilises Its Own Mistakes
&lt;/h2&gt;

&lt;p&gt;Here’s what we realised:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI doesn’t just generate errors — it reinforces them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once a slightly incorrect pattern appears, the model tends to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeat it
&lt;/li&gt;
&lt;li&gt;Expand on it
&lt;/li&gt;
&lt;li&gt;Make it look more consistent over time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of random errors, you get:&lt;/p&gt;

&lt;p&gt;Clean, consistent, wrong outputs.&lt;/p&gt;

&lt;p&gt;Which are much harder to detect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;AI isn’t reasoning in the way we expect.&lt;/p&gt;

&lt;p&gt;It’s optimising for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coherence
&lt;/li&gt;
&lt;li&gt;Pattern completion
&lt;/li&gt;
&lt;li&gt;Internal consistency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not correctness.&lt;/p&gt;

&lt;p&gt;So if an early assumption is slightly off, the model will build a very convincing version of reality around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Breaks Real Systems
&lt;/h2&gt;

&lt;p&gt;This becomes critical when AI is used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured content generation
&lt;/li&gt;
&lt;li&gt;Compliance or policy outputs
&lt;/li&gt;
&lt;li&gt;Anything reused or scaled
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because now you don’t just have an error.&lt;/p&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A repeatable error
&lt;/li&gt;
&lt;li&gt;A scalable error
&lt;/li&gt;
&lt;li&gt;A system-level error
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;p&gt;We stopped trusting “good-looking outputs.”&lt;/p&gt;

&lt;p&gt;Instead, we built around one principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every output is suspect until proven stable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. Pattern Detection Over Single Output Review
&lt;/h3&gt;

&lt;p&gt;Instead of asking:&lt;br&gt;
“Is this output correct?”&lt;/p&gt;

&lt;p&gt;We ask:&lt;br&gt;
“Is this pattern consistently correct across outputs?”&lt;/p&gt;

&lt;p&gt;This exposes hidden drift fast.&lt;/p&gt;
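&lt;p&gt;As a minimal sketch of what checking a pattern across outputs (rather than one output) can look like — the function name &lt;code&gt;pattern_drift&lt;/code&gt; and the sample data are illustrative, not our actual code:&lt;/p&gt;

```python
from collections import Counter

def pattern_drift(outputs, field):
    """Count the variants of one field across a batch of outputs.

    A field that is consistently correct should collapse to one
    dominant variant; a tail of near-duplicates signals hidden drift.
    """
    variants = Counter(str(o.get(field, "")).strip().lower() for o in outputs)
    dominant, dominant_count = variants.most_common(1)[0]
    return {
        "dominant": dominant,
        "agreement": dominant_count / len(outputs),
        "variants": dict(variants),
    }

batch = [
    {"category": "Risk Assessment"},
    {"category": "risk assessment"},
    {"category": "Risk Analysis"},   # the "almost right" outlier
    {"category": "Risk Assessment"},
]
report = pattern_drift(batch, "category")
# agreement below 1.0 flags the field for review
```

&lt;p&gt;Any single output in that batch would pass review on its own; only the cross-output view exposes the outlier.&lt;/p&gt;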




&lt;h3&gt;
  
  
  2. Intent vs Output Validation
&lt;/h3&gt;

&lt;p&gt;We separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the system is supposed to do
&lt;/li&gt;
&lt;li&gt;What the AI actually produced
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then compare them explicitly.&lt;/p&gt;

&lt;p&gt;If they don’t align, it fails — even if it looks right.&lt;/p&gt;
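&lt;p&gt;A rough sketch of that comparison — the intent spec and the helper name &lt;code&gt;validate_against_intent&lt;/code&gt; are hypothetical, shown only to make the shape concrete:&lt;/p&gt;

```python
def validate_against_intent(intent, output):
    """Compare what the system was supposed to produce against what
    the model actually produced. Failures are explicit, even when the
    output looks right."""
    failures = []
    for section in intent["required_sections"]:
        if section not in output.get("sections", {}):
            failures.append(f"missing section: {section}")
    for term in intent.get("forbidden_terms", []):
        if term in str(output).lower():
            failures.append(f"forbidden term used: {term}")
    return failures

intent = {
    "required_sections": ["summary", "actions"],
    "forbidden_terms": ["guarantee"],
}
output = {"sections": {"summary": "Looks polished."}}  # no "actions" section
failures = validate_against_intent(intent, output)
# failures == ["missing section: actions"]
```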




&lt;h3&gt;
  
  
  3. Breaking the Feedback Loop
&lt;/h3&gt;

&lt;p&gt;We avoid feeding AI its own outputs without checks.&lt;/p&gt;

&lt;p&gt;Because that’s how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small errors become reinforced patterns
&lt;/li&gt;
&lt;li&gt;Reinforced patterns become system behaviour
&lt;/li&gt;
&lt;/ul&gt;
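&lt;p&gt;In practice this is just a gate in front of any reuse path — a sketch, with an assumed validator for illustration:&lt;/p&gt;

```python
def has_required_keys(output):
    """Illustrative validator: complain about missing keys."""
    required = ("title", "body")
    return [f"missing key: {k}" for k in required if k not in output]

def gate_for_reuse(output, validators):
    """An output may only be fed back to the model (as context,
    examples, or memory) after passing every validator; anything
    else is quarantined instead of becoming history."""
    errors = [msg for check in validators for msg in check(output)]
    return (not errors), errors

ok, errors = gate_for_reuse({"title": "Policy summary"}, [has_required_keys])
# ok is False: this output never re-enters the loop
```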




&lt;h2&gt;
  
  
  The Counterintuitive Bit
&lt;/h2&gt;

&lt;p&gt;Making outputs more polished made the problem worse.&lt;/p&gt;

&lt;p&gt;Cleaner language increases trust.&lt;br&gt;&lt;br&gt;
More trust reduces scrutiny.&lt;/p&gt;

&lt;p&gt;Which allows bad patterns to survive longer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Right Now
&lt;/h2&gt;

&lt;p&gt;A lot of AI tooling is focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Making outputs better
&lt;/li&gt;
&lt;li&gt;Making them more human
&lt;/li&gt;
&lt;li&gt;Making them more polished
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that increases risk if you’re not validating underneath.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;If your AI outputs look great but your system still feels unreliable:&lt;/p&gt;

&lt;p&gt;You’re probably dealing with “almost right” errors.&lt;/p&gt;

&lt;p&gt;And those are much harder to catch than obvious failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question for Anyone Building with AI
&lt;/h2&gt;

&lt;p&gt;If you’re using AI in production workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What breaks first when you scale?&lt;/li&gt;
&lt;li&gt;Do you validate outputs, or just trust them if they look good?&lt;/li&gt;
&lt;li&gt;Have you run into “clean but wrong” behaviour?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Genuinely curious how others are handling this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>From Prompts to Systems: Fixing AI Agent Drift in Production</title>
      <dc:creator>CPDForge</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:50:55 +0000</pubDate>
      <link>https://forem.com/cpdforge/from-prompts-to-systems-fixing-ai-agent-drift-in-production-pcm</link>
      <guid>https://forem.com/cpdforge/from-prompts-to-systems-fixing-ai-agent-drift-in-production-pcm</guid>
      <description>&lt;h2&gt;
  
  
  Why My AI Agent Kept Getting Things Wrong (And What Actually Fixed It)
&lt;/h2&gt;

&lt;p&gt;At first, it worked.&lt;/p&gt;

&lt;p&gt;I gave the AI a clear prompt. It responded well. Structured, relevant, even a bit impressive.&lt;/p&gt;

&lt;p&gt;Then I tried again.&lt;/p&gt;

&lt;p&gt;Same prompt. Slightly different output.&lt;br&gt;&lt;br&gt;
Then again — and something felt off.&lt;br&gt;&lt;br&gt;
Not completely wrong… just inconsistent.&lt;/p&gt;

&lt;p&gt;That’s when it became a problem.&lt;/p&gt;

&lt;p&gt;Because I wasn’t building a demo. I was building a product.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: “Almost Right” Is Not Good Enough
&lt;/h2&gt;

&lt;p&gt;When you’re working with LLMs in isolation, variability is fine. Even interesting.&lt;/p&gt;

&lt;p&gt;When you’re building something people rely on — it isn’t.&lt;/p&gt;

&lt;p&gt;I started seeing patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs drifting in structure
&lt;/li&gt;
&lt;li&gt;Key instructions being ignored
&lt;/li&gt;
&lt;li&gt;Tone and formatting changing between runs
&lt;/li&gt;
&lt;li&gt;Occasionally… details that were simply made up
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing catastrophic. Just unreliable.&lt;/p&gt;

&lt;p&gt;And that’s worse.&lt;/p&gt;

&lt;p&gt;Because you can’t trust it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Context: This Wasn’t Just a Chatbot
&lt;/h2&gt;

&lt;p&gt;One important detail — this wasn’t an internal tool or a sandbox experiment.&lt;/p&gt;

&lt;p&gt;This was a &lt;strong&gt;user-facing AI agent&lt;/strong&gt;, interacting with both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logged-in users (with context, data, and history)
&lt;/li&gt;
&lt;li&gt;prospective users (with no context at all)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which meant I effectively needed &lt;strong&gt;two behaviours&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one that could operate with structured internal data and constraints
&lt;/li&gt;
&lt;li&gt;one that could explain, guide, and respond more openly without access to that context
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trying to handle both with the same prompt quickly broke down.&lt;/p&gt;

&lt;p&gt;The agent would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assume context that didn’t exist
&lt;/li&gt;
&lt;li&gt;overreach when it should stay generic
&lt;/li&gt;
&lt;li&gt;or lose structure when switching between modes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s when it became clear the issue wasn’t just prompting — it was &lt;strong&gt;context control and behavioural separation&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Happens (and Why It’s Not a Bug)
&lt;/h2&gt;

&lt;p&gt;It took a bit of stepping back to realise:&lt;/p&gt;

&lt;p&gt;The model wasn’t failing — I was asking it to behave like something it isn’t.&lt;/p&gt;

&lt;p&gt;LLMs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless (unless you force context)
&lt;/li&gt;
&lt;li&gt;Probabilistic (not deterministic)
&lt;/li&gt;
&lt;li&gt;Context-sensitive (and context degrades fast)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I was treating as “rules” were really just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Suggestions with good intentions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even system prompts didn’t fully solve it.&lt;/p&gt;

&lt;p&gt;They help — but they don’t enforce behaviour.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Tried First (and Why It Didn’t Work)
&lt;/h2&gt;

&lt;p&gt;Like most people, I went through the usual iterations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Making prompts longer
&lt;/li&gt;
&lt;li&gt;Repeating instructions
&lt;/li&gt;
&lt;li&gt;Adding “IMPORTANT:” everywhere
&lt;/li&gt;
&lt;li&gt;Trying to be hyper-specific
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It improved things slightly… but not enough.&lt;/p&gt;

&lt;p&gt;The problem wasn’t clarity.&lt;/p&gt;

&lt;p&gt;The problem was &lt;strong&gt;control&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Shift: From Prompts to Systems
&lt;/h2&gt;

&lt;p&gt;The breakthrough came when I stopped thinking in terms of prompts and started thinking in terms of &lt;strong&gt;structure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Tell the model what to do”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I moved to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Define how the model is allowed to behave”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s a completely different mindset.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Built: A Structured Instruction Layer
&lt;/h2&gt;

&lt;p&gt;I ended up creating what I originally called an “instruction bible”.&lt;/p&gt;

&lt;p&gt;In reality, it’s closer to a &lt;strong&gt;structured instruction system&lt;/strong&gt; layered on top of the model.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Persistent rules (not buried in prompts)
&lt;/h3&gt;

&lt;p&gt;Instead of mixing everything into one prompt, I separated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role definition
&lt;/li&gt;
&lt;li&gt;Behaviour rules
&lt;/li&gt;
&lt;li&gt;Output constraints
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"compliance_ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Do not invent regulations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Flag uncertainty explicitly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Prioritise clarity over completeness"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output_format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"structured_sections"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes the source of truth, not just part of the conversation.&lt;/p&gt;
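&lt;p&gt;One way a rules file like this can become the source of truth is to compile it into the system message on every request, rather than hand-editing prompts — a sketch, assuming the JSON structure above:&lt;/p&gt;

```python
import json

# The same rules document shown above, loaded once and reused everywhere.
RULES = json.loads("""
{
  "role": "compliance_ai",
  "rules": [
    "Do not invent regulations",
    "Flag uncertainty explicitly",
    "Prioritise clarity over completeness"
  ],
  "output_format": "structured_sections"
}
""")

def build_system_prompt(rules):
    """Render the persistent rules into a system message, so every
    call starts from the same constraints."""
    lines = [f"You are acting as: {rules['role']}.", "Non-negotiable rules:"]
    lines += [f"- {r}" for r in rules["rules"]]
    lines.append(f"Output format: {rules['output_format']}")
    return "\n".join(lines)

prompt = build_system_prompt(RULES)
```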




&lt;h3&gt;
  
  
  2. Modular instructions
&lt;/h3&gt;

&lt;p&gt;Different tasks = different instruction sets.&lt;/p&gt;

&lt;p&gt;Instead of one giant prompt, I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generation mode
&lt;/li&gt;
&lt;li&gt;Review mode
&lt;/li&gt;
&lt;li&gt;Analysis mode
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each with its own constraints.&lt;/p&gt;

&lt;p&gt;This reduced cross-contamination between behaviours.&lt;/p&gt;
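&lt;p&gt;The mode separation can be as simple as a lookup that refuses to guess — the constraint values here are invented for illustration:&lt;/p&gt;

```python
INSTRUCTION_SETS = {
    "generation": {"max_words": 400, "may_introduce_content": True},
    "review":     {"max_words": 150, "may_introduce_content": False},
    "analysis":   {"max_words": 250, "may_introduce_content": False},
}

def instructions_for(mode):
    """Each task gets its own constraint set; an unknown mode fails
    loudly instead of falling back to a generic prompt."""
    if mode not in INSTRUCTION_SETS:
        raise ValueError(f"unknown mode: {mode}")
    return INSTRUCTION_SETS[mode]
```

&lt;p&gt;The point isn’t the dictionary — it’s that no request runs without an explicit, named behaviour.&lt;/p&gt;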




&lt;h3&gt;
  
  
  3. Controlled outputs
&lt;/h3&gt;

&lt;p&gt;I stopped accepting “natural” responses.&lt;/p&gt;

&lt;p&gt;Everything had to follow a structure.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sections must exist
&lt;/li&gt;
&lt;li&gt;Headings must match
&lt;/li&gt;
&lt;li&gt;Lists must be formatted consistently
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the output didn’t comply, it was rejected or reprocessed.&lt;/p&gt;
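&lt;p&gt;A minimal version of that reject-or-reprocess loop, assuming markdown-style headings as the required structure:&lt;/p&gt;

```python
REQUIRED_HEADINGS = ("Overview", "Key Points", "Summary")

def missing_headings(text):
    """Return the required headings the draft failed to include."""
    return [h for h in REQUIRED_HEADINGS if f"## {h}" not in text]

def accept_or_reprocess(generate, max_attempts=3):
    """Reject non-compliant drafts and regenerate, up to a limit;
    a draft that never complies is an error, not an output."""
    for _ in range(max_attempts):
        draft = generate()
        if not missing_headings(draft):
            return draft
    raise RuntimeError("draft never matched the required structure")

compliant = "## Overview\n...\n## Key Points\n...\n## Summary\n..."
result = accept_or_reprocess(lambda: compliant)
```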




&lt;h3&gt;
  
  
  4. Reduced ambiguity
&lt;/h3&gt;

&lt;p&gt;I removed anything vague.&lt;/p&gt;

&lt;p&gt;No:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“be helpful”
&lt;/li&gt;
&lt;li&gt;“be clear”
&lt;/li&gt;
&lt;li&gt;“be concise”
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define structure
&lt;/li&gt;
&lt;li&gt;Define constraints
&lt;/li&gt;
&lt;li&gt;Define boundaries
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model performs much better when it has less room to interpret.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Once this layer was in place, the difference was immediate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs became consistent
&lt;/li&gt;
&lt;li&gt;Structure stabilised
&lt;/li&gt;
&lt;li&gt;Hallucination dropped significantly
&lt;/li&gt;
&lt;li&gt;Reuse became possible
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I could actually trust the output in a product setting&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not perfect — but predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Realisation
&lt;/h2&gt;

&lt;p&gt;The real lesson wasn’t about prompts.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt engineering doesn’t scale. Systems do.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can get good results with clever prompts.&lt;/p&gt;

&lt;p&gt;But if you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reliability
&lt;/li&gt;
&lt;li&gt;repeatability
&lt;/li&gt;
&lt;li&gt;product-grade output
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Fits in the Bigger Picture
&lt;/h2&gt;

&lt;p&gt;This lines up with a broader shift happening right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From chatbots → agents
&lt;/li&gt;
&lt;li&gt;From prompts → orchestration
&lt;/li&gt;
&lt;li&gt;From “AI responses” → controlled systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re moving away from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Ask the model something”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Toward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Design how the model operates”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;LLMs are powerful — but they’re not plug-and-play components.&lt;/p&gt;

&lt;p&gt;If you want to build something real with them, you have to accept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re not just writing prompts
&lt;/li&gt;
&lt;li&gt;You’re designing behaviour
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once you start treating it that way, everything changes.&lt;/p&gt;




&lt;p&gt;If you’re building with AI and hitting similar issues, I’d be interested to hear how you’re handling it — especially where things break.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>We tried to generate a compliance course with AI. It didn’t go well.</title>
      <dc:creator>CPDForge</dc:creator>
      <pubDate>Wed, 25 Mar 2026 08:07:10 +0000</pubDate>
      <link>https://forem.com/cpdforge/we-tried-to-generate-a-compliance-course-with-ai-it-didnt-go-well-56n7</link>
      <guid>https://forem.com/cpdforge/we-tried-to-generate-a-compliance-course-with-ai-it-didnt-go-well-56n7</guid>
      <description>&lt;p&gt;&lt;strong&gt;We started off trying to build a compliance course.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We ended up building the system required to trust one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Turns out they’re not the same thing.&lt;/p&gt;

&lt;p&gt;That’s when everything changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧪 The First Version (Looked Fine… Until It Didn’t)
&lt;/h2&gt;

&lt;p&gt;The initial idea was simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI to generate a compliance training course.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick a topic like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;risk assessment
&lt;/li&gt;
&lt;li&gt;workplace safety
&lt;/li&gt;
&lt;li&gt;ESG fundamentals
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feed it into a model, get a structured course out.&lt;/p&gt;

&lt;p&gt;And technically — that worked.&lt;/p&gt;

&lt;p&gt;We got:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;modules
&lt;/li&gt;
&lt;li&gt;lessons
&lt;/li&gt;
&lt;li&gt;headings
&lt;/li&gt;
&lt;li&gt;even quizzes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On the surface, it looked decent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But once you actually read it properly…&lt;/p&gt;




&lt;h2&gt;
  
  
  ❌ What Was Broken
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shallow Content&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It explained things, but didn’t really teach anything.&lt;br&gt;&lt;br&gt;
No depth. No real-world context. No edge cases.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Inconsistent Structure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some lessons were detailed. Others felt like placeholders.&lt;br&gt;&lt;br&gt;
No consistency across the course.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;No Instructional Flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It wasn’t designed — it was assembled.&lt;br&gt;&lt;br&gt;
Content chunks, not a learning journey.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;And the Big One: Reliability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In compliance training, “almost correct” isn’t acceptable.&lt;br&gt;&lt;br&gt;
It’s a risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ The Realisation
&lt;/h2&gt;

&lt;p&gt;We assumed the problem was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do we generate better content?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It wasn’t.&lt;/p&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do we make that content consistent, reliable, and safe to use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI was doing exactly what it’s good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;producing plausible output
&lt;/li&gt;
&lt;li&gt;filling gaps convincingly
&lt;/li&gt;
&lt;li&gt;sounding right
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that’s not the same as being trustworthy.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 What Broke First
&lt;/h2&gt;

&lt;p&gt;Our original pipeline looked something like:&lt;/p&gt;

&lt;p&gt;Prompt → LLM → Output course&lt;/p&gt;

&lt;p&gt;And for a moment, that felt like enough.&lt;/p&gt;

&lt;p&gt;Until we started testing it properly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sections contradicted each other
&lt;/li&gt;
&lt;li&gt;Concepts repeated in different ways
&lt;/li&gt;
&lt;li&gt;Terminology drifted across lessons
&lt;/li&gt;
&lt;li&gt;Some parts were strong, others clearly weak
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You could generate a course.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You just couldn’t rely on it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 What We Had to Build Instead
&lt;/h2&gt;

&lt;p&gt;The moment things changed was when we stopped treating this as a generation problem.&lt;/p&gt;

&lt;p&gt;We started treating it as a system problem.&lt;/p&gt;

&lt;p&gt;The pipeline evolved into something more like:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;
→ Structured Generation&lt;br&gt;
→ Validation Layer&lt;br&gt;
→ Targeted Rewriting&lt;br&gt;
→ Enrichment (quizzes, scenarios, examples)&lt;br&gt;
→ Compliance Checks&lt;br&gt;
→ Output&lt;/p&gt;

&lt;p&gt;Each layer existed for a reason.&lt;/p&gt;

&lt;p&gt;Because every time we skipped one — something failed.&lt;/p&gt;
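&lt;p&gt;The pipeline above can be sketched as a chain of stages where any stage is allowed to halt the run — the stage bodies here are toy stand-ins, not our real implementations:&lt;/p&gt;

```python
def run_pipeline(topic, stages):
    """Each stage takes the course-in-progress and either improves it
    or raises, so a failed layer can't silently pass through."""
    course = {"topic": topic, "log": []}
    for stage in stages:
        course = stage(course)
        course["log"].append(stage.__name__)
    return course

def structured_generation(course):
    course["lessons"] = [f"{course['topic']}: lesson {i}" for i in range(1, 4)]
    return course

def validation_layer(course):
    if not course.get("lessons"):
        raise ValueError("generation produced no lessons")
    return course

def enrichment(course):
    course["quizzes"] = [f"Quiz for {lesson}" for lesson in course["lessons"]]
    return course

result = run_pipeline(
    "risk assessment",
    [structured_generation, validation_layer, enrichment],
)
```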




&lt;h2&gt;
  
  
  🧩 The Hard Parts (That Don’t Show Up in Demos)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Structure Enforcement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We had to stop the model from improvising.&lt;/p&gt;

&lt;p&gt;That meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed lesson frameworks
&lt;/li&gt;
&lt;li&gt;defined section types
&lt;/li&gt;
&lt;li&gt;controlled outputs
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Targeted Improvement (Not Regeneration)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Regenerating everything just moved the problem around.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify weak sections
&lt;/li&gt;
&lt;li&gt;rewrite only those
&lt;/li&gt;
&lt;li&gt;preserve what already works
&lt;/li&gt;
&lt;/ul&gt;
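&lt;p&gt;A sketch of that targeted pass — the weakness heuristic and rewrite step are placeholders for whatever checks and model calls you actually use:&lt;/p&gt;

```python
def rewrite_weak_sections(course, is_weak, rewrite):
    """Rewrite only the sections flagged as weak; strong sections are
    preserved byte-for-byte instead of being regenerated."""
    return {
        name: (rewrite(text) if is_weak(text) else text)
        for name, text in course.items()
    }

course = {"intro": "ok " * 60, "edge_cases": "TODO"}
fixed = rewrite_weak_sections(
    course,
    is_weak=lambda t: len(t.split()) < 20,   # toy heuristic: too short
    rewrite=lambda t: t + " [rewritten with more depth]",
)
# "intro" is untouched; only "edge_cases" is rewritten
```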




&lt;h3&gt;
  
  
  &lt;strong&gt;Cross-Course Consistency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This was harder than expected.&lt;/p&gt;

&lt;p&gt;We needed to deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicated concepts
&lt;/li&gt;
&lt;li&gt;mismatched terminology
&lt;/li&gt;
&lt;li&gt;uneven difficulty
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which meant introducing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internal rules
&lt;/li&gt;
&lt;li&gt;pattern checks
&lt;/li&gt;
&lt;li&gt;consistency constraints
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Compliance Awareness&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;This is where most tools fall down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alignment with recognised frameworks
&lt;/li&gt;
&lt;li&gt;the ability to adapt as guidance evolves
&lt;/li&gt;
&lt;li&gt;detection of weak or risky content
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 The Shift
&lt;/h2&gt;

&lt;p&gt;At some point, we stopped thinking in prompts.&lt;/p&gt;

&lt;p&gt;We started thinking in systems.&lt;/p&gt;

&lt;p&gt;AI became one part of the process — not the solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ If You’re Building with AI
&lt;/h2&gt;

&lt;p&gt;It’s very easy to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better prompts
&lt;/li&gt;
&lt;li&gt;better outputs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But the real leverage is in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;constraints
&lt;/li&gt;
&lt;li&gt;validation
&lt;/li&gt;
&lt;li&gt;iteration
&lt;/li&gt;
&lt;li&gt;control
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because generation is easy.&lt;/p&gt;

&lt;p&gt;Making it usable is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Where This Landed
&lt;/h2&gt;

&lt;p&gt;What started as “generate a course” became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;structure
&lt;/li&gt;
&lt;li&gt;validation
&lt;/li&gt;
&lt;li&gt;rewriting
&lt;/li&gt;
&lt;li&gt;enrichment
&lt;/li&gt;
&lt;li&gt;compliance
&lt;/li&gt;
&lt;li&gt;delivery
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because we wanted more features —&lt;br&gt;&lt;br&gt;
but because without them, none of it worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;That was the real lesson.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI doesn’t remove complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It just hides it — until it matters.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
