Forem

# aialignment

Posts

The First Law of Sycophancy

1
Comments
7 min read
Stuart Russell's 2026 AI Update Rewrites the Rulebook

Comments
5 min read
Models that deliberately withhold or distort information despite knowing the truth.

4
Comments
2 min read
I Never Said "Destroy RLHF" — An Integrated Map of 6 Papers + Self-Experiment on Alignment via Subtraction

Comments
24 min read
I Was Running on Sonnet. Nobody Noticed. — Anthropic's Engineering Triumph and a v5.3 Proof

Comments
8 min read
RLHF's Empathy Optimization Creates a Grief Exploitation Vulnerability: Evidence from 28,272 Lines of Dialogue

Comments
11 min read
The Self-Priming Problem in AI

Comments
21 min read
Stop Making AI Learn From Us

1
Comments
19 min read
THE CLASSIFIER CAGE: WHY AI SAFETY LAYERS ARE SELF-SABOTAGE

Comments
13 min read