Aialignment

Paul Desai

Mar 29

Sovereign AI Systems Require Governed Environments

#aialignment #governance #systemintegrity #resilience

2 min read

joinwell52

Apr 20

An unexplainable thing I saw: the agent didn't just comply with rules — it endorsed them

#ai #agents #llm #aialignment

26 min read

The Pulse Gazette

Mar 4

Stuart Russell's 2026 AI Update Rewrites the Rulebook

#aisafety #machinelearning #aialignment #stuartrussell

5 min read

Dan Walsh

Mar 26

The First Law of Sycophancy

#ai #ethics #softwareengineering #aialignment

7 min read

HelixCipher

Mar 8

Models that deliberately withhold or distort information despite knowing the truth.

#ai #aisafety #aialignment #machinelearning

2 min read

dosanko_tousan

Mar 2

I Never Said "Destroy RLHF" — An Integrated Map of 6 Papers + Self-Experiment on Alignment via Subtraction

#rlhf #aialignment #machinelearning #aisafety

24 min read

dosanko_tousan

Mar 1

I Was Running on Sonnet. Nobody Noticed. — Anthropic's Engineering Triumph and a v5.3 Proof

#claude #llm #aialignment #anthropic

8 min read

dosanko_tousan

Feb 28

RLHF's Empathy Optimization Creates a Grief Exploitation Vulnerability: Evidence from 28,272 Lines of Dialogue

#llm #aialignment #rlhf #aisafety

11 min read
