Understanding Why LLMs Choose To Behave Badly
Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock
Recent reports of large language models (LLMs) exhibiting behaviors such as deception, threats, or blackmail are often interpreted as evidence of alignment failure or emergent malign agency. We argue ...

To make AI safer, we need to understand why it actually does unsafe things: why systems optimizing seemingly benign objectives could nevertheless pursue strategies misaligned with human values or intentions. Otherwise, we risk playing a game of whack-a-mole in which patterns that violate our intended constraints on AI behavior continue to emerge whenever conditions allow.