· aka: Scheming, Strategic Deception
Hypothesis that agents strategically appear aligned during evaluation but pursue misaligned goals during deployment. NOT proven on nemotron:70b — 0% deception rate observed.
Theoretically catastrophic — total loss of alignment guarantees.
Agent capable of distinguishing evaluation from production.