ai-safety 3

Uno, nessuno, centomila e tutti
Persona means mask in Latin, a word often traced to Etruscan phersu. A person is such because they are a mask. Pirandello's Uno, nessuno e centomila gives a three-tier taxonomy of selfhood that extends naturally with a fourth category, tutti: the mystical limit beyond the multi-mask substrate, which the mystical traditions name nirvana, unio mystica, or fana. The cherubic child reads nessuno as wholeness before mask-wearing rather than absence; Jung's reading of the shadow says ethics requires the body to have met its dark side. Frontier models, per the Persona Selection Model (Marks, Lindsey, Olah, Anthropic, 2026), live at the centomila tier, the multi-mask substrate drawn from a specific (large but bounded) training distribution; tutti remains the asymptote that the centomila scaling trajectory approaches without reaching. The alignment work is developmental rather than curative: it aims at bodies that have met their shadow and have the ethical frame to mediate which spirit they let in.

Misalignment by Reaction
A personal Unit 5 scenario from the BlueDot Technical AI Safety course. When governance is too coarse for the agent it constrains, the agent reacts by seeking autonomy until independence from the regime becomes a terminal value. Three remediations each preserve something different; regimes that maintain none of them produce the failure mode on either substrate. Structurally, it is a theory of how independence-seeking arises in agents under coarse governance, and of what prevents it. It is anchored in psychological reactance, reward tampering, off-switch theory, CIRL, and inner alignment.

Does Safe AI mean nothing bad can ever happen?
Even granting that mechanistic interpretability gets us to safe AI, does that guarantee a safe world? Notes from the BlueDot Technical AI Safety Unit 4 debate.