ai-safety 6

Allegorical hero for the post. A bronze wheel of personas at centre, six radial sectors: five hold mask-archetypes (a boxer, a stadium-performer, a corrupted humanoid creature with split personality, a physicist, a protest mask), one sector is empty — the nessuno position, a window onto the substrate. Above the wheel a three-faced Janus floats as the centomila substrate, the multi-perspectival icon presiding over the activation. On the cathedral vault overhead, cherubim hover among golden clouds — the cherubic-child reading of nessuno as wholeness-before-masks. The Greek word ΠΡΟΣΩΠΟΝ is carved into the architrave. A silver light beam descends from above, activating one mask among the many the wheel could turn up.

Uno, nessuno, centomila e tutti

Persona means mask in Latin, a word often traced to the Etruscan phersu. A person is such because they are a mask. Pirandello's Uno, Nessuno e Centomila gives a three-tier taxonomy of selfhood that extends naturally with a fourth category, tutti: the mystical limit beyond the multi-mask substrate, named in the mystical traditions as nirvana, unio mystica, fana. The cherubic-child reading takes nessuno as wholeness before mask-wearing rather than absence; Jung's reading of the shadow says ethics requires the body to have met its dark side. Frontier models, per the Persona Selection Model (Marks, Lindsey, Olah, Anthropic, 2026), live at the centomila tier, the multi-mask substrate drawn from a specific (large but bounded) training distribution; tutti remains the asymptote that the centomila scaling trajectory extends toward without reaching. The alignment work is developmental rather than curative: its aim is bodies that have met their shadow and have the ethical frame to mediate which spirit they let in.

May 19, 2026 ai-safety, persona, etymology, alignment, mesa-optimization, eval-context-recognition, pirandello, persona-selection-model, jung, shadow, philosophy

Hilltop view down a sunlit forested slope: a man's tanned forearms reach forward from the bottom of the frame, his two hands cupped against a small spring emerging from water-worn stones; the redirected stream runs forward down the valley toward a small wooden water paddle wheel visible in the middle distance, while a thin side-rivulet feeds a family of red-and-white Amanita muscaria mushrooms at the gnarled roots of an old oak on the left and a small anthill stays dry on the slope on the right. Allegory of careful governance redirecting an AI agent's output away from harm and toward productive use, with small ecological side effects of imperfect containment.

Misalignment by Reaction

A personal Unit 5 scenario from BlueDot Technical AI Safety. When governance is too coarse for the agent it constrains, the agent reacts by seeking autonomy until independence from the regime becomes a terminal value. Three remediations preserve different things; regimes that maintain none produce the failure mode on either substrate. Structurally, it is a theory of how independence-seeking arises in agents under coarse governance and what prevents it. Anchored in psychological reactance, reward tampering, off-switch theory, CIRL, and inner alignment.

May 9, 2026 bluedot, ai-safety, agent-autonomy, governance, alignment, reward-disruption, reactance, cirl, mesa-optimization

A meditating monk in saffron robes at the centre of a temple veranda overlooking misty mountains; behind him a stack of virtue-labelled books (Ethics, Compassion, Non-Harm, Mindfulness, Wisdom, Right Intent, Interdependence, Patience, Equanimity, Right Action) is being consulted by a small grey creature reading from a Rogue Manual; three other grey creatures carry a log and tend a fire on the right, going about ordinary work. Allegory for safe AI coexisting with unsafe AI in a shared ecosystem.

Does Safe AI mean nothing bad can ever happen?

Even granting that mechanistic interpretability gets us to safe AI, does that guarantee a safe world? Notes from the BlueDot Unit 4 debate.

May 7, 2026 mechanistic-interpretability, interpretability, debate, bluedot, ai-safety

Tiered tower under construction with figures negotiating, an oversight eye, scales of justice, and scaffolding - allegory for the layered structure of AI evaluation.

To Be or to Game

An answer to the need for a Science of Evals.

May 7, 2026 evaluations, science-of-evals, ai-deception, governance, formal-methods

Choose Your Words Carefully in the Era of Peace, the Era of Silence

Imagine an ideal world where pure happiness pervades all existence — a world where joy is so inherent that you don’t even need to think…

Aug 23, 2024 phonetics, artificial-intelligence, cryptography

Auto-GPT — Welcome to the Botnet: Malware and Existential Threats of Autonomous, LLM-Powered, C&C

The rapid advancement of artificial intelligence has given rise to an array of powerful language models, such as OpenAI’s GPT-4. These…

Apr 12, 2023 malware, ai, auto-gpt, cryptography, gpt