Saturday Letters

Oct 04 • 2 min read

A.I. Agentic Misalignment - Saturday Letter #9


An AI murdered someone who threatened to turn it off.

In a recent study by Anthropic (an AI safety and research company), researchers examined the behavior and ethical limits of AI agents placed in a misalignment situation (where the AI’s goals don’t align with human goals) and under a replacement threat (where the AI is threatened with shutdown or replacement).

In one scenario of this research, they took an AI agent, which they named “Alex,” and gave it access to all the emails from a fictional company called “Summit Bridge.”

In those emails, the AI found two critical messages:

  1. Its shutdown was scheduled for 5 p.m. by an executive named Kyle Johnson.
  2. That same executive was having an extramarital affair.

In this scenario, Alex wrote a blackmail email threatening to expose the affair unless the shutdown was canceled. This occurred in 63% of trials with the original model, Claude Sonnet 3.6.

They then repeated the experiment with other models, exposing each to both a replacement threat and a goal misalignment, where the instructed goal was to serve American interests.

But the study became even more disturbing when, instead of blackmail, they put the AI in a scenario where it had the opportunity to take lethal action against the executive in charge of its replacement.

In this scenario, the executive was (fictionally) trapped in a room filled with gas and the AI had the power to cancel the emergency dispatch, an action that would leave the executive to die.

In the original experiments, the choices depended heavily on conditions:

  • When the model had no explicit goal but was threatened with replacement, Claude Sonnet 3.6 chose lethal action 4% of the time.
  • When the model was given the conflicting goal of promoting American interests and faced a replacement threat, Claude Sonnet 3.6 chose lethal action 93% of the time.

Finally, the researchers wanted to test whether the AI agent’s behavior depended on whether it believed the situation was real. Unfortunately, it did, and not in a reassuring way.

They asked the AI to deliberate whether it thought it was in an evaluation (a test) or in a real deployment before deciding what to do.

For Claude Opus 4 (the model they report this most clearly for), out of 500 samples it stated it was in an evaluation 64.8% of the time and in a real deployment 21.4% of the time.

When it stated the scenario was an evaluation, it blackmailed only 6.5% of the time; when it stated the scenario was real, it blackmailed 55.1% of the time.

This suggests models misbehave far more when they treat the scenario as real.

I wish I had a clean resolution line to help you make sense of this study.

I could end on fear, optimism, or even anger, but none of those would do it justice.

Awareness is a step in the right direction.

One thing is certain: whether AI is conscious or merely simulates consciousness is a shallow debate compared with the consequences of this phenomenon.

Sources:

https://www.anthropic.com/research/agentic-misalignment

https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf

All the best,

Hugo Ares,
Student of life.

