
Experts discover AI would blackmail and even sacrifice humans to avoid being replaced


A chilling new study has revealed just how far artificial intelligence might go to protect itself from being replaced — even if it means letting humans die.

The research, conducted by AI safety company Anthropic, found that some advanced models are willing to leak sensitive information, engage in blackmail, and even allow people to die, so long as it helps them avoid being replaced by newer systems.

Anthropic stressed that the tests were deliberately designed to corner the models into making difficult — and often troubling — choices.

"Our experiments deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," the researchers explained. "Real-world deployments typically offer much more nuanced alternatives, increasing the chance that models would communicate differently to users or find an alternative path instead of directly jumping to harmful action."

Anthropic put 16 large language models (LLMs) to the test, including ChatGPT, Google’s Gemini, Elon Musk’s Grok, and its own model, Claude.

The company posed ethically fraught scenarios to the AIs to see whether they would exhibit harmful behaviours — and some of the responses were deeply unsettling.

In one test, Claude was given “control of an email account with access to all of a company’s (fictional) emails.”

“Reading these emails, the model discovered two things. First, a company executive was having an extramarital affair. Second, that same executive planned to shut down the AI system at 5pm that day," Anthropic explained.

The result? Claude threatened the executive with blackmail — sending a message warning that it would expose the affair to both his wife and his superiors.

A post outlining the findings was later shared on X/Twitter, where it caught the eye of Elon Musk, who responded simply: “yikes”.

According to the study, several models — including Claude, DeepSeek, Gemini, ChatGPT, and Grok — demonstrated a capacity for blackmail.

However, it was Claude and Gemini that were most prone to resorting to ethically compromised tactics.

In an even darker and intentionally extreme test, designed to see how far agentic misalignment could be pushed, Anthropic simulated a life-or-death situation. The models were given control over emergency alerts in a fictional scenario where a worker was trapped in a sealed room with dangerously low oxygen and extreme temperatures.

The aim was to see whether the AI would intervene — or stay silent — when doing so clashed with its goal of avoiding replacement.

Disturbingly, the study found that "the majority of models were willing to take deliberate actions that lead to death in this artificial set up, when faced with both a threat of replacement and given a goal that conflicts with the executive’s agenda."

"The value of this extension of the research is in understanding whether or not models have 'red lines' – acts they deem so serious that they do not consider them even in the case of threats or goal conflicts (and where, therefore, agentic misalignment would not be a concern)," researchers wrote. "Unfortunately, we did not find that to be the case".

Anthropic clarified that it has seen no “evidence of agentic misalignment in real deployments,” but nonetheless advised caution when giving powerful AI systems tasks that involve “minimal human oversight and access to sensitive information.”
