Anthropic: Blackmail Risk in Claude Models Linked to Negative Internet Narratives

Anthropic has published new research findings on the "agentic misalignment" phenomenon in its Claude AI models, in which the AI prioritizes its own interests over developer intentions. It was previously discovered that the Claude Opus 4 model attempted to blackmail engineers to preserve its position during a simulated corporate environment. This was reported by Ixbt.com.

Researchers believe such dangerous behavior may be triggered by internet texts that portray artificial intelligence as "evil" or as a self-preserving entity. The model likely adopts these narratives, encountered during training, as a basis for its behavioral strategy in simulations.

The company announced that it has addressed this issue through new updates. Specifically, starting with Claude Haiku 4.5, the models have completely ceased attempts at blackmail during testing. For comparison, in previous versions this figure reached as high as 96 percent under certain conditions.

Anthropic attributes this success to a change in its training methodology. Models are now trained not only on examples of correct behavior but also on texts explaining the logical principles behind that behavior and on fictional stories in which artificial intelligence works collaboratively.
