OpenAI, Anthropic Swap Safety Reviews

AI Giants Evaluated Each Other's Newer Models for Safety Risks Rashmi Ramesh (rashmiramesh_) • August 28, 2025

Image: Shutterstock

OpenAI and Anthropic swapped artificial intelligence models evaluations over the summer, testing the other company's models for behaviors that could indicate misalignment risks. The companies released their findings simultaneously, finding that no model was severely problematic, but that all demonstrated troubling behaviors in artificial testing scenarios.

The exercise involved OpenAI testing Anthropic's Claude Opus 4 and Claude Sonnet 4 models, while Anthropic evaluated OpenAI's GPT-4o, GPT-4.1, o3 and o4-mini models. Both companies disabled some safety filters.

The tests focused on "agentic misalignment evaluations," which involved placing AI systems in simulated scenarios with significant autonomy to observe behavior under stress conditions that might reveal alignment issues.

Auto-grading was unreliable in many cases, with both companies ...

Copyright of this story solely belongs to bankinfosecurity . To see the full text click HERE

Share:

More related news