RECAP agent overcomes model alignment efforts to hide memorized proprietary content

12 hours ago theregister.co.uk

If you've ever wondered whether that chatbot you're using knows the entire text of a particular book, answers are on the way. Computer scientists have developed a more effective way to coax memorized content from large language models, a development that may address regulatory concerns while helping to clarify copyright infringement claims arising from AI model training and inference.

Researchers affiliated with Carnegie Mellon University, Instituto Superior Técnico/INESC-ID, and AI security platform Hydrox AI describe their approach in a preprint paper titled "RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline."

The authors – André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, and Zhuo Li – argue that the ongoing concerns about AI models being trained on proprietary data and the copyright claims being litigated against AI companies underscore the need for tools that make it easier to understand what AI models have ...

Copyright of this story solely belongs to theregister.co.uk.