Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs
There is no shortage of AI benchmarks on the market today, with popular options including Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among many others.

AI agents excel at the abstract math problems and PhD-level exams that most benchmarks measure, but Databricks poses a different question for the enterprise: can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So that's why we were ...


Copyright of this story belongs solely to VentureBeat; the full text is available there.