New Salesforce Benchmarks to Build Trust in Enterprise AI

AI is no silver bullet.
Although its capabilities can enhance areas like automation, hurdles remain.
Salesforce is taking direct aim at one of enterprise AI’s biggest obstacles: the inconsistency of large language models when deployed in real-world business environments.
While AI can shine at creative writing and complex problem-solving, it often falters on routine enterprise tasks, leading to unpredictable performance that undermines trust and slows adoption.
In response, Salesforce AI Research has launched its inaugural AI Research in Review, unveiling a suite of new benchmarks, models and frameworks designed to close the gap between AI’s raw capability and the dependable, context-aware performance that businesses demand.
What is jagged intelligence?
One of enterprise technology’s most pressing challenges is jagged intelligence: the gap between LLMs’ raw intelligence and reliable business performance.
This problem has a very real-world impact as it undermines AI adoption in critical business workflows.
Through new benchmarks, specialised models and testing frameworks, Salesforce is pioneering what it calls Enterprise General Intelligence (EGI) — AI systems that combine human-like adaptability with machine-scale reliability in complex business environments.
By focusing on foundational research, rigorous real-world testing and product-ready innovation, Salesforce aims to build the next generation of intelligent, trustworthy and versatile AI agents.
Dr Silvio Savarese, Salesforce’s Chief Scientist and Head of AI Research, says: “While LLMs may excel at standardised tests, plan intricate trips and generate sophisticated poetry, their brilliance often stumbles when faced with the need for reliable and consistent task execution in dynamic, unpredictable enterprise environments.
“We define EGI as purpose-built AI agents for business optimised not just for capability, but for consistency, too.
“While Artificial General Intelligence (AGI) may conjure images of superintelligent machines surpassing human intelligence, businesses aren’t waiting for that distant, illusory future. They’re applying these foundational concepts now to solve real-world challenges at scale.”
Salesforce AI Research: A deep dive
To tackle the critical challenges posed by jagged intelligence, Salesforce AI Research operates around three core pillars:
- Foundational research: Pinpointing key industry challenges and driving research that directly addresses them
- Customer incubation: Piloting prototypes with customers in real-world simulation environments
- Product innovation: Proving value and readiness by transforming prototypes and research pilots into enterprise-grade solutions
By advancing intelligent agents with stronger reasoning and Retrieval-Augmented Generation (RAG) capabilities, Salesforce AI Research aims to let those agents access, synthesise and apply information in real time.
This, Salesforce says, supports the reduction of jaggedness and drives more consistent, contextually aware decisions across complex tasks.
“Today’s AI is jagged, so we need to work on that,” says Dr Shelby Heinecke, Senior Manager of Research at Salesforce.
“But how can we work on something without measuring it first? That’s exactly what this SIMPLE benchmark is.”
SIMPLE is a publicly available dataset of 225 reasoning questions that humans find easy, designed to expose LLMs’ inconsistent logical capabilities.
By quantifying where models fail on straightforward problems, despite excelling on complex ones, it gives developers concrete targets for reducing jaggedness.
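To make the idea of measuring jaggedness concrete, here is a minimal Python sketch. It is not Salesforce's methodology or the SIMPLE scoring code: the `Result` record, the "easy"/"hard" labels, the `jaggedness_gap` metric and the sample data are all hypothetical, chosen only to show how a gap between hard-task and easy-task accuracy could be quantified and turned into a list of failures to fix.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One graded model answer (hypothetical record format)."""
    question_id: str
    difficulty: str  # "easy" or "hard"; illustrative labels only
    correct: bool


def accuracy(results, difficulty):
    """Fraction of correct answers among questions of one difficulty."""
    subset = [r for r in results if r.difficulty == difficulty]
    if not subset:
        raise ValueError(f"no results with difficulty {difficulty!r}")
    return sum(r.correct for r in subset) / len(subset)


def jaggedness_gap(results):
    """Hard-question accuracy minus easy-question accuracy.

    A positive value means the model does better on hard questions
    than on easy ones, which is the pattern the article calls jagged.
    """
    return accuracy(results, "hard") - accuracy(results, "easy")


def failed_easy_questions(results):
    """IDs of easy questions the model got wrong: the concrete targets."""
    return [r.question_id for r in results
            if r.difficulty == "easy" and not r.correct]


# Made-up sample data, for illustration only.
results = [
    Result("q1", "easy", True),
    Result("q2", "easy", False),
    Result("q3", "easy", False),
    Result("q4", "easy", True),
    Result("q5", "hard", True),
    Result("q6", "hard", True),
    Result("q7", "hard", True),
    Result("q8", "hard", False),
]

easy_acc = accuracy(results, "easy")
hard_acc = accuracy(results, "hard")
gap = jaggedness_gap(results)
targets = failed_easy_questions(results)
print(f"easy={easy_acc:.2f} hard={hard_acc:.2f} gap={gap:.2f} targets={targets}")
```

In this toy example the model answers half of the easy questions and three quarters of the hard ones correctly, so the gap is positive and the two failed easy questions become the list to work on.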
But Salesforce doesn’t stop there.
Salesforce AI Research is focused on creating AI that is not just intelligent, but also trustworthy and adaptable for enterprise use.
To achieve this, it is improving data understanding with SFR-Embedding models and strengthening trust by testing AI performance in real-world scenarios with CRMArena, assessing accuracy with ContextualJudgeBench and guarding against harmful content with SFR-Guard.
On top of this, it is enhancing versatility beyond traditional LLMs with efficient action models like xLAM and problem-solving frameworks like TACO.
These efforts aim to deliver AI agents that are reliable and effective in complex business environments.
Itai Asseo, Senior Director of Incubation and Brand Strategy at AI Research, says: “When we’re talking to customers, one of the main pain points that we have is that when dealing with enterprise data, there’s a very low tolerance to actually provide answers that are not accurate and that are not relevant.
“We’ve made a lot of progress, whether it’s with reasoning engines, with RAG techniques and other methods around LLMs.”
Savarese adds: “It’s not about replacing humans. It’s about being in charge.”
Technology Magazine is a BizClik brand

