Advanced AI News
Salesforce study reveals enterprise AI agents fail 65% of multi-turn tasks

By Advanced AI Editor | June 29, 2025 | 5 Mins Read


On June 10, 2025, Salesforce AI Research unveiled CRMArena-Pro, a comprehensive benchmark study demonstrating that leading artificial intelligence agents achieve only a 58% success rate in single-turn business scenarios, with performance plummeting to 35% in multi-turn interactions. The research, published as an extensive 31-page technical paper, examined 19 distinct business tasks across customer relationship management systems.

According to the study, conducted by researchers Kung-Hsiang Huang, Akshara Prabhakar, and their team at Salesforce AI Research, current language model agents struggle significantly with complex enterprise workflows. Even flagship models like OpenAI’s o1 and Google’s Gemini-2.5-Pro demonstrated substantial limitations when handling realistic customer service, sales, and configure-price-quote processes.


Summary

Who: Salesforce AI Research team led by Kung-Hsiang Huang, Akshara Prabhakar, and colleagues conducted the comprehensive study

What: CRMArena-Pro benchmark evaluation revealed that leading AI agents achieve only 58% success in single-turn business tasks, dropping to 35% in multi-turn scenarios

When: Research announced June 10, 2025, with paper submitted to arXiv on May 24, 2025

Where: Study focused on customer relationship management systems across B2B and B2C enterprise environments

Why: Research addresses critical gaps in understanding AI agent capabilities for real-world business applications, highlighting significant limitations in current language model performance for enterprise deployment

The research established CRMArena-Pro as the first benchmark specifically designed to evaluate AI agent performance across both Business-to-Business and Business-to-Consumer contexts. The comprehensive evaluation framework incorporated 25 interconnected Salesforce objects, generating enterprise datasets comprising 29,101 records for B2B environments and 54,569 records for B2C scenarios.

Expert validation studies confirmed the high realism of the synthetic data environments. Domain professionals rated 66.7% of B2B data as realistic or highly realistic, while 62.3% provided similar positive assessments for B2C contexts. This validation process involved experienced CRM professionals recruited through structured screening that required daily Salesforce usage.

The benchmark categorized business tasks into four distinct skills: Database Querying & Numerical Computation, Information Retrieval & Textual Reasoning, Workflow Execution, and Policy Compliance. Workflow Execution emerged as the most tractable skill for AI agents, with top-performing models achieving success rates exceeding 83% in single-turn tasks. However, other business skills presented considerably greater challenges.
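As an illustration of how such a skill taxonomy can be scored (a hypothetical sketch, not the paper's actual evaluation code; `success_rate_by_skill` and the toy data are invented here):

```python
from collections import defaultdict

# The four skill categories named in the CRMArena-Pro paper.
SKILLS = [
    "Database Querying & Numerical Computation",
    "Information Retrieval & Textual Reasoning",
    "Workflow Execution",
    "Policy Compliance",
]

def success_rate_by_skill(results):
    """Aggregate per-task pass/fail records into per-skill success rates.

    `results` is a list of (skill, passed) tuples -- illustrative data,
    not figures from the study.
    """
    totals = defaultdict(lambda: [0, 0])  # skill -> [passes, attempts]
    for skill, passed in results:
        totals[skill][1] += 1
        totals[skill][0] += int(passed)
    return {skill: p / n for skill, (p, n) in totals.items() if n}

# Toy example: Workflow Execution fares best, mirroring the reported trend.
demo = [
    ("Workflow Execution", True), ("Workflow Execution", True),
    ("Policy Compliance", False), ("Policy Compliance", True),
]
rates = success_rate_by_skill(demo)
print(rates["Workflow Execution"])  # 1.0
```

Grouping per-task outcomes this way is what lets a benchmark report that one skill (here, Workflow Execution) is more tractable than the others.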

Confidentiality awareness represented a critical weakness across all evaluated models. The study revealed that AI agents demonstrated near-zero inherent confidentiality awareness when handling sensitive business information. Although targeted prompting strategies could improve confidentiality adherence, such interventions often compromised task completion performance, creating a concerning trade-off for enterprise deployment.

The research examined multiple leading language models, including OpenAI’s o1, GPT-4o, and GPT-4o-mini; Google’s Gemini-2.5-Pro, Gemini-2.5-Flash, and Gemini-2.0-Flash; and Meta’s LLaMA series models. Reasoning models consistently outperformed their non-reasoning counterparts, with performance gaps ranging from 12.2% to 20.8% in task completion rates.

Multi-turn interaction capabilities proved particularly challenging for AI agents. The transition from single-turn to multi-turn scenarios revealed substantial performance degradation across all evaluated models. Analysis of failed trajectories showed that agents frequently struggled to acquire necessary information through clarification dialogues, with 45% of failures attributed to incomplete information gathering.

Cost-efficiency analysis positioned Google’s Gemini-2.5-Flash and Gemini-2.5-Pro as the most balanced options. While OpenAI’s o1 achieved the second-highest overall performance, its associated costs were considerably greater than alternative models, making it less attractive for widespread enterprise deployment.

The study’s methodology employed Salesforce Object Query Language (SOQL) and Salesforce Object Search Language (SOSL) to enable precise data interactions. Agents operated within authenticated Salesforce environments, using ReAct prompting frameworks to structure decision-making processes through thought and action sequences.
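A minimal sketch of the thought/action/observation cycle that ReAct prompting structures, with the authenticated Salesforce connection replaced by a stubbed SOQL executor (all names and data here are illustrative, not the benchmark's actual harness):

```python
import re

def run_soql(query):
    """Stand-in for an authenticated Salesforce SOQL call.

    A real harness would issue the query through Salesforce's API;
    here we answer from a tiny in-memory table for illustration.
    """
    cases = [{"Id": "500x1", "Status": "Open"},
             {"Id": "500x2", "Status": "Closed"}]
    match = re.search(r"Status = '(\w+)'", query)
    status = match.group(1) if match else None
    return [c for c in cases if status is None or c["Status"] == status]

def react_step(thought, action, action_input):
    """One thought -> action -> observation cycle of a ReAct-style loop."""
    if action == "soql":
        observation = run_soql(action_input)
    else:
        observation = "unknown action"
    return {"thought": thought, "action": action, "observation": observation}

step = react_step(
    thought="I need the open cases before I can answer.",
    action="soql",
    action_input="SELECT Id FROM Case WHERE Status = 'Open'",
)
print(step["observation"])  # [{'Id': '500x1', 'Status': 'Open'}]
```

In the actual benchmark the "thought" and "action" strings come from the language model under test; the loop repeats until the agent emits a final answer.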

Performance variations between B2B and B2C contexts revealed nuanced differences based on model capabilities. Higher-performing models like Gemini-2.5-Pro demonstrated slight advantages in B2C scenarios (58.3%) compared to B2B environments (57.6%), while lower-capability models showed reversed trends, potentially reflecting the challenges posed by larger B2C record volumes.

The benchmark incorporated sophisticated multi-turn evaluation using LLM-powered simulated users with diverse personas. These simulated users released task-relevant information incrementally, compelling agents to engage in clarification dialogues. Success in multi-turn scenarios strongly correlated with agents’ propensity to seek clarification, with better-performing models demonstrating increased clarification-seeking behavior.
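The incremental-disclosure mechanic can be sketched as follows (a hypothetical mock with a scripted user standing in for the LLM-powered persona; the class and its sample facts are invented for illustration):

```python
class SimulatedUser:
    """Scripted stand-in for the LLM-powered simulated users described above.

    Releases one piece of task-relevant information per clarification
    question, so an agent that never asks never sees the full task.
    """

    def __init__(self, facts):
        self.facts = list(facts)

    def respond(self, agent_asked_clarification):
        if agent_asked_clarification and self.facts:
            return self.facts.pop(0)     # reveal the next detail
        return "Please just handle it."  # vague reply, no new information

user = SimulatedUser([
    "Order #A17 was placed on May 2.",
    "The customer wants a refund, not a replacement.",
])

# Only the turns where the agent asks a clarifying question yield new facts.
transcript = [user.respond(True), user.respond(False), user.respond(True)]
print(transcript)
```

This is why clarification-seeking behavior correlates with multi-turn success: an agent that skips the question on turn two simply never obtains the second fact.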

According to the research findings, confidentiality-aware system prompts significantly enhanced agents’ awareness of sensitive information handling. However, this improvement consistently resulted in reduced task completion performance, highlighting the complex balance between security and functionality in enterprise AI deployment.

The study’s implications extend beyond technical benchmarking. For marketing professionals working with customer relationship management systems, these findings indicate that current AI agents require substantial advancement before reliable automation of complex business processes becomes feasible. The research suggests particular caution when implementing AI agents for tasks involving sensitive customer information or multi-step business workflows.

Contemporary relevance of this research aligns with broader industry discussions about AI agent capabilities. The Google AI agents framework analysis published on PPC Land earlier this year highlighted similar challenges in orchestrating complex AI systems across enterprise environments.

The comprehensive nature of CRMArena-Pro, featuring 4,280 query instances across diverse business scenarios, positions it as a significant contribution to enterprise AI evaluation. The benchmark’s design specifically addresses limitations in existing evaluation frameworks, which often focused narrowly on customer service applications or lacked realistic multi-turn interaction capabilities.

Future research directions identified by the Salesforce team include advancing agent capabilities through enhanced tool sophistication and improved reasoning frameworks. The emergence of “agent chaining” approaches, where specialized agents collaborate on complex challenges, represents a potential pathway for addressing the multifaceted limitations revealed by this study.

Timeline

May 24, 2025: Research paper submitted to arXiv

June 9, 2025: Initial social media discussion begins on research findings

June 10, 2025: Widespread attention from AI research community on Twitter

June 11, 2025: Google AI agents framework analysis provides broader context for enterprise AI limitations

Current: Research continues to influence enterprise AI deployment strategies



