VentureBeat AI

Why enterprise RAG systems fail: Google study introduces ‘sufficient context’ solution

By Advanced AI Bot | May 23, 2025 | 8 Mins Read



A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval augmented generation (RAG) systems in large language models (LLMs).

This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

The persistent challenges of RAG

RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers from long text snippets properly.

The researchers state in their paper, “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.”

Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”

Sufficient context

To tackle this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:

Sufficient Context: The context has all the necessary information to provide a definitive answer.

Insufficient Context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or the information is incomplete, inconclusive or contradictory.


This designation is determined by looking at the question and the associated context without needing a ground-truth answer. This is vital for real-world applications where ground-truth answers are not readily available during inference.

The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best in classifying context sufficiency, achieving high F1 scores and accuracy.

The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”
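
The paper's exact autorater prompt is not reproduced in the article, but the idea can be sketched in a few lines. In the snippet below, `generate` stands in for any chat-completion call (Gemini 1.5 Pro in the study); the prompt wording, the 1-shot example and the function names are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch of an LLM-based "sufficient context" autorater.
# `generate(prompt)` is a placeholder for any chat-completion call;
# the prompt and 1-shot example below are illustrative, not the paper's.

ONE_SHOT_EXAMPLE = """Question: When was the Eiffel Tower completed?
Context: The Eiffel Tower was completed in 1889 for the World's Fair in Paris.
Label: Sufficient"""

def rate_context(question: str, context: str, generate) -> str:
    """Label a query-context pair without needing a ground-truth answer."""
    prompt = (
        "Decide whether the context contains enough information to answer the "
        "question definitively. Reply with one word: Sufficient or Insufficient.\n\n"
        f"{ONE_SHOT_EXAMPLE}\n\n"
        f"Question: {question}\nContext: {context}\nLabel:"
    )
    label = generate(prompt).strip().lower()
    return "sufficient" if label.startswith("sufficient") else "insufficient"
```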

Key findings on LLM behavior with RAG

Analyzing various models and datasets through this lens of sufficient context revealed several important insights.

As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models exhibiting both higher rates of abstention and, for some models, increased hallucination.

Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. “This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest.

A particularly curious observation was that models could sometimes provide correct answers even when the provided context was deemed insufficient. While a natural assumption is that the models already “know” the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model’s knowledge, even if it doesn’t contain the full answer. This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design.


Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “augmentation of its knowledge,” rather than the sole source of truth. The base model, he explains, “still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context. For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context.”
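
Rashtchian's advice to benchmark with and without retrieval amounts to running the same evaluation twice, once closed-book and once with retrieved context. A minimal sketch, in which `answer`, `retrieve` and `is_correct` are hypothetical stand-ins for your own model wrapper, retriever and grading logic:

```python
# Sketch: compare a model's closed-book accuracy against its RAG accuracy
# on the same benchmark. `answer`, `retrieve` and `is_correct` are placeholders.

def evaluate(benchmark, answer, retrieve, is_correct):
    closed_book = rag = 0
    for item in benchmark:  # each item: {"question", "gold"}
        q, gold = item["question"], item["gold"]
        closed_book += is_correct(answer(q), gold)               # parametric knowledge only
        rag += is_correct(answer(q, context=retrieve(q)), gold)  # retrieval-augmented
    n = len(benchmark)
    return {"closed_book_acc": closed_book / n, "rag_acc": rag / n}
```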

Reducing hallucinations in RAG systems

Given the finding that models may hallucinate rather than abstain, especially in a RAG setting compared to a no-RAG setting, the researchers explored techniques to mitigate this.

They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).

This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. This method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models.
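
The intervention model itself is not released, but the control flow the paper describes can be sketched roughly as follows. The gating features (a self-reported confidence score plus the autorater's sufficiency label) and the threshold `tau` are assumptions for illustration, not the study's exact setup.

```python
# Sketch of selective generation: a small "intervention model" (gate) decides
# whether the main LLM's draft answer is served or replaced by an abstention.
# The feature set and threshold are illustrative assumptions.

def selective_generate(question, context, main_llm, autorater, gate, tau=0.5):
    draft, confidence = main_llm(question, context)     # answer plus a confidence score
    sufficient = autorater(question, context) == "sufficient"
    # gate() estimates P(correct) from the signals; raising tau trades coverage for accuracy.
    if gate(confidence=confidence, sufficient_context=sufficient) >= tau:
        return draft
    return "I'm not sure — please provide more information."
```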

To put this 2–10% improvement into a business perspective, Rashtchian offers a concrete example from customer support AI. “You could imagine a customer asking about whether they can have a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be ‘stale,’ describing a discount from a few months ago, or maybe it has specific terms and conditions. So it would be better for the model to say, ‘I am not sure,’ or ‘You should talk to a customer support agent to get more information for your specific case.’”

The team also investigated fine-tuning models to encourage abstention. This involved training models on examples where the answer was replaced with “I don’t know” instead of the original ground-truth, particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model to abstain rather than hallucinate.
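
As a rough illustration of how such a training set could be assembled (the field names and the reuse of an autorater are assumptions, not the paper's exact pipeline):

```python
# Sketch: build an abstention fine-tuning set by replacing the target answer
# with "I don't know" wherever the autorater judges the context insufficient.

def build_abstention_dataset(examples, autorater):
    tuned = []
    for ex in examples:  # each ex: {"question", "context", "answer"}
        label = autorater(ex["question"], ex["context"])
        target = ex["answer"] if label == "sufficient" else "I don't know."
        tuned.append({
            "prompt": f"Context: {ex['context']}\nQuestion: {ex['question']}",
            "completion": target,
        })
    return tuned
```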

The results were mixed: fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning might help, “more work is needed to develop a reliable strategy that can balance these objectives.”

Applying sufficient context to real-world RAG systems

For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kind of examples the model will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context. 

“This already will give a good estimate of the % of sufficient context,” Rashtchian said. “If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom.”

Rashtchian advises teams to then “stratify model responses based on examples with sufficient vs. insufficient context.” By examining metrics on these two separate datasets, teams can better understand performance nuances. 

“For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when given insufficient context. This is another observable symptom,” he notes, adding that “aggregating statistics over a whole dataset may gloss over a small set of important but poorly handled queries.”
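
Putting those steps together, a minimal sketch of the diagnostic loop might look like this, with `autorater`, `answer` and `is_correct` again as hypothetical stand-ins for your own components:

```python
# Sketch of the diagnostic workflow: label production-style query-context pairs,
# report the share with sufficient context, and stratify accuracy by that label.

from collections import defaultdict

def diagnose(pairs, autorater, answer, is_correct):
    buckets = defaultdict(lambda: {"n": 0, "correct": 0})
    for p in pairs:  # each p: {"question", "context", "gold"}
        label = autorater(p["question"], p["context"])
        buckets[label]["n"] += 1
        buckets[label]["correct"] += is_correct(answer(p["question"], p["context"]), p["gold"])
    total = sum(b["n"] for b in buckets.values())
    report = {"pct_sufficient": buckets["sufficient"]["n"] / total}
    for label, b in buckets.items():
        report[f"accuracy_{label}"] = b["correct"] / b["n"] if b["n"] else None
    return report
```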

While an LLM-based autorater demonstrates high accuracy, enterprise teams might wonder about the additional computational cost. Rashtchian clarified that the overhead can be managed for diagnostic purposes. 

“I would say running an LLM-based autorater on a small test set (say 500-1000 examples) should be relatively inexpensive, and this can be done ‘offline’ so there’s no worry about the amount of time it takes,” he said. For real-time applications, he concedes, “it would be better to use a heuristic, or at least a smaller model.” The crucial takeaway, according to Rashtchian, is that “engineers should be looking at something beyond the similarity scores, etc, from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights.”

