The RAG reality check: New open-source framework lets enterprises scientifically measure AI performance

By Advanced AI Bot · April 13, 2025 · 7 min read



Enterprises are spending time and money building out retrieval-augmented generation (RAG) systems. The goal is to have an accurate enterprise AI system, but are those systems actually working?

A critical blind spot is the inability to objectively measure whether RAG systems are actually working. One potential solution to that challenge debuts today: the Open RAG Eval open-source framework, developed by enterprise RAG platform provider Vectara in collaboration with Professor Jimmy Lin and his research team at the University of Waterloo.

Open RAG Eval transforms the currently subjective ‘this looks better than that’ comparison approach into a rigorous, reproducible evaluation methodology that can measure retrieval accuracy, generation quality and hallucination rates across enterprise RAG deployments.

The framework assesses response quality using two major metric categories: retrieval metrics and generation metrics. Organizations can apply this evaluation to any RAG pipeline, whether built on Vectara’s platform or on custom solutions. For technical decision-makers, this means finally having a systematic way to identify exactly which components of their RAG implementations need optimization.

“If you can’t measure it, you can’t improve it,” Jimmy Lin, professor at the University of Waterloo, told VentureBeat in an exclusive interview. “In information retrieval and dense vectors, you could measure lots of things, ndcg [Normalized Discounted Cumulative Gain], precision, recall…but when it came to right answers, we had no way, that’s why we started on this path.”

Why RAG evaluation has become the bottleneck for enterprise AI adoption

Vectara was an early pioneer in the RAG space. The company launched in Oct. 2022, before ChatGPT was a household name. Vectara actually debuted technology it originally referred to as grounded AI back in May 2023, as a way to limit hallucinations, before the RAG acronym was commonly used.

Over the last few months, RAG implementations have grown increasingly complex and difficult to assess for many enterprises. A key challenge is that organizations are moving beyond simple question-answering to multi-step agentic systems.

“In the agentic world, evaluation is doubly important, because these AI agents tend to be multi-step,” Amr Awadallah, Vectara CEO and co-founder, told VentureBeat. “If you don’t catch hallucination the first step, then that compounds with the second step, compounds with the third step, and you end up with the wrong action or answer at the end of the pipeline.”
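
A back-of-the-envelope illustration of that compounding (the numbers below are assumed for illustration, not taken from Awadallah): if each step of an agent errs independently with probability p, the chance that at least one step has gone wrong after k steps is 1 − (1 − p)^k.

```python
# Rough illustration of hallucination compounding across agent steps; assumes independent per-step errors.
def chance_any_step_wrong(per_step_error: float, steps: int) -> float:
    return 1.0 - (1.0 - per_step_error) ** steps

print(chance_any_step_wrong(0.05, 3))  # ~0.14: a modest 5% per-step rate leaves ~14% odds of a bad 3-step chain
```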

How Open RAG Eval works: Breaking the black box into measurable components

The Open RAG Eval framework approaches evaluation through a nugget-based methodology. 

Lin explained that the nugget approach breaks responses down into essential facts, then measures how effectively a system captures the nuggets.
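
The article does not spell out the scoring formula, but the intuition lends itself to a simple coverage calculation. The sketch below is an assumption-laden illustration of that intuition, not Vectara’s implementation: split the expected answer into atomic facts (nuggets) and count how many the generated response supports, with `is_supported` standing in for an LLM or human judgment.

```python
# Illustrative sketch of nugget-based coverage scoring; not the Open RAG Eval implementation.
from typing import Callable, List

def nugget_coverage(
    nuggets: List[str],                       # essential facts a good answer should contain
    response: str,                            # the RAG system's generated answer
    is_supported: Callable[[str, str], bool], # judge: does the response support this nugget?
) -> float:
    """Return the fraction of essential facts that the response actually captures."""
    if not nuggets:
        return 0.0
    captured = sum(1 for nugget in nuggets if is_supported(nugget, response))
    return captured / len(nuggets)

# A naive keyword judge stands in here for the LLM-based judgment described later in the article.
naive_judge = lambda nugget, response: nugget.lower() in response.lower()
print(nugget_coverage(["open rag eval is open source"], "Open RAG Eval is open source.", naive_judge))  # 1.0
```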

The framework evaluates RAG systems across four specific metrics:

  • Hallucination detection – Measures the degree to which generated content contains fabricated information not supported by source documents.
  • Citation – Quantifies how well citations in the response are supported by source documents.
  • Auto nugget – Evaluates the presence of essential information nuggets from source documents in generated responses.
  • UMBRELA (Unified Method for Benchmarking Retrieval Evaluation with Large Language Model Assessment) – A holistic method for assessing overall retriever performance.

Importantly, the framework evaluates the entire RAG pipeline end-to-end, providing visibility into how embedding models, retrieval systems, chunking strategies and LLMs interact to produce final outputs.
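
To make the end-to-end framing concrete, here is a hedged sketch of what such a harness might look like. Every name in it (`RagOutput`, `evaluate_rag`, the scorer callables) is invented for illustration and is not the actual Open RAG Eval API.

```python
# Hypothetical end-to-end RAG evaluation harness; every name here is illustrative, not Open RAG Eval's API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RagOutput:
    query: str
    retrieved_passages: List[str]  # what the retriever returned
    answer: str                    # what the LLM generated from those passages

def evaluate_rag(
    pipeline: Callable[[str], RagOutput],              # embedding + retrieval + chunking + LLM, end to end
    queries: List[str],
    scorers: Dict[str, Callable[[RagOutput], float]],  # e.g. hallucination, citation, autonugget, umbrela
) -> Dict[str, float]:
    """Run every query through the full pipeline and average each metric across the query set."""
    totals = {name: 0.0 for name in scorers}
    for query in queries:
        output = pipeline(query)
        for name, scorer in scorers.items():
            totals[name] += scorer(output)
    return {name: total / len(queries) for name, total in totals.items()}
```

Because the whole pipeline sits behind one callable, a change to chunking or to the embedding model shows up in the same per-metric averages as a change to the generation prompt.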

The technical innovation: Automation through LLMs

What makes Open RAG Eval technically significant is how it uses large language models to automate what was previously a manual, labor-intensive evaluation process.

“The state of the art before we started, was left versus right comparisons,” Lin explained. “So this is, do you like the left one better? Do you like the right one better? Or they’re both good, or they’re both bad? That was sort of one way of doing things.”

Lin noted that the nugget-based evaluation approach itself isn’t new, but its automation through LLMs represents a breakthrough.

The framework uses Python with sophisticated prompt engineering to get LLMs to perform evaluation tasks like identifying nuggets and assessing hallucinations, all wrapped in a structured evaluation pipeline.
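
As a rough illustration of that pattern (the actual prompts and parsing in Open RAG Eval are more elaborate), the sketch below asks an LLM to extract nuggets from a source passage and then to judge whether each one is supported by the generated answer; `call_llm` and both prompt templates are placeholders, not the framework’s own.

```python
# Sketch of LLM-automated nugget extraction and support judging; the prompts and `call_llm` are placeholders.
import json
from typing import Callable, List

NUGGET_PROMPT = (
    "List the atomic factual claims in the passage below as a JSON array of short strings.\n"
    "Passage:\n{passage}"
)
SUPPORT_PROMPT = (
    "Answer strictly 'yes' or 'no': is the claim supported by the response?\n"
    "Claim: {claim}\nResponse: {response}"
)

def extract_nuggets(passage: str, call_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to decompose a source passage into atomic facts (nuggets)."""
    return json.loads(call_llm(NUGGET_PROMPT.format(passage=passage)))

def judge_support(claim: str, response: str, call_llm: Callable[[str], str]) -> bool:
    """Ask the LLM whether the generated response supports a single claim."""
    verdict = call_llm(SUPPORT_PROMPT.format(claim=claim, response=response))
    return verdict.strip().lower().startswith("yes")
```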

Competitive landscape: How Open RAG Eval fits into the evaluation ecosystem

As enterprise use of AI continues to mature, there is a growing number of evaluation frameworks. Just last week, Hugging Face launched YourBench, which tests models against a company’s own internal data. At the end of January, Galileo launched its Agentic Evaluations technology.

Open RAG Eval is different in that it is strongly focused on the RAG pipeline, not just LLM outputs. The framework also has a strong academic foundation and is built on established information retrieval science rather than ad-hoc methods.

The framework builds on Vectara’s previous contributions to the open-source AI community, including its Hughes Hallucination Evaluation Model (HHEM), which has been downloaded over 3.5 million times on Hugging Face and has become a standard benchmark for hallucination detection.

“We’re not calling it the Vectara eval framework, we’re calling it the Open RAG Eval framework because we really want other companies and other institutions to start helping build this out,” Awadallah emphasized. “We need something like that in the market, for all of us, to make these systems evolve in the right way.”

What Open RAG Eval means in the real world

While still an early-stage effort, Vectara already has multiple users interested in the Open RAG Eval framework.

Among them is Jeff Hummel, SVP of Product and Technology at real estate firm Anywhere.re. Hummel expects that partnering with Vectara will allow him to streamline his company’s RAG evaluation process.

Hummel noted that scaling his RAG deployment introduced significant challenges around infrastructure complexity, iteration velocity and rising costs. 

“Knowing the benchmarks and expectations in terms of performance and accuracy helps our team be predictive in our scaling calculations,” Hummel said. “To be frank, there weren’t a ton of frameworks for setting benchmarks on these attributes; we relied heavily on user feedback, which was sometimes objective and did translate to success at scale.”

From measurement to optimization: Practical applications for RAG implementers

For technical decision-makers, Open RAG Eval can help answer crucial questions about RAG deployment and configuration:

  • Whether to use fixed-token chunking or semantic chunking
  • Whether to use hybrid or vector search, and what values to use for lambda in hybrid search (see the sketch below)
  • Which LLM to use and how to optimize RAG prompts
  • What thresholds to use for hallucination detection and correction

In practice, organizations can establish baseline scores for their existing RAG systems, make targeted configuration changes, and measure the resulting improvement. This iterative approach replaces guesswork with data-driven optimization.
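
A minimal sketch of that loop, with everything in it assumed for illustration: the hybrid-search blend shows the role of the lambda mentioned above, and the comparison helper turns two sets of metric scores (placeholder numbers, not real Open RAG Eval output) into per-metric deltas against a baseline.

```python
# Illustrative only: a generic hybrid-search formula and placeholder metric values, not Open RAG Eval output.

def hybrid_score(dense: float, sparse: float, lam: float) -> float:
    """Typical hybrid retrieval blend: lambda weights vector similarity against keyword relevance."""
    return lam * dense + (1.0 - lam) * sparse

def improvement(before: dict, after: dict) -> dict:
    """Per-metric delta vs. the baseline; hallucination is a cost, so its delta is negated."""
    sign = lambda name: -1.0 if name == "hallucination" else 1.0
    return {name: sign(name) * (after[name] - before[name]) for name in before}

# Placeholder scores for a baseline run and a run after one targeted change (e.g. switching to semantic chunking):
baseline  = {"hallucination": 0.12, "citation": 0.71, "autonugget": 0.58, "umbrela": 0.64}
candidate = {"hallucination": 0.08, "citation": 0.74, "autonugget": 0.66, "umbrela": 0.65}
print(improvement(baseline, candidate))  # positive deltas mean the change helped on that metric
```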

While this initial release focuses on measurement, the roadmap includes optimization capabilities that could automatically suggest configuration improvements based on evaluation results. Future versions might also incorporate cost metrics to help organizations balance performance against operational expenses.

For enterprises looking to lead in AI adoption, Open RAG Eval means implementing a scientific approach to evaluation rather than relying on subjective assessments or vendor claims. For those earlier in their AI journey, it provides a structured way to approach evaluation from the beginning, potentially avoiding costly missteps as they build out their RAG infrastructure.
