Close Menu
  • Home
  • AI Models
    • DeepSeek
    • xAI
    • OpenAI
    • Meta AI Llama
    • Google DeepMind
    • Amazon AWS AI
    • Microsoft AI
    • Anthropic (Claude)
    • NVIDIA AI
    • IBM WatsonX Granite 3.1
    • Adobe Sensi
    • Hugging Face
    • Alibaba Cloud (Qwen)
    • Baidu (ERNIE)
    • C3 AI
    • DataRobot
    • Mistral AI
    • Moonshot AI (Kimi)
    • Google Gemma
    • xAI
    • Stability AI
    • H20.ai
  • AI Research
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Microsoft Research
    • Meta AI Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding & Startups
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • Expert Insights & Videos
    • Google DeepMind
    • Lex Fridman
    • Matt Wolfe AI
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • Matt Wolfe AI
    • The TechLead
    • Andrew Ng
    • OpenAI
  • Expert Blogs
    • François Chollet
    • Gary Marcus
    • IBM
    • Jack Clark
    • Jeremy Howard
    • Melanie Mitchell
    • Andrew Ng
    • Andrej Karpathy
    • Sebastian Ruder
    • Rachel Thomas
    • IBM
  • AI Policy & Ethics
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
    • EFF AI
    • European Commission AI
    • Partnership on AI
    • Stanford HAI Policy
    • Mozilla Foundation AI
    • Future of Life Institute
    • Center for AI Safety
    • World Economic Forum AI
  • AI Tools & Product Releases
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
    • Image Generation
    • Video Generation
    • Writing Tools
    • AI for Recruitment
    • Voice/Audio Generation
  • Industry Applications
    • Finance AI
    • Healthcare AI
    • Legal AI
    • Manufacturing AI
    • Media & Entertainment
    • Transportation AI
    • Education AI
    • Retail AI
    • Agriculture AI
    • Energy AI
  • AI Art & Entertainment
    • AI Art News Blog
    • Artvy Blog » AI Art Blog
    • Weird Wonderful AI Art Blog
    • The Chainsaw » AI Art
    • Artvy Blog » AI Art Blog
What's Hot

Reddit Sues Anthropic for Scraping Content to Train Claude AI

Google DeepMind’s CEO Thinks AI Will Make Humans Less Selfish

What Is ChatGPT? Everything You Need to Know About OpenAI’s Popular Chatbot

Facebook X (Twitter) Instagram
Advanced AI News
  • Home
  • AI Models
    • Adobe Sensi
    • Aleph Alpha
    • Alibaba Cloud (Qwen)
    • Amazon AWS AI
    • Anthropic (Claude)
    • Apple Core ML
    • Baidu (ERNIE)
    • ByteDance Doubao
    • C3 AI
    • Cohere
    • DataRobot
    • DeepSeek
  • AI Research & Breakthroughs
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Meta AI Research
    • Microsoft Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding & Startups
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • Expert Insights & Videos
    • Google DeepMind
    • Lex Fridman
    • Meta AI Llama
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • Matt Wolfe AI
    • The TechLead
    • Andrew Ng
    • OpenAI
  • Expert Blogs
    • François Chollet
    • Gary Marcus
    • IBM
    • Jack Clark
    • Jeremy Howard
    • Melanie Mitchell
    • Andrew Ng
    • Andrej Karpathy
    • Sebastian Ruder
    • Rachel Thomas
    • IBM
  • AI Policy & Ethics
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
    • EFF AI
    • European Commission AI
    • Partnership on AI
    • Stanford HAI Policy
    • Mozilla Foundation AI
    • Future of Life Institute
    • Center for AI Safety
    • World Economic Forum AI
  • AI Tools & Product Releases
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
    • Image Generation
    • Video Generation
    • Writing Tools
    • AI for Recruitment
    • Voice/Audio Generation
  • Industry Applications
    • Education AI
    • Energy AI
    • Finance AI
    • Healthcare AI
    • Legal AI
    • Media & Entertainment
    • Transportation AI
    • Manufacturing AI
    • Retail AI
    • Agriculture AI
  • AI Art & Entertainment
    • AI Art News Blog
    • Artvy Blog » AI Art Blog
    • Weird Wonderful AI Art Blog
    • The Chainsaw » AI Art
    • Artvy Blog » AI Art Blog
Advanced AI News
Home » BenchmarkQED: Automated benchmarking of RAG systems – Microsoft Research
Microsoft Research

BenchmarkQED: Automated benchmarking of RAG systems – Microsoft Research

Advanced AI BotBy Advanced AI BotJune 5, 2025No Comments6 Mins Read
Share Facebook Twitter Pinterest Copy Link Telegram LinkedIn Tumblr Email
Share
Facebook Twitter LinkedIn Pinterest Email


Diagram showing how the dimensions of query source (data-driven vs activity-driven) and query scope (local vs global) create four query classes that span the local-to-global query spectrum: data-local, activity-local, data-global, and activity-global.

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics. 

To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub (opens in new tab). It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.  

BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks. 

In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.

In the paper, we distinguish between local queries, where answers are found in a small number of text regions, and sometimes even a single region, and global queries, which require reasoning over large portions of or even the entire dataset. 

Conventional vector-based RAG excels at local queries because the regions containing the answer to the query resemble the query itself and can be retrieved as the nearest neighbor in the vector space of text embeddings. However, it struggles with global questions, such as, “What are the main themes of the dataset?” which require understanding dataset qualities not explicitly stated in the text.  

AutoQ: Automated query synthesis

This limitation motivated the development of GraphRAG a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.

AutoQ extends this approach by generating synthetic queries across the spectrum of queries, from local to global. It defines four distinct classes based on the source and scope of the query (Figure 1, top) forming a logical progression along the spectrum (Figure 1, bottom).

Diagram showing how the dimensions of query source (data-driven vs activity-driven) and query scope (local vs global) create four query classes that span the local-to-global query spectrum: data-local, activity-local, data-global, and activity-global.
Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. 

AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.

Diagram showing the processes for synthesizing queries in each of the four classes. Each process involves steps like generating dataset summaries, personas, tasks, and candidate queries, followed by clustering candidate queries and selecting the final query set. The data-local example query is “Why are junior doctors in South Korea striking in February 2024?”. The activity-local example query is “What are the public health implications of the newly discovered Alaskapox virus in Alaska?”. The data-global example query is “Across the dataset, what are the key public health challenges and the measures being taken to address them?”. The activity-global example query is “Across the dataset, what are the main public health initiatives mentioned that target underserved communities?”.
Figure 2. Synthesis process and example query for each of the four AutoQ query classes. 

Spotlight: blog post

GraphRAG auto-tuning provides rapid adaptation to new domains

GraphRAG uses LLM-generated knowledge graphs to substantially improve complex Q&A over retrieval-augmented generation (RAG). Discover automatic tuning of GraphRAG for new datasets, making it more accurate and relevant.

Opens in a new tab

AutoE: Automated evaluation framework 

Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

Comprehensiveness: Does the answer address all relevant aspects of the question? 

Diversity: Does it present varied perspectives or insights? 

Empowerment: Does it help the reader understand and make informed judgments? 

Relevance: Does it address what the question is specifically asking?  

The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.

An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.

These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.

Bar charts with the y-axes representing win rates for LazyGraphRAG conditions. The x-axes contain a range of comparison conditions. Bars are clustered by LazyGraphRAG (LGR) condition and charts are faceted by query class.
Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.

The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.

Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG (opens in new tab), RAPTOR (opens in new tab), and TREX (opens in new tab). All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.

LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.

Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries, Vector RAG for local queries, and GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.

Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with 1-million token context window containing most of the dataset.

Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1 that enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers that have greater relevance to the query, even though these answers may be dramatically more comprehensive, diverse, and empowering overall.

Bar charts with the y-axes representing win rates for LazyGraphRAG. The x-axes contain comparison conditions for vector-based RAG with 8 thousand, 120 thousand, and 1 million token context windows. Charts are faceted by query class and quality metric.
Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition. 

AutoD: Automated data sampling and summarization

Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.

The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.

AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.

Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech (opens in new tab) podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository (opens in new tab), alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.  

We hope these datasets, together with the BenchmarkQED tools (opens in new tab), help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub (opens in new tab). 

Opens in a new tab



Source link

Follow on Google News Follow on Flipboard
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Copy Link
Previous ArticleStanford HAI’s 2025 AI Index Reveals Record Growth in AI Capabilities, Investment, and Regulation
Next Article SUMO-MCP: Leveraging the Model Context Protocol for Autonomous Traffic Simulation and Optimization
Advanced AI Bot
  • Website

Related Posts

What AI’s impact on individuals means for the health workforce and industry

May 29, 2025

FrodoKEM: A conservative quantum-safe cryptographic algorithm

May 27, 2025

Abstracts: Zero-shot models in single-cell biology with Alex Lu

May 22, 2025
Leave A Reply Cancel Reply

Latest Posts

Jiaxing Train Station By Architect Ma Yansong Is A Model Of People-Centric, Green Urban Design

Hugh Jackman And Sonia Friedman Boldly Bid To Democratize Theater

Men’s Swimwear Gets Casual At Miami Swim Week 2025

Original Prototype for Jane Birkin’s Hermes Bag Consigned to Sotheby’s

Latest Posts

Reddit Sues Anthropic for Scraping Content to Train Claude AI

June 7, 2025

Google DeepMind’s CEO Thinks AI Will Make Humans Less Selfish

June 7, 2025

What Is ChatGPT? Everything You Need to Know About OpenAI’s Popular Chatbot

June 7, 2025

Subscribe to News

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Welcome to Advanced AI News—your ultimate destination for the latest advancements, insights, and breakthroughs in artificial intelligence.

At Advanced AI News, we are passionate about keeping you informed on the cutting edge of AI technology, from groundbreaking research to emerging startups, expert insights, and real-world applications. Our mission is to deliver high-quality, up-to-date, and insightful content that empowers AI enthusiasts, professionals, and businesses to stay ahead in this fast-evolving field.

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

YouTube LinkedIn
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2025 advancedainews. Designed by advancedainews.

Type above and press Enter to search. Press Esc to cancel.