Microsoft Research

VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows

By Advanced AI Editor · August 5, 2025 · 11 min read


[Header image: two white icons on a blue-to-green gradient background, one showing a central figure linked to others (representing a network), the other depicting lines connecting to a document (symbolizing data flow).]

Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”[1]

Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).  

In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components: 

Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source. 

Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.

Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.[2]

VeriTrail’s hallucination detection process

A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.  
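
To make the structure concrete, here is a minimal sketch of such a DAG in Python (hypothetical names and layout, not code from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str   # unique identifier, e.g. "node_15"
    stage: int     # position in the generative process (1 = source text chunks)
    text: str      # source chunk, intermediate output, or final output

@dataclass
class GenerativeDAG:
    nodes: dict[str, Node] = field(default_factory=dict)
    # inputs[b] lists the IDs of nodes whose text was used to produce node b
    inputs: dict[str, list[str]] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str) -> None:
        """Record that node `src` was an input when generating node `dst`."""
        self.inputs.setdefault(dst, []).append(src)

    def input_nodes(self, node_id: str) -> list[Node]:
        return [self.nodes[i] for i in self.inputs.get(node_id, [])]
```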

An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.[3]

[Figure 1 image description: A GraphRAG run is depicted as a directed acyclic graph. The Stage 1 nodes represent source text chunks. Each Stage 1 node has an edge pointing to a Stage 2 node, which corresponds to an entity or a relationship. Entity 3 was extracted from two source text chunks, so its descriptions are summarized. The summarized entity description forms a Stage 3 node. The Stage 2 and 3 nodes have edges pointing to Stage 4 nodes, which represent community reports. The Stage 4 nodes have edges pointing to Stage 5 nodes, which correspond to map-level answers. The Stage 5 nodes each have an edge pointing to the terminal node, which represents the final answer. The terminal node is the only node in Stage 6.]
Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by “⭤ “), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a “community report” that summarizes the entities and relationships (Stage 4). To answer a user’s question, an LM generates “map-level answers” based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1. 
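
In outline, the verification flow just described might look like the following sketch (hypothetical names again; `extract_claims` stands in for Claimify, and `verify_claim` is the per-claim procedure walked through in the case studies below):

```python
def verify_final_output(dag, terminal_id, extract_claims, verify_claim):
    """Verify every claim in the final output against the DAG, moving backward.

    `extract_claims` stands in for Claimify; `verify_claim` is the per-claim
    procedure illustrated in the case studies below.
    """
    claims = extract_claims(dag.nodes[terminal_id].text)
    frontier = dag.input_nodes(terminal_id)   # the nodes that fed the final output
    return {claim: verify_claim(dag, claim, frontier) for claim in claims}
```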

Case study 1: A “Fully Supported” claim

[Figure 2 image description: An example of VeriTrail's claim verification process where the claim is found “Fully Supported.” A claim extracted from the terminal node, Node 17, is “Legislative efforts have been made to address the high cost of diabetes-related supplies in the US.” In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. The sentence “The general assembly in North Carolina is considering legislation to set a cap on insulin prices, which indicates that high insulin prices are a contributing factor to the high cost of diabetes-related supplies in the US” is selected as evidence from Node 15. The tentative verdict is “Fully Supported.” In Iteration 2, VeriTrail checks Nodes 12 and 13, which are the source nodes of Node 15. The sentence “The General Assembly in North Carolina is considering legislation to set a cap on insulin prices” is selected as evidence from Node 13. The verdict remains “Fully Supported.” In Iteration 3, VeriTrail checks Nodes 4, 5, and 11, which are the source nodes of Node 13. The sentence “The General Assembly is the legislative body in North Carolina considering legislation to cap insulin prices” is selected as evidence from Node 4. The verdict is still “Fully Supported.” In Iteration 4, VeriTrail checks Node 1, which is the source node of Node 4. The selected evidence is “‘There’s actually legislation in North Carolina at the General Assembly to set a cap on insulin…’ Stein said.” The corresponding verdict is “Fully Supported.” Since Node 1 represents a raw text chunk, it does not have any source nodes to check. Therefore, verification terminates and the “Fully Supported” verdict is deemed final.]
Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Fully Supported” claim.

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated: 

In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.

An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.

Next, an LM performed Verdict Generation. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a “Not Fully Supported” verdict. Instead, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on the evidence. In this case, the verdict was “Fully Supported.”

Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4. 

In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
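
Putting the steps of Case Study 1 together, a simplified per-claim loop might look like the sketch below (an illustration under the assumptions above, not the authors' implementation; the handling of "Not Fully Supported" verdicts is shown after Case Study 2):

```python
def verify_claim(dag, claim, frontier, select_evidence, generate_verdict):
    """Per-claim verification loop for the path shown in Case Study 1.

    `select_evidence` and `generate_verdict` are stand-ins for the LM-backed
    Evidence Selection and Verdict Generation steps.
    """
    verdict, evidence = "Inconclusive", []
    while frontier:
        # Evidence Selection: candidate nodes are split into sentences with
        # programmatically assigned IDs; the LM returns the IDs (and a summary)
        # that bear on the claim.
        evidence = select_evidence(claim, frontier)   # [(node_id, sentence_id, sentence), ...]
        if not evidence:
            return "Not Fully Supported", evidence    # nothing selected -> not supported
        # Verdict Generation: classify the claim against the selected evidence.
        verdict = generate_verdict(claim, evidence)   # "Fully Supported" / "Not Fully Supported" / "Inconclusive"
        if verdict == "Not Fully Supported":
            return verdict, evidence                  # widened handling shown after Case Study 2
        # Narrow the search: only follow the inputs of nodes that contributed evidence.
        contributing = {node_id for node_id, _, _ in evidence}
        frontier = [n for nid in contributing for n in dag.input_nodes(nid)]
    # Source text chunks have no inputs, so the loop ends and the last verdict is final.
    return verdict, evidence
```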

Case study 2: A “Not Fully Supported” claim

[Figure 3 image description: An example of VeriTrail's claim verification process where the claim is found “Not Fully Supported.” We assume that the maximum number of consecutive “Not Fully Supported” verdicts was set to 2. A claim extracted from the terminal node, Node 17, is “Challenges related to electric vehicle battery repairability contribute to sluggish retail auto sales in China.” In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. Two sentences are selected as evidence. The first sentence is “Challenges with electric vehicle (EV) battery disposal and repair may also contribute to the sluggishness in retail auto sales.” The second sentence is “Junkyards are accumulating discarded EV battery packs, while collision shops face limitations in repairing EV battery packs, which could affect consumer confidence and demand.” These sentences are both from Node 15. The tentative verdict is “Not Fully Supported.” In Iteration 2, VeriTrail checks Nodes 12, 13, and 14. Nodes 12 and 13 are the source nodes of Node 15. Node 14 is the source node of Node 16, which was checked in Iteration 1. The sentence “The electric vehicle market in China is influenced by challenges associated with EV battery disposal and repair” is selected as evidence from Node 12. The verdict remains “Not Fully Supported.” Since two consecutive “Not Fully Supported” verdicts have been reached, which was the maximum, verification terminates and the final verdict is “Not Fully Supported.”]
Figure 3: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Not Fully Supported” claim, where the maximum number of consecutive “Not Fully Supported” verdicts was set to 2.

Figure 3 provides an example of a claim where VeriTrail identified hallucination:

In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.

Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1 where the verdict was “Fully Supported,” VeriTrail only checked the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.

In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
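
In code terms, the only difference from Case Study 1 is how the next set of candidate nodes is chosen. A hypothetical sketch of that choice, following the behavior just described:

```python
def next_frontier(dag, checked_node_ids, evidence, verdict):
    """Choose the nodes to examine in the next iteration."""
    if verdict == "Not Fully Supported":
        # Evidence Selection may have missed something, so re-check the inputs of
        # every node examined this iteration, not only those that yielded evidence.
        parents = set(checked_node_ids)
    else:
        parents = {node_id for node_id, _, _ in evidence}
    return [n for nid in parents for n in dag.input_nodes(nid)]
```

An outer loop would also track consecutive "Not Fully Supported" verdicts and stop once the user-configured maximum (2 in Figure 3) is reached, at which point the verdict is deemed final.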


Providing traceability

In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability: 

Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.

Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur. 

The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries. 
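
The paper does not prescribe a data format, but one plausible shape for an evidence trail entry is sketched below; the fields mirror what VeriTrail is described as returning (selected sentences, their node IDs, and the generated summaries):

```python
from dataclasses import dataclass

@dataclass
class EvidenceEntry:
    iteration: int    # which verification iteration produced this entry
    node_id: str      # node the sentence was selected from, e.g. "node_15"
    stage: int        # that node's stage in the generative process
    sentence_id: int  # programmatically assigned sentence ID within the node
    sentence: str     # the selected sentence itself
    summary: str      # LM-generated summary of the evidence for this iteration

# For the Case Study 1 claim, the trail would contain sentence 8 of Node 15,
# sentence 11 of Node 13, sentence 26 of Node 4, and sentence 79 of Node 1.
```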

Key design features

VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include: 

During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated.
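
A minimal sketch of that check, under the assumption that assigned sentence IDs are stored in a simple mapping: only IDs that were actually assigned survive, so every sentence placed in the evidence trail exists verbatim in some node.

```python
def resolve_evidence_ids(returned_ids, assigned_sentences):
    """Map LM-returned sentence IDs back to real sentences.

    `assigned_sentences` maps each programmatically assigned sentence ID to its
    sentence text for the nodes under consideration.
    """
    resolved = []
    for sid in returned_ids:
        if sid in assigned_sentences:
            resolved.append((sid, assigned_sentences[sid]))
        # IDs that were never assigned are discarded rather than trusted.
    return resolved
```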

After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.

VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.  
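
A sketch of how that splitting and compression might work; `select_evidence`, `token_len`, `max_input_tokens`, and `max_reruns` are hypothetical stand-ins for VeriTrail's configurable prompt budget and rerun limit.

```python
def select_evidence_with_limits(claim, nodes, select_evidence, token_len,
                                max_input_tokens, max_reruns=2):
    """Run Evidence Selection under a per-prompt input size limit."""
    # Greedily split the candidate nodes into batches that each fit in one prompt.
    batches, current, size = [], [], 0
    for node in nodes:
        if current and size + token_len(node.text) > max_input_tokens:
            batches.append(current)
            current, size = [], 0
        current.append(node)
        size += token_len(node.text)
    if current:
        batches.append(current)

    # Pool the evidence selected from every batch: [(node_id, sentence_id, sentence), ...]
    evidence = [e for batch in batches for e in select_evidence(claim, batch)]

    # If the pooled evidence is still too large for Verdict Generation, rerun
    # Evidence Selection over it to compress it further (assuming the helper also
    # accepts previously selected evidence), up to `max_reruns` times.
    reruns = 0
    while sum(token_len(s) for _, _, s in evidence) > max_input_tokens and reruns < max_reruns:
        evidence = select_evidence(claim, evidence)
        reruns += 1
    return evidence
```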

The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict. 

Evaluating VeriTrail’s performance

We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization[4] and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).

We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore and INFUSE); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.[5]

Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.  

For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”

[1] The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report.

[2] VeriTrail is currently used for research purposes only and is not available commercially.

[3] We focus on GraphRAG’s global search method.

[4] In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021; Chang et al., 2023).

[5] The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.
