Open Data Synthesis For Deep Research - Takara TLDR

Large language models (LLMs) are increasingly expected to go beyond simple
factual queries toward Deep Research: tasks that require decomposing questions
into sub-problems, coordinating multi-step reasoning, and synthesizing evidence
from diverse sources. We formalize Deep Research tasks with verifiable answers
as Hierarchical Constraint Satisfaction Problems (HCSPs), which are
fundamentally different from single-constraint, multi-hop, or flat CSP
formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA)
fail to capture this complexity, while recent synthetic datasets often
introduce shortcut reasoning or knowledge leakage, or lack sufficient
structural depth. To address this gap, we introduce InfoSeek, a scalable
framework for synthesizing complex Deep Research tasks. InfoSeek uses a
dual-agent system to recursively build a Research Tree from large-scale
webpages, blur intermediate nodes into valid sub-problems, and convert these
trees into natural language questions that require traversing the full
hierarchy. It also enables rapid scaling, yielding over 50K training examples,
a curated test set, and reasoning trajectories generated via rejection
sampling. Experiments show that models trained on InfoSeek consistently
outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B
LLMs optimized with InfoSeek surpass much larger 32B models and lightweight
commercial APIs (e.g., Gemini2.5-Flash), while achieving performance
comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving
meta-information such as intermediate steps and retrieval labels, InfoSeek
further supports advanced optimization strategies, including compound reward
design and trajectory-level exploration. We provide our code and datasets at
https://github.com/VectorSpaceLab/InfoSeek.
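
To make the hierarchical constraint idea concrete, here is a minimal Python
sketch. It is not the InfoSeek implementation: the Node class, the solve
function, and the toy knowledge base of placeholder entities are all
assumptions made for illustration. Each node lists constraints on an unknown
entity, and a constraint may point either to a known entity or to another
node, which plays the role of a blurred intermediate sub-problem.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical toy knowledge base; every entity name is a placeholder.
KB: Set[Triple] = {
    ("alice", "field", "physics"),
    ("bob", "field", "physics"),
    ("carol", "field", "chemistry"),
    ("alice", "born_in", "city_x"),
    ("bob", "born_in", "city_y"),
    ("carol", "born_in", "city_x"),
    ("city_x", "located_in", "country_p"),
    ("city_y", "located_in", "country_q"),
}


@dataclass
class Node:
    """An unknown entity described only by constraints (illustrative)."""
    # Each constraint is (relation, target). A target is either a known
    # entity name or a child Node that must be resolved first.
    constraints: List[Tuple[str, Union[str, "Node"]]]


def solve(node: Node, kb: Set[Triple]) -> Set[str]:
    """Return every entity that satisfies all constraints of the node."""
    candidates = {s for s, _, _ in kb}
    for relation, target in node.constraints:
        # Resolve a child node recursively; wrap a known entity in a set.
        targets = solve(target, kb) if isinstance(target, Node) else {target}
        candidates &= {s for s, r, o in kb if r == relation and o in targets}
    return candidates


# "Which physicist was born in a city located in country_p?"
city = Node([("located_in", "country_p")])
root = Node([("field", "physics"), ("born_in", city)])

answer = solve(root, KB)
flat_only = solve(Node([("field", "physics")]), KB)

print(sorted(answer))     # the full hierarchy yields a single entity
print(sorted(flat_only))  # one flat constraint alone stays ambiguous
```

In this toy setting the question has a unique, verifiable answer only when
the nested sub-problem is resolved as well, which is the property the
abstract attributes to questions that require traversing the full hierarchy.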
