Paper Page - MetaSynth: Meta-Prompting-Driven Agentic Scaffolds For Diverse Synthetic Data Generation

Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.
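
The abstract does not name the seven automated diversity metrics, so the following is only an illustrative sketch of what such a metric can look like. It computes distinct-n, a commonly used lexical diversity measure (the share of unique n-grams among all n-grams in a corpus). The function name `distinct_n`, the whitespace tokenization, and the toy corpora are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def distinct_n(texts, n=2):
    """Return unique n-grams divided by total n-grams over all texts.

    Assumption: lowercased whitespace tokenization, chosen for simplicity.
    """
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Toy corpora (hypothetical): repeated text versus varied text.
templated = ["the bank raised rates today", "the bank raised rates today"]
varied = ["the bank raised rates today", "a new enzyme inhibitor was approved"]

print(distinct_n(templated))  # lower score: every bigram appears twice
print(distinct_n(varied))     # higher score: no bigram repeats
```

A score near 1 means few repeated n-grams, while a low score signals the kind of repetition that template-style prompting can produce. A single lexical measure like this is a coarse proxy and would normally be read alongside other metrics.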
