Paper Page - TF1-EN-3M: Three Million Synthetic Moral Fables For Training Small, Open Language Models

We’ve just released TF1-EN-3M, the largest open corpus of machine-generated moral fables to date — and it was created entirely with models no larger than 8B parameters. 🎉

📄 TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

🌟 Why Another Story Dataset?

Existing collections such as Aesop’s Fables top out at a few hundred examples — far too small for today’s data-hungry models.
Most educational, on-device, or open-source projects can’t deploy 70B-parameter giants.
We asked: Can compact, fully open models (≤ 8B parameters) generate a massive, high-quality, ethics-focused story corpus that anyone can fine-tune?

📦 What’s Inside TF1-EN-3M?

| Feature | Details |
| --- | --- |
| Size | 3,000,000 English fables (≈ 1B tokens) |
| Structure | Six-slot scaffold: character → trait → setting → conflict → resolution → moral |
| Audience | Written for 4–7-year-olds (simple vocabulary, explicit morals) |
| Metadata | Prompt, model name, token counts, latency, GPU type & cost per story |
| License | CC-BY-4.0 (free to remix, filter, or extend) |

👉 Dataset on the Hub: klusai/ds-tf1-en-3m

🤖 One-Paragraph Generation Recipe

A combinatorial engine expands six curated lists (100 options each) into millions of unique prompts.
Ten open-weight instruction models (1B–8B) compete; we score Grammar, Creativity, Moral Clarity, and Prompt Adherence with a gpt-o3-mini critic, plus Self-BLEU & Distinct-1 diversity checks.
LLaMA-3.1-8B-Instruct wins: strong quality, a small VRAM footprint, and a cost below $0.0005 per story on an L40S GPU.
All code lives in the public tinyfabulist repo.
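
To make the combinatorial step concrete, here is a minimal sketch, not the tinyfabulist implementation. The slot values, the `TEMPLATE` wording, and the function names are all hypothetical. With six lists of 100 options each, the full product is 100^6 = 10^12 combinations, so the sketch samples distinct indices into the product instead of enumerating it.

```python
import random

# Hypothetical slot lists (3 options each here; the post describes 100 per slot).
SLOTS = {
    "character": ["fox", "tortoise", "sparrow"],
    "trait": ["greedy", "patient", "boastful"],
    "setting": ["a quiet forest", "a busy farm", "a riverbank"],
    "conflict": ["takes food that is not theirs", "is challenged to a race", "gets lost before a storm"],
    "resolution": ["shares what they have", "asks a friend for help", "admits the mistake"],
    "moral": ["honesty builds trust", "slow and steady wins", "kindness is repaid"],
}

# Hypothetical prompt wording; the real template is not given in the post.
TEMPLATE = (
    "Write a short fable for young children about a {trait} {character} "
    "in {setting} who {conflict}. The story resolves when the {character} "
    "{resolution}. End with the moral: {moral}."
)


def total_combinations(slots):
    """Size of the Cartesian product over all slot lists."""
    total = 1
    for options in slots.values():
        total *= len(options)
    return total


def sample_prompts(slots, k, seed=0):
    """Draw k distinct slot combinations without materialising the full product."""
    rng = random.Random(seed)
    names = list(slots)
    # Sampling from a range is lazy, so this also works for 10**12 combinations.
    indices = rng.sample(range(total_combinations(slots)), k)
    prompts = []
    for index in indices:
        choice = {}
        # Decode the index as a mixed-radix number, one digit per slot.
        for name in reversed(names):
            index, digit = divmod(index, len(slots[name]))
            choice[name] = slots[name][digit]
        prompts.append(TEMPLATE.format(**choice))
    return prompts


if __name__ == "__main__":
    print(total_combinations(SLOTS))
    for prompt in sample_prompts(SLOTS, 2):
        print(prompt)
```

Because each sampled index maps to exactly one slot combination, distinct indices give distinct prompts.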

🔍 Quick Quality Peek

Mean critic score: 7.8 / 10 (four axes)
Age fit: 80% tagged “Age B” (4–7 yrs)
Diversity: Self-BLEU 0.31 • Distinct-1 0.16
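
For readers unfamiliar with Distinct-1, it is the ratio of unique unigrams to total unigrams in a body of text. The sketch below is illustrative only: it assumes lowercase whitespace tokenisation and a corpus-level ratio, which may differ from the setup behind the 0.16 figure, and the function name `distinct_n` is our own.

```python
def distinct_n(texts, n=1):
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams."""
    total = 0
    unique = set()
    for text in texts:
        # Assumption: lowercase whitespace tokenisation.
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty corpus has no n-grams; return 0.0 rather than dividing by zero.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    sample = ["The fox ran", "The fox hid"]
    # 4 unique words out of 6 in total.
    print(round(distinct_n(sample), 3))
```

Under this definition a higher value means more lexical variety. Low values are expected for a large corpus written with deliberately simple vocabulary.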

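from datasets import load_dataset, disable_caching

disable_caching()  # avoid reusing cached transform results
# Keep only the first 3% of the train split
ds = load_dataset("klusai/ds-tf1-en-3m", split="train[:3%]")
print(ds.shuffle(seed=42)[0]["fable"])  # print one randomly chosen fable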

🛠️ What Can You Do With It?

Fine-tune tiny LMs (1–3B) into bedtime-story generators that run on phones or edge devices.
Build moral-inference benchmarks: given a fable, predict its lesson.
Train alignment critics to verify kid-safe morals in generated text.
Translate the prompt lists and spawn French, Hindi, or Swahili mega-fable sets in a weekend GPU sprint.
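
As a rough illustration of the moral-inference idea, the sketch below turns fable records into multiple-choice items. It assumes each record carries its lesson in a `moral` field alongside `fable`. Only the `fable` column appears in the snippet above, so the `moral` field name, the toy records, and the `make_item` helper are all hypothetical.

```python
import random

# Toy records standing in for dataset rows; the "moral" field is an assumption.
RECORDS = [
    {"fable": "A fox shared her berries with a hungry crow...", "moral": "Kindness is repaid."},
    {"fable": "A tortoise kept walking while the hare slept...", "moral": "Slow and steady wins."},
    {"fable": "A sparrow admitted he broke the nest...", "moral": "Honesty builds trust."},
    {"fable": "A mouse asked the owl for directions...", "moral": "Asking for help is wise."},
]


def make_item(records, index, num_choices=4, seed=0):
    """Build one multiple-choice item: the gold moral plus sampled distractors."""
    rng = random.Random(seed + index)
    gold = records[index]["moral"]
    # Distractors come from the morals of other records, deduplicated and sorted
    # so that sampling is reproducible.
    pool = sorted({r["moral"] for r in records} - {gold})
    choices = rng.sample(pool, num_choices - 1) + [gold]
    rng.shuffle(choices)
    return {
        "fable": records[index]["fable"],
        "choices": choices,
        "answer": choices.index(gold),  # position of the correct moral
    }


if __name__ == "__main__":
    item = make_item(RECORDS, 0)
    print(item["fable"])
    for i, choice in enumerate(item["choices"]):
        print(i, choice)
```

A model is then scored on how often it picks the index stored in `answer`. Harder variants could draw distractors from semantically close morals rather than at random.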

Paper: The TF1-EN-3M Synthetic Fables Dataset: Large-Scale Story Generation with Small Open Models
Authors: Mihai Nădaș, Laura Dioșan, Andreea Tomescu & Andrei Pișcoran (KlusAI Labs & Babeș-Bolyai University)

Happy storytelling! 🎈
