What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations, by Dongqi Liu and 8 other authors
Abstract: Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
Submission history
From: Dongqi Liu
[v1] Wed, 12 Feb 2025 10:36:55 UTC (5,406 KB)
[v2] Mon, 17 Feb 2025 12:01:02 UTC (5,398 KB)
[v3] Wed, 26 Feb 2025 13:57:59 UTC (5,406 KB)
[v4] Sat, 24 May 2025 14:14:01 UTC (4,062 KB)