SAKURA is introduced to evaluate the multi-hop reasoning abilities of large audio-language models, revealing their struggles in integrating speech/audio representations.
Large audio-language models (LALMs) extend large language models with
multimodal understanding of speech, audio, and other modalities. While their
performance on speech- and audio-processing tasks has been extensively studied,
their reasoning abilities remain underexplored. In particular, their multi-hop
reasoning, the ability to recall and integrate multiple facts, lacks systematic
evaluation.
Existing benchmarks cover general speech- and audio-processing tasks,
conversational abilities, and fairness, but overlook this aspect. To bridge this
gap, we introduce SAKURA, a benchmark assessing LALMs’ multi-hop reasoning
based on speech and audio information. Results show that LALMs struggle to
integrate speech/audio representations for multi-hop reasoning, even when they
extract the relevant information correctly, highlighting a fundamental
challenge in multimodal reasoning. Our findings expose a critical limitation in
LALMs, offering insights and resources for future research.