Paper Page - FusionAudio-1.2M: Towards Fine-grained Audio Captioning With Multimodal Contextual Fusion

A novel two-stage pipeline using specialized pretrained models and a large language model enhances audio caption quality by integrating diverse multimodal cues and contextual information.

High-quality, large-scale audio captioning is crucial for advancing audio
understanding, yet current automated methods often generate captions that lack
fine-grained detail and contextual accuracy, primarily due to their reliance on
limited unimodal or superficial multimodal information. Drawing inspiration
from human auditory perception, which adeptly integrates cross-modal cues and
performs sophisticated auditory scene analysis, we introduce a novel two-stage
automated pipeline. This pipeline first employs specialized pretrained models
to extract diverse contextual cues (e.g., speech, music, general sounds, and
visual information from associated video). A large language model (LLM) then
synthesizes these rich, multimodal inputs to generate detailed and
context-aware audio captions. Key contributions of this work include: (1) the
proposed scalable method for fine-grained audio caption generation; (2)
FusionAudio, a new large-scale dataset comprising 1.2 million such detailed
captions, combined with 6 million QA pairs; and (3) enhanced audio models
developed using FusionAudio, specifically a CLAP-based audio encoder with
superior audio-text alignment and instruction following. This paper paves the
way for more nuanced and accurate automated understanding of complex audio
environments. Code and data are available at
https://github.com/satsuki2486441738/FusionAudio.
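
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the extractor callables, the `build_prompt` and `caption_clip` helpers, the prompt wording, and the `llm` callable are all hypothetical placeholders standing in for the specialized pretrained models and the LLM mentioned in the abstract.

```python
from typing import Callable, Dict

# Hypothetical type for a stage-1 model: takes a clip identifier and
# returns a short textual cue (e.g. a transcript or a sound-event label).
Extractor = Callable[[str], str]


def build_prompt(cues: Dict[str, str]) -> str:
    """Merge the non-empty cues into a single instruction for an LLM.

    The prompt wording is an assumption made for illustration only.
    """
    lines = [f"- {name}: {text}" for name, text in cues.items() if text.strip()]
    header = "Write one detailed, context-aware audio caption from these cues.\n"
    return header + "\n".join(lines)


def caption_clip(
    clip_id: str,
    extractors: Dict[str, Extractor],
    llm: Callable[[str], str],
) -> str:
    """Stage 1: run every extractor. Stage 2: let the LLM fuse the cues."""
    cues = {name: extract(clip_id) for name, extract in extractors.items()}
    return llm(build_prompt(cues))


if __name__ == "__main__":
    # Stub extractors; real ones would wrap speech, music, general-sound,
    # and video models. The returned strings are made-up examples.
    stub_extractors: Dict[str, Extractor] = {
        "speech": lambda clip: "a woman greets a crowd",
        "music": lambda clip: "",  # no music detected in this clip
        "sound": lambda clip: "applause, distant traffic",
        "visual": lambda clip: "outdoor stage in a city square",
    }
    # Stub LLM that simply echoes its prompt, so the sketch runs offline.
    print(caption_clip("clip_0", stub_extractors, llm=lambda prompt: prompt))
```

Keeping the extractors and the LLM behind plain callables means each stage can be swapped independently, which is one plausible way to read the "scalable" claim; the paper itself should be consulted for the models and prompts actually used.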
