Omnimodal Referring Audio-Visual Segmentation (OmniAVS) and Omnimodal Instructed Segmentation Assistant (OISA) advance audio-visual segmentation by integrating complex multimodal expressions and leveraging MLLM for reasoning.
Referring audio-visual segmentation (RAVS) has recently seen significant
advancements, yet challenges remain in integrating multimodal information and
deeply understanding and reasoning about audiovisual content. To extend the
boundaries of RAVS and facilitate future research in this field, we propose
Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset
containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS
stands out with three key innovations: (1) 8 types of multimodal expressions
that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on
understanding audio content beyond just detecting their presence; and (3) the
inclusion of complex reasoning and world knowledge in expressions. Furthermore,
we introduce Omnimodal Instructed Segmentation Assistant (OISA), to address the
challenges of multimodal reasoning and fine-grained understanding of
audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and
perform reasoning-based segmentation. Extensive experiments show that OISA
outperforms existing methods on OmniAVS and achieves competitive results on
other related tasks.