HumanSense: From Multimodal Perception To Empathetic Context-Aware Responses Through Reasoning MLLMs - Takara TLDR

While Multimodal Large Language Models (MLLMs) show immense promise for
achieving truly human-like interactions, progress is hindered by the lack of
fine-grained evaluation frameworks for human-centered scenarios, encompassing
both the understanding of complex human intentions and the provision of
empathetic, context-aware responses. Here we introduce HumanSense, a
comprehensive benchmark designed to evaluate the human-centered perception and
interaction capabilities of MLLMs, with a particular focus on deep
understanding of extended multimodal contexts and the formulation of rational
feedback. Our evaluation reveals that leading MLLMs still have considerable
room for improvement, particularly for advanced interaction-oriented tasks.
Supplementing visual input with audio and text information yields substantial
improvements, and Omni-modal models show advantages on these tasks.
Furthermore, we argue that appropriate feedback stems from a contextual
analysis of the interlocutor’s needs and emotions, with reasoning ability
serving as the key to unlocking it. Accordingly, we employ multi-stage,
modality-progressive reinforcement learning to enhance the reasoning abilities
of an Omni model, achieving substantial gains in evaluation results.
Additionally, we observe that successful reasoning processes exhibit highly
consistent thought patterns. By designing corresponding prompts, we also
enhance the performance of non-reasoning models in a training-free manner.
Project page: https://digital-avatar.github.io/ai/HumanSense/
