Versatile Multimodal Controls for Expressive Talking Human Animation
Zheng Qin and 7 other authors
Abstract: In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements: users not only need automatic generation of lip synchronization and basic gestures from audio input but also want semantically accurate, expressive body movements that can be “directly guided” through text descriptions. We therefore present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures, and even leg movements for whole-body images. In addition, we introduce a multimodal-controlled video diffusion model that generates photorealistic videos, in which speech signals govern lip synchronization, facial expressions, and head motions, while body movements are guided by 2D poses. Furthermore, we introduce a token2pose translator that smoothly maps 3D motion tokens to 2D pose sequences; this design mitigates the stiffness of direct 3D-to-2D conversion and enhances the detail of the generated body movements. Extensive experiments show that VersaAnimator synthesizes lip-synced, identity-preserving videos while generating expressive and semantically meaningful whole-body motions.
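The abstract describes a pipeline of three stages: a motion generator (audio plus optional text prompt to whole-body 3D motion tokens), a token2pose translator (3D motion tokens to 2D pose sequences), and a multimodal-controlled video diffusion model (portrait plus speech plus 2D poses to video frames). The following sketch only illustrates that data flow; all module names, tensor shapes, and interfaces are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the VersaAnimator inference flow as described in the
# abstract. Shapes, dimensions, and module designs are illustrative assumptions.
import torch
import torch.nn as nn


class MotionGenerator(nn.Module):
    """Maps audio features (plus an optional text-prompt embedding) to
    whole-body 3D motion tokens."""
    def __init__(self, audio_dim=128, text_dim=64, token_dim=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + text_dim, token_dim)

    def forward(self, audio_feats, text_emb):
        # audio_feats: (T, audio_dim), text_emb: (text_dim,)
        text = text_emb.unsqueeze(0).expand(audio_feats.size(0), -1)
        return self.proj(torch.cat([audio_feats, text], dim=-1))  # (T, token_dim)


class Token2PoseTranslator(nn.Module):
    """Smoothly maps 3D motion tokens to 2D pose keypoint sequences."""
    def __init__(self, token_dim=256, num_joints=18):
        super().__init__()
        self.head = nn.Linear(token_dim, num_joints * 2)

    def forward(self, motion_tokens):
        poses = self.head(motion_tokens)                 # (T, num_joints * 2)
        return poses.view(motion_tokens.size(0), -1, 2)  # (T, num_joints, 2)


def animate(portrait, audio_feats, text_emb, video_diffusion):
    """Audio + text -> 3D motion tokens -> 2D poses -> video frames.
    `video_diffusion` stands in for the multimodal-controlled diffusion model,
    which conditions lips/face/head on speech and the body on the 2D poses."""
    tokens = MotionGenerator()(audio_feats, text_emb)
    poses = Token2PoseTranslator()(tokens)
    return video_diffusion(portrait, audio_feats, poses)


# Toy usage with random tensors standing in for real features and models.
frames = animate(
    portrait=torch.rand(3, 512, 512),
    audio_feats=torch.rand(100, 128),
    text_emb=torch.rand(64),
    video_diffusion=lambda img, audio, poses: torch.rand(poses.size(0), 3, 512, 512),
)
print(frames.shape)  # torch.Size([100, 3, 512, 512])
```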
Submission history
From: Zheng Qin
[v1] Mon, 10 Mar 2025 08:38:25 UTC (14,076 KB)
[v2] Sun, 16 Mar 2025 10:09:52 UTC (14,076 KB)
[v3] Wed, 16 Apr 2025 02:43:12 UTC (13,142 KB)