Collaborative Multi-Modal Coding For High-Quality 3D Generation - Takara TLDR

3D content inherently encompasses multi-modal characteristics and can be
projected into different modalities (e.g., RGB images, RGBD, and point clouds).
Each modality exhibits distinct advantages in 3D asset modeling: RGB images
contain vivid 3D textures, whereas point clouds define fine-grained 3D
geometries. However, most existing 3D-native generative architectures either
operate predominantly within single-modality paradigms-thus overlooking the
complementary benefits of multi-modality data-or restrict themselves to 3D
structures, thereby limiting the scope of available training datasets. To
holistically harness multi-modalities for 3D modeling, we present TriMM, the
first feed-forward 3D-native generative model that learns from basic
multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM
first introduces collaborative multi-modal coding, which integrates
modality-specific features while preserving their unique representational
strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to
raise the robustness and performance of multi-modal coding. 3) Based on the
embedded multi-modal code, TriMM employs a triplane latent diffusion model to
generate 3D assets of superior quality, enhancing both the texture and the
geometric detail. Extensive experiments on multiple well-known datasets
demonstrate that TriMM, by effectively leveraging multi-modality, achieves
competitive performance with models trained on large-scale datasets, despite
utilizing a small amount of training data. Furthermore, we conduct additional
experiments on recent RGB-D datasets, verifying the feasibility of
incorporating other multi-modal datasets into 3D generation.

Source link

What's Hot

MIT startup Commonwealth Fusion Systems raises $863 million

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves

Vocal Image is using AI to help people communicate better

Collaborative Multi-Modal Coding for High-Quality 3D Generation – Takara TLDR

Self-Rewarding Vision-Language Model via Reasoning Decomposition – Takara TLDR

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning – Takara TLDR

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies – Takara TLDR

London Museum Secures Banksy’s Piranhas

Egyptian Antiquities Trafficker Sentenced to Six Months in Prison

Sotheby’s to Launch First Series of Luxury Auctions in Abu Dhabi

Nazi-Looted Painting Turns Up in Argentinean Real Estate Listing

MIT startup Commonwealth Fusion Systems raises $863 million

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves

Vocal Image is using AI to help people communicate better

What's Hot

Collaborative Multi-Modal Coding for High-Quality 3D Generation – Takara TLDR

Related Posts

Subscribe to Updates