We present 4DNeX, the first feed-forward framework for generating 4D (i.e.,
dynamic 3D) scene representations from a single image. In contrast to existing
methods that rely on computationally intensive optimization or require
multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D
generation by fine-tuning a pretrained video diffusion model. Specifically, 1)
to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale
dataset with high-quality 4D annotations generated using advanced
reconstruction approaches. 2) we introduce a unified 6D video representation
that jointly models RGB and XYZ sequences, facilitating structured learning of
both appearance and geometry. 3) we propose a set of simple yet effective
adaptation strategies to repurpose pretrained video diffusion models for 4D
modeling. 4DNeX produces high-quality dynamic point clouds that enable
novel-view video synthesis. Extensive experiments demonstrate that 4DNeX
outperforms existing 4D generation methods in efficiency and generalizability,
offering a scalable solution for image-to-4D modeling and laying the foundation
for generative 4D world models that simulate dynamic scene evolution.