Videos inherently represent 2D projections of a dynamic 3D world. However,
our analysis suggests that video diffusion models trained solely on raw video
data often fail to capture meaningful geometric-aware structure in their
learned representations. To bridge this gap between video diffusion models and
the underlying 3D nature of the physical world, we propose Geometry Forcing, a
simple yet effective method that encourages video diffusion models to
internalize latent 3D representations. Our key insight is to guide the model’s
intermediate representations toward geometry-aware structure by aligning them
with features from a pretrained geometric foundation model. To this end, we
introduce two complementary alignment objectives: Angular Alignment, which
enforces directional consistency via cosine similarity, and Scale Alignment,
which preserves scale-related information by regressing unnormalized geometric
features from normalized diffusion representation. We evaluate Geometry Forcing
on both camera view-conditioned and action-conditioned video generation tasks.
Experimental results demonstrate that our method substantially improves visual
quality and 3D consistency over the baseline methods. Project page:
https://GeometryForcing.github.io.