Existing video generation models excel at producing photo-realistic videos
from text or images, but often lack physical plausibility and 3D
controllability. To overcome these limitations, we introduce PhysCtrl, a novel
framework for physics-grounded image-to-video generation with physical
parameters and force control. At its core is a generative physics network that
learns the distribution of physical dynamics across four materials (elastic,
sand, plasticine, and rigid) via a diffusion model conditioned on physics
parameters and applied forces. We represent physical dynamics as 3D point
trajectories and train on a large-scale synthetic dataset of 550K animations
generated by physics simulators. We enhance the diffusion model with a novel
spatiotemporal attention block that emulates particle interactions and
incorporates physics-based constraints during training to enforce physical
plausibility. Experiments show that PhysCtrl generates realistic,
physics-grounded motion trajectories which, when used to drive image-to-video
models, yield high-fidelity, controllable videos that outperform existing
methods in both visual quality and physical plausibility. Project Page:
https://cwchenwang.github.io/physctrl