World simulation has attracted growing interest for its ability to model
virtual environments and predict the consequences of actions. However, the
limited temporal context window of existing methods often causes failures in
maintaining long-term consistency, particularly in preserving 3D spatial
consistency. In
this work, we present WorldMem, a framework that enhances scene generation with
a memory bank consisting of memory units that store memory frames and states
(e.g., poses and timestamps). By employing a memory attention mechanism that
extracts relevant information from these memory frames based on their states,
our method can accurately reconstruct previously observed scenes, even under
significant viewpoint or temporal gaps.
Furthermore, by incorporating timestamps into the states, our framework not
only models a static world but also captures its dynamic evolution over time,
enabling both perception and interaction within the simulated world. Extensive
experiments in both virtual and real-world scenarios validate the effectiveness of
our approach.
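
To make the mechanism concrete, the sketch below shows one plausible way to realize a state-indexed memory bank and a memory attention layer in PyTorch. All names (MemoryUnit, MemoryBank, MemoryAttention) and design details (FIFO eviction, the pose-plus-timestamp retrieval score, additive state embeddings) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a state-indexed memory bank with memory attention.
# Hypothetical design, assuming a Transformer-style video world model.
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class MemoryUnit:
    frame_feat: torch.Tensor   # (L, D) patch features of a past frame
    pose: torch.Tensor         # (7,) e.g. 3D position + orientation quaternion
    timestamp: float           # when the frame was observed


class MemoryBank:
    """Stores memory units and retrieves those whose states (pose,
    timestamp) are closest to the current query state."""

    def __init__(self, capacity: int = 512):
        self.units: list[MemoryUnit] = []
        self.capacity = capacity

    def add(self, unit: MemoryUnit) -> None:
        self.units.append(unit)
        if len(self.units) > self.capacity:
            self.units.pop(0)  # FIFO eviction; an assumed heuristic

    def retrieve(self, pose, timestamp, k=8, time_weight=0.1):
        # Score by spatial proximity plus a down-weighted temporal gap;
        # the exact scoring rule here is an assumption.
        scores = [
            -torch.norm(u.pose[:3] - pose[:3]).item()
            - time_weight * abs(u.timestamp - timestamp)
            for u in self.units
        ]
        top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
        return [self.units[i] for i in top]


class MemoryAttention(nn.Module):
    """Cross-attention from current-frame tokens to retrieved memory
    tokens, with each memory's state injected as an additive embedding."""

    def __init__(self, dim=256, n_heads=8, state_dim=8):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, dim)  # embed pose + time
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, memories):
        # x: (B, L, D) current-frame tokens; assumes memories is non-empty.
        mem_tokens, mem_states = [], []
        for u in memories:
            mem_tokens.append(u.frame_feat)
            state = torch.cat([u.pose, torch.tensor([u.timestamp])])
            mem_states.append(state.expand(u.frame_feat.shape[0], -1))
        mem = torch.cat(mem_tokens, dim=0)                  # (M, D)
        mem = mem + self.state_proj(torch.cat(mem_states))  # condition on states
        mem = mem.unsqueeze(0).expand(x.shape[0], -1, -1)   # (B, M, D)
        out, _ = self.attn(query=x, key=mem, value=mem)
        return x + out  # residual connection
```

In such a scheme, each newly generated frame would be written back to the bank together with its pose and timestamp, so that later queries, even after large viewpoint changes or long temporal gaps, can still attend to it.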