We introduce the task of arbitrary spatio-temporal video completion, where a
video is generated from arbitrary, user-specified patches placed at any spatial
location and timestamp, akin to painting on a video canvas. This flexible
formulation naturally unifies many existing controllable video generation
tasks, including first-frame image-to-video, inpainting, extension, and
interpolation, under a single, cohesive paradigm. Realizing this vision,
however, faces a fundamental obstacle in modern latent video diffusion models:
the temporal ambiguity introduced by causal VAEs, where multiple pixel frames
are compressed into a single latent representation, making precise frame-level
conditioning structurally difficult. We address this challenge with
VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC)
paradigm to this fine-grained control task with zero new parameters. We propose
a hybrid conditioning strategy that decouples spatial and temporal control:
spatial placement is handled via zero-padding, while temporal alignment is
achieved through Temporal RoPE Interpolation, which assigns each condition a
continuous fractional position within the latent sequence. This resolves the
VAE’s temporal ambiguity and enables pixel-frame-aware control on a frozen
backbone. To evaluate this new capability, we develop VideoCanvasBench, the
first benchmark for arbitrary spatio-temporal video completion, covering both
intra-scene fidelity and inter-scene creativity. Experiments demonstrate that
VideoCanvas significantly outperforms existing conditioning paradigms,
establishing a new state of the art in flexible and unified video generation.
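For concreteness, the following is a minimal sketch of how a conditioned pixel frame could be assigned a continuous (fractional) temporal position on the latent axis and passed through a standard RoPE computation. The temporal stride of 4, the function names, and the simple linear frame-to-position mapping are illustrative assumptions for exposition, not the paper's exact implementation.

    import torch

    def fractional_latent_position(pixel_frame_idx, temporal_stride=4):
        # Map a pixel-frame index to a continuous position on the latent time axis.
        # With a causal VAE compressing `temporal_stride` pixel frames per latent,
        # a condition placed at pixel frame 10 lands at latent position 2.5 instead
        # of being snapped to integer latent index 2 or 3.
        return pixel_frame_idx / temporal_stride

    def rope_angles(position, dim, base=10000.0):
        # Rotary angles evaluated at a (possibly fractional) temporal position;
        # the resulting cos/sin pairs rotate the query/key features of the
        # conditioned latent in the usual RoPE fashion.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        return position * inv_freq

    pos = fractional_latent_position(pixel_frame_idx=10, temporal_stride=4)  # 2.5
    angles = rope_angles(pos, dim=64)
    cos, sin = angles.cos(), angles.sin()  # applied to query/key features

Under these assumptions, the condition's temporal identity is expressed purely through the rotary position it receives, which is why no new parameters are needed and the backbone can remain frozen.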