Recent advances in foundation models highlight a clear trend toward
unification and scaling, with emergent capabilities arising across diverse domains.
While image generation and editing have rapidly transitioned from task-specific
models to unified frameworks, video generation and editing remain fragmented due to
architectural limitations and data scarcity. In this work, we introduce
EditVerse, a unified framework for image and video generation and editing
within a single model. By representing all modalities (text, image, and
video) as a unified token sequence, EditVerse leverages self-attention to
achieve robust in-context learning, natural cross-modal knowledge transfer, and
flexible handling of inputs and outputs with arbitrary resolutions and
durations. To address the lack of video editing training data, we design a
scalable data pipeline that curates 232K video editing samples and combines
them with large-scale image and video datasets for joint training. Furthermore,
we present EditVerseBench, the first benchmark for instruction-based video
editing covering diverse tasks and resolutions. Extensive experiments and user
studies demonstrate that EditVerse achieves state-of-the-art performance,
surpassing existing open-source and commercial models, while exhibiting
emergent editing and generation abilities across modalities.
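To make the unified token sequence idea concrete, the following is a minimal, hypothetical PyTorch sketch of how text, image, and video inputs might be projected into a shared embedding space, concatenated into one sequence, and processed jointly with self-attention. All module names, dimensions, and patch sizes here are illustrative assumptions, not the actual EditVerse architecture.

```python
# Illustrative sketch only: a shared-sequence transformer over three modalities.
# Dimensions, patchification, and layer counts are assumed for demonstration.
import torch
import torch.nn as nn


class UnifiedSequenceModel(nn.Module):
    def __init__(self, vocab_size=1000, patch_dim=3 * 16 * 16, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> d_model
        self.image_proj = nn.Linear(patch_dim, d_model)       # image patches -> d_model
        self.video_proj = nn.Linear(patch_dim, d_model)       # video frame patches -> d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_patches, video_patches):
        # Each modality becomes a token sequence of shape (batch, length_i, d_model);
        # lengths may differ per sample, which is how arbitrary resolutions and
        # durations are accommodated in this sketch.
        tokens = torch.cat(
            [
                self.text_embed(text_ids),
                self.image_proj(image_patches),
                self.video_proj(video_patches),
            ],
            dim=1,
        )
        # Self-attention over the concatenated sequence lets every token attend
        # to every other token, regardless of its modality.
        return self.backbone(tokens)


# Example usage with arbitrary per-modality lengths:
model = UnifiedSequenceModel()
text_ids = torch.randint(0, 1000, (1, 12))           # 12 text tokens
image_patches = torch.randn(1, 64, 3 * 16 * 16)      # 8x8 patches from one image
video_patches = torch.randn(1, 4 * 64, 3 * 16 * 16)  # 4 frames of 8x8 patches
out = model(text_ids, image_patches, video_patches)
print(out.shape)  # torch.Size([1, 332, 256])
```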