Matching the human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied
intelligent systems. Recent vision-language-action (VLA) models, which are
co-trained on large-scale robot and visual-text data, have demonstrated notable
progress in general robot control. However, they still fail to achieve
human-level flexibility in interleaved reasoning and interaction. In this work,
we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is
a unified embodied foundation model that achieves superior performance in
multimodal embodied reasoning and robot control through interleaved
vision-text-action pre-training. The development of EO-1 is based on two key
pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive,
high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains
over 1.5 million samples with an emphasis on interleaved vision-text-action
comprehension. EO-1 is trained on EO-Data1.5M through the synergy of auto-regressive
decoding and flow-matching denoising, enabling seamless robot
action generation and multimodal embodied reasoning. Extensive experiments
demonstrate the effectiveness of interleaved vision-text-action learning for
open-world understanding and generalization, validated through a variety of
long-horizon, dexterous manipulation tasks across multiple embodiments. This
paper details the architecture of EO-1, the data construction strategy of
EO-Data1.5M, and the training methodology, offering valuable insights for
developing advanced embodied foundation models.