Paper Page - Seed1.5-VL Technical Report

We present Seed1.5-VL, a vision-language foundation model designed to advance
general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed
of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B
active parameters. Despite its relatively compact architecture, it delivers
strong performance across a wide spectrum of public VLM benchmarks and internal
evaluation suites, achieving state-of-the-art performance on 38 out of 60
public benchmarks. Moreover, in agent-centric tasks such as GUI control and
gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI
CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates
strong reasoning abilities, making it particularly effective for multimodal
reasoning challenges such as visual puzzles. We believe these capabilities will
empower broader applications across diverse tasks. This report provides a
comprehensive review of our experience building Seed1.5-VL across model design,
data construction, and training at various stages, in the hope that it will
inspire further research. Seed1.5-VL is now accessible at
https://www.volcengine.com/ (Volcano Engine Model ID:
doubao-1-5-thinking-vision-pro-250428).
