Spine disorders affect 619 million people globally and are a leading cause of
disability, yet AI-assisted diagnosis remains limited by the lack of
level-aware, multimodal datasets. Clinical decision-making for spine disorders
requires sophisticated reasoning across X-ray, CT, and MRI at specific
vertebral levels. However, progress has been constrained by the absence of
traceable, clinically grounded instruction data and standardized,
spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem
co-designed with practicing spine surgeons. It features SpineMed-450k, the
first large-scale dataset of over 450,000 instruction instances explicitly
designed for vertebral-level reasoning across imaging modalities, and
SpineBench, a clinically grounded evaluation framework. SpineMed-450k is
curated from diverse sources, including textbooks, guidelines, open datasets,
and ~1,000 de-identified hospital cases. A clinician-in-the-loop pipeline with
two-stage LLM generation (draft, then revision) ensures high-quality,
traceable data for question answering, multi-turn consultations, and report
generation. SpineBench evaluates models on clinically salient axes,
including level identification, pathology assessment, and surgical planning.
Our comprehensive evaluation of several recent, advanced large vision-language
models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained,
level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k
demonstrates consistent and significant improvements across all tasks.
Clinician assessments confirm the diagnostic clarity and practical utility of
our model’s outputs.
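
For illustration only, the following minimal Python sketch shows one way the
two-stage (draft, then revise) generation step with a clinician review gate
could be wired up; call_llm, clinician_approved, and the prompts are
hypothetical stand-ins, not the actual SpineMed-450k pipeline code.

    # Hypothetical sketch of a two-stage (draft -> revise) generation step
    # with a clinician review gate; all names and prompts are illustrative.

    def call_llm(prompt: str) -> str:
        # Stand-in for any chat-completion API; replace with a real model call.
        return f"[model output for: {prompt[:40]}...]"

    def clinician_approved(item: dict) -> bool:
        # Stand-in for the human review step of the clinician-in-the-loop pipeline.
        return True

    def generate_instruction(source_text: str, citation: str) -> dict:
        # Stage 1: draft a QA pair grounded only in a traceable source passage.
        draft = call_llm(
            "Write a vertebral-level question-answer pair based only on this "
            f"passage:\n{source_text}"
        )
        # Stage 2: revise the draft for clinical accuracy and clarity.
        revised = call_llm(
            f"Revise this question-answer pair for clinical accuracy:\n{draft}"
        )
        # Attach the source citation so every instance stays traceable.
        return {"instruction": revised, "source": citation}

    if __name__ == "__main__":
        item = generate_instruction(
            "An L4-L5 disc herniation typically compresses the traversing L5 root.",
            citation="textbook:ch12:p341",
        )
        if clinician_approved(item):
            print(item)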