LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Yang Zhou and 5 other authors
Abstract: Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet hand-crafted generation pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing the decoder layers of the LLM within an MLLM. We introduce a zero-initialized cross-attention adapter that enables efficient knowledge fusion from LLMs into object detectors, a new approach we call LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics, and adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales, and fusion depths further corroborate our design.
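To make the fusion mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a zero-initialized cross-attention adapter in the spirit the abstract describes: detector features attend to hidden states from an LLM decoder layer, and the output projection is initialized to zero so the adapter starts as an identity mapping. The class name, dimensions, and placement of the zero-initialization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ZeroInitCrossAttentionAdapter(nn.Module):
    """Illustrative adapter: detector features cross-attend to LLM hidden states.

    The output projection is zero-initialized, so at the start of training the
    residual branch contributes nothing and the pretrained detector is untouched.
    """

    def __init__(self, det_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(det_dim)
        self.kv_proj = nn.Linear(llm_dim, det_dim)   # map LLM states into detector width
        self.attn = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(det_dim, det_dim)
        nn.init.zeros_(self.out_proj.weight)         # zero-init: adapter is a no-op at start
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, det_feats: torch.Tensor, llm_hidden: torch.Tensor) -> torch.Tensor:
        # det_feats:  (B, N_queries, det_dim)  detector query / text features
        # llm_hidden: (B, N_tokens,  llm_dim)  hidden states from an LLM decoder layer
        kv = self.kv_proj(llm_hidden)
        fused, _ = self.attn(self.norm(det_feats), kv, kv)
        return det_feats + self.out_proj(fused)      # residual branch is exactly zero at init


if __name__ == "__main__":
    # Illustrative widths: 256 for the detector, 896 matching Qwen2-0.5B's hidden size.
    adapter = ZeroInitCrossAttentionAdapter(det_dim=256, llm_dim=896)
    det = torch.randn(2, 900, 256)
    llm = torch.randn(2, 32, 896)
    out = adapter(det, llm)
    assert torch.allclose(out, det)  # holds at init thanks to the zero-initialized projection
```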
Submission history
From: Yang Zhou
[v1] Tue, 18 Mar 2025 00:50:40 UTC (42,633 KB)
[v2] Tue, 20 May 2025 14:22:45 UTC (40,446 KB)
[v3] Thu, 22 May 2025 18:47:26 UTC (40,438 KB)