AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

arXiv:2504.07836v1 (announce type: cross)
Abstract: Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges: for example, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, so positional relations must be emphasized. In addition, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model designed specifically for the AerialVG task, in which a Hierarchical Cross-Attention module is devised to focus on target regions and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
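
To make "relative spatial relations" concrete, here is a minimal sketch of how a coarse positional relation between two annotated objects might be derived from their bounding boxes. It is an illustration only, not the paper's Relation-Aware Grounding module or the dataset's annotation scheme. The box format `(x_min, y_min, x_max, y_max)`, the function names `center` and `coarse_relation`, and the four-way relation vocabulary are all assumptions made for this example.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def coarse_relation(target: Box, reference: Box) -> str:
    """Describe where `target` lies relative to `reference`.

    Hypothetical helper: compares box centers and reports the dominant
    axis of displacement. Image coordinates are assumed, so y grows downward.
    """
    tx, ty = center(target)
    rx, ry = center(reference)
    dx, dy = tx - rx, ty - ry
    if dx == 0 and dy == 0:
        return "overlapping"  # identical centers, no direction to report
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


# Two visually similar objects can only be told apart by position.
car_a: Box = (0, 0, 10, 10)
car_b: Box = (20, 0, 30, 10)
print(coarse_relation(car_a, car_b))
```

A description such as "the car to the left of the other car" cannot be resolved from appearance alone when both cars look alike; a comparison like the one above is the simplest kind of spatial signal that could separate them.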
