Paper page - FG-CLIP: Fine-Grained Visual and Textual Alignment

Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks
such as image-text retrieval and zero-shot classification but struggles with
fine-grained understanding due to its focus on coarse-grained short captions.
To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances
fine-grained understanding through three key innovations. First, we leverage
large multimodal models to generate 1.6 billion long caption-image pairs for
capturing global-level semantic details. Second, a high-quality dataset is
constructed with 12 million images and 40 million region-specific bounding
boxes aligned with detailed captions to ensure precise, context-rich
representations. Third, 10 million hard fine-grained negative samples are
incorporated to improve the model’s ability to distinguish subtle semantic
differences. Training methods are designed specifically for each of these
data sources. Extensive experiments demonstrate that FG-CLIP outperforms the original
CLIP and other state-of-the-art methods across various downstream tasks,
including fine-grained understanding, open-vocabulary object detection,
image-text retrieval, and general multimodal benchmarks. These results
highlight FG-CLIP’s effectiveness in capturing fine-grained image details and
improving overall model performance. The related data, code, and models are
available at https://github.com/360CVGroup/FG-CLIP.
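
To make the role of hard fine-grained negatives concrete, here is a minimal sketch in plain Python. It is not taken from the FG-CLIP codebase. It assumes that each image (or region) embedding is scored against one matching caption and K hard negative captions, and that the loss is a softmax cross-entropy with the matching caption as the target. The function names, the toy 2-D embeddings, and the temperature of 0.07 are illustrative assumptions; the paper's actual objective may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_loss(image_emb, positive_emb, negative_embs, temperature=0.07):
    """Cross-entropy over one positive caption and K hard negative captions.

    Hypothetical helper: the positive sits at index 0, so the loss is
    -log softmax(logits)[0]. The temperature value is an assumption.
    """
    candidates = [positive_emb] + list(negative_embs)
    logits = [cosine(image_emb, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


# Toy embeddings (made up for illustration only).
image = [1.0, 0.0]
caption = [1.0, 0.0]
easy_loss = hard_negative_loss(image, caption, [[0.0, 1.0]])   # unrelated negative
hard_loss = hard_negative_loss(image, caption, [[0.99, 0.1]])  # near-duplicate negative
print(easy_loss < hard_loss)
```

Under these assumptions, a negative caption that is almost identical to the positive yields a much larger loss than an unrelated one, which is why such samples push the model to separate subtle semantic differences.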
