Browsing: Hugging Face
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common…
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects,…
We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision…
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but…
Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile,…
Computational color constancy, or white balancing, is a key module in a camera’s image signal processor (ISP) that corrects color…
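For context only: the abstract is cut off before any method details, but the correction it refers to amounts to estimating the scene illuminant and rescaling the color channels to neutralize it. Below is a minimal NumPy sketch of the classic gray-world heuristic, which is not the paper's method and uses a hypothetical function name, purely to illustrate what a white-balance step does.

    import numpy as np

    def gray_world_white_balance(image: np.ndarray) -> np.ndarray:
        """Gray-world white balance for an RGB image with values in [0, 1].

        Assumes the average scene color is neutral gray, so each channel
        is rescaled to share a common mean.
        """
        channel_means = image.reshape(-1, 3).mean(axis=0)      # per-channel mean
        gains = channel_means.mean() / (channel_means + 1e-8)  # pull each channel toward gray
        return np.clip(image * gains, 0.0, 1.0)

    # Example: an image with a warm (reddish) cast gets pulled back toward neutral.
    img = np.random.rand(64, 64, 3) * np.array([1.0, 0.7, 0.6])
    balanced = gray_world_white_balance(img)

In a real ISP this step sits alongside demosaicing and tone mapping, and modern approaches replace the gray-world assumption with learned illuminant estimators.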
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training…
Single-stream architectures built on Vision Transformer (ViT) backbones have recently shown great potential for real-time UAV tracking. However, frequent occlusions from obstacles…
Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about…
We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To…