Browsing: Hugging Face
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple…
Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces…
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to…
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning…
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D…
The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment.…
Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they…
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and…
Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the fast development…
Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features…