Browsing: Hugging Face
VideoMolmo is a a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo…
A framework, TR2M, uses multimodal inputs to rescale relative depth to metric depth, enhancing performance across various datasets through cross-modality…
CAMS integrates an agentic framework with urban-knowledgeable large language models to simulate human mobility more realistically by modeling individual and…
Ring-lite uses a MoE architecture and reinforcement learning to efficiently match SOTA reasoning models while activating fewer parameters and addressing…
LC-R1, a post-training method guided by Brevity and Sufficiency principles, reduces unnecessary reasoning in Large Reasoning Models with minimal accuracy…
Ambient Diffusion Omni framework leverages low-quality images to enhance diffusion models by utilizing properties of natural images and shows improvements…
Ego-R1, a reinforcement learning-based framework, uses a structured tool-augmented chain-of-thought process to reason over ultra-long egocentric videos, achieving better performance…
With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how tobuild LLM systems that can generate…
Test3R, a test-time learning technique for 3D reconstruction, enhances geometric accuracy by optimizing network consistency using self-supervised learning on image…
The paper presents a method for generating diverse and complex instruction data for large language models using attributed grounding, achieving…