Manipulation has long been a challenging task for robots, whereas humans can effortlessly perform complex interactions with objects, such as…
Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they…
Rapid Large Language Model (LLM) advancements are fueling autonomous Multi-Agent System (MAS) development. However, current frameworks often lack flexibility, resource…
STARFlow, a generative model combining normalizing flows with autoregressive Transformers, achieves competitive image synthesis performance with innovations in architecture and…
A new framework enhances video world models’ long-term consistency by integrating a geometry-grounded long-term spatial memory mechanism. Emerging world models…
RoboRefer, a 3D-aware vision language model, enhances spatial understanding and multi-step reasoning in embodied robots through supervised and reinforcement fine-tuning,…
VideoREPA enhances text-to-video synthesis by aligning token-level relations and distilling physics understanding from foundation models into T2V models. Recent advancements…
A novel framework using flow-based generative models aligns learnable latent spaces to target distributions, reducing computational expense and improving log-likelihood…
The Qwen3 Embedding series, built on Qwen3 foundation models, offers advanced text embedding and reranking capabilities through a multi-stage training…
Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs,…