Browsing: Hugging Face
RoPECraft is a training-free method that modifies rotary positional embeddings in diffusion transformers to transfer motion from reference videos, enhancing…
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning. Modern BPE tokenizers often split calendar dates into meaningless fragments,…
An enhanced multimodal language model incorporates thinking process rewards to improve reasoning and generalization, achieving superior performance on benchmarks compared…
Project Page: https://haoningwu3639.github.io/SpatialScore/ Paper: https://arxiv.org/abs/2505.17012/ Code: https://github.com/haoningwu3639/SpatialScore/ Data: https://huggingface.co/datasets/haoningwu/SpatialScore We are currently organizing our data and code, and expect to open-source them within…
A benchmark called VideoGameQA-Bench is introduced to assess Vision-Language Models in video game quality assurance tasks. With video games now…
A novel method called GRIT enhances visual reasoning in MLLMs by generating reasoning chains that integrate both natural language and…
SafeKey enhances the safety of large reasoning models by focusing on activating a safety aha moment in the key sentence…
The FRANK Model enhances multimodal LLMs with reasoning and reflection abilities without retraining, using a hierarchical weight merging approach that…
Robo2VLM, a framework for generating Visual Question Answering datasets using robot trajectory data, enhances and evaluates Vision-Language Models by leveraging…
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal…