Browsing: Hugging Face
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to…
VAU-R1 uses Multimodal Large Language Models with Reinforcement Fine-Tuning to enhance video anomaly reasoning, complemented by VAU-Bench, a Chain-of-Thought benchmark…
Abstract Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However,…
SATA-BENCH evaluates LLMs on multi-answer questions, revealing selections biases and proposing Choice Funnel to improve accuracy and reduce costs in…
📃Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in…
Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data…
Rule-based reinforcement learning applied to multimodal large language models demonstrates effective generalization in visual tasks, particularly using jigsaw puzzles, outperforming…
A unified vision language action framework, LoHoVLA, combines a large pretrained vision language model with hierarchical closed-loop control to improve…
MagiCodec, a Transformer-based audio codec, enhances semantic tokenization while maintaining high reconstruction quality, improving compatibility with generative models. Neural audio…
Prolonged reinforcement learning training (ProRL) uncovers novel reasoning strategies in language models, outperforming base models and suggesting meaningful expansion of…