Browsing: Hugging Face
An investigation of modality conflict in multimodal large language models reveals its role in causing hallucinations, with reinforcement learning emerging as…
Foundation models, despite excelling in training tasks, often fail to generalize to new tasks because they rely on task-specific heuristics rather than…
Hierarchical networks replace traditional tokenization pipelines by dynamically learning segmentation strategies, achieving better performance and scalability across various languages and…
The study identifies a linear reasoning bottleneck in Visual-Language Models and proposes the Linear Separability Ceiling as a metric to…
How does an LLM understand the meaning of ‘wRiTe’ when its building blocks—the individual character tokens ‘w’, ‘R’, ‘i’—have no…
PyVision, an interactive framework, enables LLMs to autonomously create and refine Python-based tools for visual reasoning, achieving significant performance improvements…
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely…
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning.…
LangSplatV2 enhances 3D text querying speed and accuracy by replacing the heavyweight decoder with a sparse coefficient field and efficient…
We found that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build…