Browsing: Hugging Face
Single-stream architectures using Vision Transformer (ViT) backbones have recently shown great potential for real-time UAV tracking. However, frequent occlusions from obstacles…
Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about…
We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To…
While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language…
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from…
World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions.…
Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and…
Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex…
The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base…
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in…