Browsing: Hugging Face
FlowDirector, an inversion-free video editing framework, uses ODEs for spatiotemporally coherent editing and attention-guided masking for localized control, achieving state-of-the-art…
MARBLE utilizes material embeddings in CLIP-space to control pre-trained text-to-image models for blending and recomposing material properties in images with…
A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing…
CG-AV-Counting is a new benchmark for video counting tasks that includes multimodal data and supports end-to-end and reasoning-based models. AV-Reasoner,…
MedAgentGYM, a training environment for coding-based medical reasoning in LLMs, enhances performance through supervised fine-tuning and reinforcement learning, providing a…
A framework called Micro-Act addresses knowledge conflicts in Retrieval-Augmented Generation by adaptively decomposing knowledge sources, leading to improved QA accuracy…
Rectified Point Flow unifies pairwise point cloud registration and multi-part shape assembly through a continuous point-wise velocity field, achieving state-of-the-art…
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
VideoMathQA evaluates models’ ability to perform temporally extended cross-modal reasoning across various mathematical domains in video settings, addressing direct problem…
A reinforcement learning framework for LLMs enhances contextual integrity by reducing inappropriate information disclosure and maintaining task performance across various…
SkyReels-Audio is a unified framework using pretrained video diffusion transformers for generating high-fidelity and coherent audio-conditioned talking portrait videos, supported…