AnyCap Project is a unified captioning framework, dataset, and benchmark that supports image, audio, and video captioning with controllable styles. It’s fully open-sourced, covering training, evaluation, and benchmarking!
✨ Highlights
🏆 Unified Multi-modal Captioning
A single framework for:
Image Captioning
Audio Captioning
Video Captioning
All in a single framework, with support for modality-specific components.
📝 Customizable Captioning
Control the content and style of captions via a single user text prompt:
Content: Background, Event, Instance, Action, Instance Appearance, Region and so on
Style: Brief, Detail, Genre, Length, Theme
This allows captions to be tailored to specific user needs; see the illustrative sketch below.
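As a rough illustration only (the actual interface may differ), a control prompt can combine a content focus and a style constraint in one user message. The `build_control_prompt` helper and the commented-out `generate_caption` call below are hypothetical placeholders, not the project's real API; see the repository for actual usage.

```python
# Hypothetical sketch: composing a single controllable-captioning prompt.
# build_control_prompt() and generate_caption() are placeholders, not the
# project's actual API.

def build_control_prompt(content: str, style: str) -> str:
    """Combine a content focus and a style constraint into one user prompt."""
    return (
        f"Describe the {content} of this image. "
        f"Write the caption in a {style} style."
    )

prompt = build_control_prompt(
    content="background and main instances",
    style="brief",
)
print(prompt)
# caption = generate_caption(image="example.jpg", prompt=prompt)  # placeholder call
```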
📊 Open Benchmark & Evaluation: AnyCapEval
An industry-level benchmark with:
Modality-specific test sets (image/audio/video)
Content-related metrics
Style-related metrics
Yields more accurate and lower-variance caption evaluation.
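For a sense of how the two metric families might be reported, here is a minimal aggregation sketch. The file name and score keys (`content_score`, `style_score`) are assumptions for illustration, not AnyCapEval's actual output schema; consult the repository for the real evaluation pipeline.

```python
import json
from statistics import mean

# Hypothetical sketch: averaging per-sample content and style scores.
# The results file layout and key names are assumptions, not the actual
# AnyCapEval format.
def summarize(results_path: str) -> dict:
    with open(results_path) as f:
        results = json.load(f)  # expected: a list of per-sample score dicts
    return {
        "content": mean(r["content_score"] for r in results),
        "style": mean(r["style_score"] for r in results),
    }

# summary = summarize("anycapeval_image_results.json")  # placeholder path
```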
🛠️ End-to-End Open Source
Everything you need is included:
✅ Full training data
✅ Model inference pipeline
✅ Evaluation benchmark
All available under a permissive open-source license.
🔗 Get Started
Check out the paper and code:
📄 Paper: arXiv:2507.12841
📦 Code & Models: GitHub
📬 Contact
For questions, collaborations, or benchmark submissions, please reach out via the paper’s contact email.