VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

✨ Highlights

Low Latency. VITA-Audio is the first end-to-end speech model capable of generating audio during the initial forward pass. Using a set of 32 prefill tokens, it cuts the time to generate the first audio token chunk from 236 ms to 53 ms.
Fast Inference. VITA-Audio achieves an inference speedup of 3-5x at the 7B parameter scale.
Open Source. VITA-Audio is trained on open-source data only, consisting of 200k hours of publicly available audio.
Strong Performance. VITA-Audio achieves competitive results on ASR, TTS, and SQA benchmarks among cutting-edge models under 7B parameters.
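
As a quick sanity check on the latency figures quoted above, the sketch below works out what a drop from 236 ms to 53 ms means as a ratio and as a relative reduction. The constant and function names are illustrative and not taken from the VITA-Audio code base. This first-chunk latency ratio is a separate quantity from the 3-5x inference speedup reported for overall generation.

```python
# Figures quoted in the highlights; the names here are illustrative only.
BASELINE_FIRST_CHUNK_MS = 236.0   # time to first audio token chunk, before
PREFILL_FIRST_CHUNK_MS = 53.0     # time to first audio token chunk, with 32 prefill tokens


def latency_ratio(before_ms: float, after_ms: float) -> float:
    """How many times shorter the new latency is than the old one."""
    return before_ms / after_ms


def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Fraction of the original latency that was removed (0.0 to 1.0)."""
    return (before_ms - after_ms) / before_ms


ratio = latency_ratio(BASELINE_FIRST_CHUNK_MS, PREFILL_FIRST_CHUNK_MS)
reduction = relative_reduction(BASELINE_FIRST_CHUNK_MS, PREFILL_FIRST_CHUNK_MS)

print(f"First-chunk latency ratio: {ratio:.2f}x")
print(f"Relative reduction: {reduction:.1%}")
```

Under these numbers the first audio chunk arrives roughly 4.5 times sooner, which corresponds to removing a little over three quarters of the original wait.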

📌 Exhibition

Inference Acceleration

Model inference speed under different inference modes.

Time to Generate the First Audio Segment In Streaming Inference

Generated Audio Case

To be or not to be–to live intensely and richly,
merely to exist, that depends on ourselves. Let widen and intensify our relations.
While we live, let live!

The hair has been so little, don’t think about it, go to bed early, for your hair. Good night!

📈 Experimental Results

Comparison of Spoken Question Answering.

Comparison of Text to Speech.

Comparison of Automatic Speech Recognition.

Effectiveness of Inference Acceleration.
