Paper Page - VITA-Audio: Fast Interleaved Cross-Modal Token Generation For Efficient Large Speech-Language Model

✨ Highlights

Low Latency. VITA-Audio is the first end-to-end speech model capable of generating audio during the initial forward pass. By utilizing a set of 32 prefill tokens, VITA-Audio reduces the time required to generate the first audio token chunk from 236 ms to just 53 ms.
Fast Inference. VITA-Audio achieves an inference speedup of 3-5x at the 7B parameter scale.
Open Source. VITA-Audio is trained on open-source data only, consisting of 200k hours of publicly available audio.
Strong Performance. VITA-Audio achieves competitive results on ASR, TTS, and SQA benchmarks among cutting-edge models under 7B parameters.
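
The latency figures quoted above can be turned into a ratio with a few lines of arithmetic. The following is only a back-of-the-envelope sketch based on the two numbers in the Low Latency highlight (236 ms and 53 ms); the variable names are illustrative, and it is not the authors' measurement code. The first-chunk ratio is also a different quantity from the 3-5x overall inference speedup reported in the Fast Inference highlight.

```python
# Figures quoted in the Highlights above, in milliseconds.
baseline_ms = 236.0    # time to first audio token chunk before the reduction
vita_audio_ms = 53.0   # time to first audio token chunk with 32 prefill tokens

# Ratio of the two latencies and the relative reduction.
speedup = baseline_ms / vita_audio_ms
reduction_pct = 100.0 * (baseline_ms - vita_audio_ms) / baseline_ms

print(f"first-chunk speedup: {speedup:.2f}x")
print(f"latency reduction: {reduction_pct:.1f}%")
```

Under these numbers the first audio chunk arrives roughly 4.5 times sooner, a reduction of about 78% in waiting time.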

📌 Exhibition

Inference Acceleration

Model inference speed under different inference modes.

Time to Generate the First Audio Segment In Streaming Inference
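
For readers who want to reproduce this kind of measurement on their own setup, a minimal sketch of timing the first chunk of any streaming generator follows. Everything here is an assumption for illustration: `time_to_first_chunk` and `fake_audio_stream` are hypothetical names, and the fake stream merely sleeps to imitate model latency; it does not call VITA-Audio or any real inference API.

```python
import time
from typing import Callable, Iterator, Tuple


def time_to_first_chunk(
    make_stream: Callable[[], Iterator[bytes]]
) -> Tuple[float, bytes]:
    """Return (elapsed milliseconds, first chunk) for a streaming generator."""
    start = time.perf_counter()
    stream = make_stream()
    first = next(stream)  # blocks until the first chunk is produced
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first


def fake_audio_stream(first_delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a model's streaming audio output."""
    time.sleep(first_delay_s)  # imitates the latency before the first chunk
    for i in range(n_chunks):
        yield bytes([i]) * 4   # dummy 4-byte "audio" chunk


elapsed_ms, chunk = time_to_first_chunk(fake_audio_stream)
print(f"first chunk after {elapsed_ms:.1f} ms, {len(chunk)} bytes")
```

Replacing `fake_audio_stream` with a callable that starts a real streaming decode would give a comparable time-to-first-segment number, provided the timer starts before the request is issued.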

Generated Audio Case

To be or not to be–to live intensely and richly,
merely to exist, that depends on ourselves. Let widen and intensify our relations.
While we live, let live!

The hair has been so little, don’t think about it, go to bed early, for your hair. Good night!

📈 Experimental Results

Comparison of Spoken Question Answering.

Comparison of Text to Speech.

Comparison of Automatic Speech Recognition.

Effectiveness of Inference Acceleration.
