✨ Highlights
Low Latency. VITA-Audio is the first end-to-end speech model capable of generating audio during the initial forward pass. By utilizing a set of 32 prefill tokens, VITA-Audio reduces the time required to generate the first audio token chunk from 236 ms to just 53 ms.
Fast Inference. VITA-Audio achieves an inference speedup of 3-5x at the 7B parameter scale.
Open Source. VITA-Audio is trained only on open-source data, comprising 200k hours of publicly available audio.
Strong Performance. VITA-Audio achieves competitive results on ASR, TTS, and SQA benchmarks among cutting-edge models under 7B parameters.
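To put the Low Latency figures in perspective, the short sketch below works out the ratio and percentage reduction implied by the two quoted numbers (236 ms and 53 ms). The function and constant names are illustrative only and are not part of the VITA-Audio codebase. The result describes first-chunk latency and should not be confused with the separate 3-5x overall inference speedup.

```python
# Minimal sketch: arithmetic on the first-audio-chunk latencies quoted above.
# All names here are hypothetical helpers, not VITA-Audio APIs.

BASELINE_FIRST_CHUNK_MS = 236.0    # latency before, as quoted in the highlights
VITA_AUDIO_FIRST_CHUNK_MS = 53.0   # latency with 32 prefill tokens, as quoted


def latency_speedup(baseline_ms: float, improved_ms: float) -> float:
    """Return how many times lower the improved latency is than the baseline."""
    if baseline_ms <= 0 or improved_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / improved_ms


def latency_reduction_pct(baseline_ms: float, improved_ms: float) -> float:
    """Return the relative latency reduction as a percentage of the baseline."""
    if baseline_ms <= 0 or improved_ms <= 0:
        raise ValueError("latencies must be positive")
    return 100.0 * (baseline_ms - improved_ms) / baseline_ms


speedup = latency_speedup(BASELINE_FIRST_CHUNK_MS, VITA_AUDIO_FIRST_CHUNK_MS)
reduction = latency_reduction_pct(BASELINE_FIRST_CHUNK_MS, VITA_AUDIO_FIRST_CHUNK_MS)

# Prints: 4.45x lower first-chunk latency (77.5% reduction)
print(f"{speedup:.2f}x lower first-chunk latency ({reduction:.1f}% reduction)")
```

In other words, the first audio chunk arrives roughly 4.5 times sooner, saving 183 ms per response.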
📌 Exhibition
Inference Acceleration
Model inference speed under different inference modes.
Time to Generate the First Audio Segment in Streaming Inference
Generated Audio Case
To be or not to be–to live intensely and richly,
merely to exist, that depends on ourselves. Let widen and intensify our relations.
While we live, let live!
The hair has been so little, don’t think about it, go to bed early, for your hair. Good night!
📈 Experimental Results
Comparison of Spoken Question Answering.
Comparison of Text to Speech.
Comparison of Automatic Speech Recognition.
Effectiveness of Inference Acceleration.