📣 C3 Benchmark: The Challenging Benchmark for Bilingual Speech Dialogue Models!
🎙️ C3 is the first-ever benchmark dataset that tests complex phenomena in speech dialogues, covering pauses, homophones, stress, intonation, syntactic ambiguity, coreference, omission, and multi-turn conversations.
📊 With 1,079 real-world scenarios and 1,586 audio-text pairs, it leaves speech dialogue models struggling to keep up!
🔥 Challenge Examples:
“He saw the man / with glasses” vs “He saw / the man with glasses”: Is it he who has the glasses, or the man?
“Mr. Smith loves music more than his wife”: Does it mean “Mr. Smith loves music more than he loves his wife” or “Mr. Smith loves music more than his wife does”?
“Joan made sure to thank Susan for all the help she had received”: Does “she” refer to Joan or Susan?
📈 Evaluation Results (As of July 30, 2025):
Best Model in Chinese: Qwen2.5-Omni (40.08%)
Best Model in English: GPT-4o-Audio-Preview (55.68%)
🔗 Experience C3 Now:
🔥 Limited Time Offer! Until Sept. 1, 2025, we will run the evaluation script on your speech dialogue model's (SDM's) results for our benchmark, free of charge. After that, you can run the evaluation independently. To participate, email chengqianma@yeah.net with the subject: [C3Bench Evaluation] – [Model_Name]