Ke-Omni-R: Advanced Audio Reasoning Through Thinking

GitHub: https://github.com/shuaijiang/Ke-Omni-R (open-source training and inference code)

Contribution: incorporates GRPO and a thinking process into the reinforcement-learning training of large speech models.

  • [1] Xie, Zhifei, et al. “Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models.” arXiv preprint arXiv:2503.02318.
  • [2] Ma, Ziyang, et al. “Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model.” arXiv preprint arXiv:2501.07246.
  • [3] Li, Gang, et al. “Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering.” arXiv preprint arXiv:2503.11197.
  • [4] Xu, Jin, et al. “Qwen2.5-Omni Technical Report.” arXiv preprint arXiv:2503.20215.

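Ke-Omni-R is an advanced audio reasoning model built on Qwen2.5-Omni. It uses reinforcement learning to introduce a deep thinking process, which improves understanding and reasoning on complex tasks. With only 10,000 post-training samples, Ke-Omni-R achieves state-of-the-art performance on the MMAU Test-mini and Test benchmarks. Key insights from its development include: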

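  • GRPO algorithm: GRPO significantly improves the already strong base model (Qwen2.5-Omni-7B) and generalizes well even to the speech domain, which was not seen during training.
  • Thinking process: Incorporating a concise thinking process (fewer than 50 words) plays a crucial role in improving reasoning ability.
  • KL divergence: Using KL divergence during GRPO training yields a slight improvement.
  • Domain ratio vs. data volume: Domain diversity matters more than data volume. We used only 10,000 samples: 5,000 randomly selected from AVQA and another 5,000 from MusicBench.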
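
The first three insights can be illustrated with a minimal Python sketch. It is not the code from the repository. The `<think>`/`<answer>` tag format, the reward values, and the names `reward`, `group_relative_advantages`, and `kl_penalty` are assumptions made for illustration. The advantage normalization and the KL estimator follow the general GRPO formulation (reward minus group mean, divided by group standard deviation; ratio minus log-ratio minus one).

```python
import math
import re
import statistics

# Assumed output format: <think>...</think><answer>...</answer>.
# Tag names and reward values are hypothetical, chosen for illustration.
PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)
MAX_THINK_WORDS = 50  # "fewer than 50 words" from the text above


def reward(completion: str, gold_answer: str) -> float:
    """Format reward (0 or 1) plus accuracy reward (0 or 1)."""
    match = PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # no parsable think/answer structure
    think, answer = match.group(1), match.group(2)
    format_ok = len(think.split()) < MAX_THINK_WORDS
    correct = answer.strip().lower() == gold_answer.strip().lower()
    return float(format_ok) + float(correct)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = ref / policy."""
    delta = logp_ref - logp_policy
    return math.exp(delta) - delta - 1.0


# One prompt, a group of three sampled completions, gold answer "dog".
group = [
    "<think>a dog barks twice</think><answer>dog</answer>",
    "<think>sounds like a cat</think><answer>cat</answer>",
    "dog",
]
rewards = [reward(c, "dog") for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's mean get a positive advantage and are reinforced, so no separate value model is needed. The KL term is non-negative and equals zero when the policy and the reference model assign the same log-probability to a token.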

Performance: Accuracies (%)↑ on MMAU Test-mini and Test benchmark

| Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
|---|---|---|---|---|---|---|---|---|---|
| Human* | - | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
| Gemini Pro 2.0 Flash | Direct Inference* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
| Audio Flamingo 2 | Direct Inference* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
| GPT4o + Strong Cap. | Direct Inference* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
| Llama-3-8B-Instruct + Strong Cap. | Direct Inference* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
| Qwen2-Audio-7B-Instruct | Direct Inference* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
| SALMONN | Direct Inference* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
| Audio-Reasoner (Qwen2-Audio-7B-Instruct) | [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
| Audio-CoT (Qwen2-Audio-7B-Instruct) | [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
| R1-AQA (Qwen2-Audio-7B-Instruct) | [3] | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
| Qwen2.5-Omni-7B | [4] | 67.87 | - | 69.16 | - | 59.76 | - | 65.60 | - |
| Qwen2.5-Omni-3B | [4] | 70.27 | - | 60.48 | - | 59.16 | - | 63.30 | - |
| Ke-Omni-R-3B (Qwen2.5-Omni-3B) | GRPO w/ think (ours) | 72.37 | 71.87 | 65.57 | 59.60 | 64.26 | 64.17 | 67.40 | 65.17 |
| Ke-Omni-R (Qwen2.5-Omni-7B) | GRPO w/o think (ours) | 69.67 | 70.57 | 67.66 | 64.00 | 66.37 | 67.17 | 67.90 | 67.24 |
| Ke-Omni-R (Qwen2.5-Omni-7B) | GRPO w/ think (ours) | 69.37 | 71.90 | 69.46 | 67.13 | 67.87 | 67.10 | 68.90 | 68.71 |

Performance: CER/WER (%)↓ on ASR benchmark

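| Model | Method | WenetSpeech test-net | WenetSpeech test-meeting | LibriSpeech test-clean | LibriSpeech test-other |
|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | [4] | 6.3 | 8.1 | 2.2 | 4.5 |
| Qwen2.5-Omni-7B | [4] | 5.9 | 7.7 | 1.8 | 3.4 |
| Ke-Omni-3B | ours | 11.7 | 16.1 | 1.8 | 3.8 |
| Ke-Omni-7B | ours | 7.5 | 9.8 | 1.6 | 3.1 |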
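
For readers unfamiliar with the metric, the sketch below shows how CER and WER are conventionally computed: the edit distance between the reference and the hypothesis, divided by the reference length. It assumes CER (character level) applies to the Mandarin WenetSpeech sets and WER (word level) to the English LibriSpeech sets. The function names are hypothetical, and the text normalization that real evaluation scripts apply (casing, punctuation, number formats) is omitted.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens: substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: list[str], hyp_tokens: list[str]) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


english = wer("the cat sat down", "the cat sit down")  # 1 error in 4 words
mandarin = cer("今天天气好", "今天天汽好")                # 1 error in 5 characters
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.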
