# Awesome Speech LM Survey (A Survey of Speech Large Models)

- **GitHub: <https://github.com/ga642381/speech-trident/tree/master>**
- **GitHub: <https://github.com/dreamtheater123/Awesome-SpeechLM-Survey>**

In this repository, we study the following three key areas: **(1) representation learning, (2) neural codecs, and (3) language models**, which together drive the development of speech/audio large language models.

1. ⚡ **Speech representation models**: These models focus on learning structured representations of speech, which are then quantized into discrete speech tokens, often referred to as **semantic tokens** (see the tokenization sketch after the figures below).
2. ⚡ **Speech neural codec models**: These models learn discrete speech and audio tokens, often referred to as **acoustic tokens**, while maintaining good reconstruction quality and a low bitrate.
3. ⚡ **Speech large language models**: These models are trained on top of semantic and acoustic tokens with language-modeling objectives, and show strong capabilities on speech understanding and speech generation tasks.

![Speech Trident overview](http://139.9.1.231/wp-content/uploads/2024/11/Speech-Trident-v4.png)

![SpeechLM survey overview](http://139.9.1.231/wp-content/uploads/2025/06/image-46.png)
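To make points 1 and 3 concrete: a "semantic token" sequence is usually obtained by clustering frame-level features from a self-supervised encoder such as HuBERT with k-means and replacing each frame by its cluster index; a decoder-only LM then performs next-token prediction over those units. Below is a minimal sketch of that recipe. It assumes torchaudio's pretrained `HUBERT_BASE` bundle and scikit-learn; the file paths, feature layer, and cluster count are illustrative placeholders, and the tokenizers cataloged below differ in exactly these choices.

```python
# Hedged sketch: HuBERT features + k-means -> discrete "semantic tokens" (GSLM-style recipe).
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

def hubert_frames(path: str, layer: int = 6) -> torch.Tensor:
    """Frame-level features [T, D] (one frame per ~20 ms) for one mono audio file."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        layer_outputs, _ = hubert.extract_features(wav)  # list of per-layer [1, T, D] tensors
    return layer_outputs[layer][0]

# 1) Fit a codebook on features pooled from untranscribed speech (placeholder paths).
pool = torch.cat([hubert_frames(p) for p in ["spk1.wav", "spk2.wav"]]).numpy()
codebook = KMeans(n_clusters=200, n_init=10, random_state=0).fit(pool)

# 2) Tokenize a new utterance: this unit sequence is what a speech LM models autoregressively.
units = codebook.predict(hubert_frames("utterance.wav").numpy())
print(units[:20])  # e.g. [17 17 42 42 9 ...]
```

Acoustic tokens (point 2) come instead from neural codecs; a toy residual-vector-quantization sketch follows the tokenizer tables below.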
## Existing SpeechLMs

| Model | Title | Url |
|---|---|---|
| OpenAI Advanced Voice Mode | OpenAI Advanced Voice Mode | [Link](https://help.openai.com/en/articles/9617425-advanced-voice-mode-faq) |
| Claude Voice Mode | Claude Voice Mode | [Link](https://support.anthropic.com/en/articles/11101966-using-voice-mode-on-claude-mobile-apps) |
| MindGPT-4o-Audio | 理想同学 MindGPT-4o-Audio real-time voice dialogue large model released | [Link](https://mp.weixin.qq.com/s?__biz=MzkyNzc3ODYzMQ==&mid=2247483808&idx=1&sn=15b2d0fc5c415066e9e85a0e17fa4094&chksm=c313b6c42e4bac7f551e3ce6b314897e6c09b2829d202ae09b088c39b6208d14545221a82785&mpshare=1&scene=1&srcid=06157RruwKQJDuZvxSmt0ALH&sharer_shareinfo=4f156f5dabba628552a2429a555bca65&sharer_shareinfo_first=4f156f5dabba628552a2429a555bca65#rd) |
| VITA-Audio | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | [Link](https://arxiv.org/abs/2505.03739) |
| Voila | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | [Link](https://arxiv.org/abs/2505.02707) |
| Kimi-Audio | Kimi-Audio Technical Report | [Link](https://arxiv.org/abs/2504.18425) |
| Lyra | Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | [Link](https://arxiv.org/abs/2412.09501) |
| Flow-Omni | Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners | [Link](https://arxiv.org/abs/2412.04917) |
| NTPP | NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction | [Link](https://arxiv.org/abs/2506.00975) |
| Qwen2.5-Omni | Qwen2.5-Omni Technical Report | [Link](https://arxiv.org/abs/2503.20215) |
| CSM | Conversational Speech Generation Model | [Link](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) |
| Minmo | MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | [Link](https://arxiv.org/abs/2501.06282) |
| Slamming | Slamming: Training a Speech Language Model on One GPU in a Day | [Link](https://arxiv.org/abs/2502.15814) |
| VITA-1.5 | VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | [Link](https://arxiv.org/abs/2501.01957) |
| Baichuan-Audio | Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction | [Link](https://arxiv.org/abs/2502.17239) |
| Step-Audio | Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction | [Link](https://arxiv.org/abs/2502.11946) |
| MiniCPM-o | A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone | [Link](https://github.com/OpenBMB/MiniCPM-o) |
| SyncLLM | Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents | [Link](https://arxiv.org/abs/2409.15594) |
| OmniFlatten | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | [Link](https://arxiv.org/abs/2410.17799) |
| SLAM-Omni | SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training | [Link](https://arxiv.org/abs/2412.15649) |
| GLM-4-Voice | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | [Link](https://arxiv.org/abs/2412.02612) |
| – | Scaling Speech-Text Pre-training with Synthetic Interleaved Data | [Link](http://arxiv.org/abs/2411.17607) |
| SALMONN-omni | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | [Link](http://arxiv.org/abs/2411.18138) |
| Mini-Omni2 | Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | [Link](http://arxiv.org/abs/2410.11190) |
| Uniaudio | Uniaudio: An audio foundation model toward universal audio generation | [Link](https://arxiv.org/abs/2310.00704) |
| Parrot | Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers | [Link](https://openreview.net/forum?id=Ttndg2Jl5F) |
| Moshi | Moshi: a speech-text foundation model for real-time dialogue | [Link](https://kyutai.org/Moshi.pdf) |
| Freeze-Omni | Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | [Link](http://arxiv.org/abs/2411.00774) |
| EMOVA | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | [Link](http://arxiv.org/abs/2409.18042) |
| IntrinsicVoice | IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities | [Link](http://arxiv.org/abs/2410.08035) |
| LSLM | Language Model Can Listen While Speaking | [Link](http://arxiv.org/abs/2408.02622) |
| SpiRit-LM | SpiRit-LM: Interleaved Spoken and Written Language Model | [Link](http://arxiv.org/abs/2402.05755) |
| SpeechGPT-Gen | SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation | [Link](https://arxiv.org/abs/2401.13527v2) |
| Spectron | Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | [Link](https://openreview.net/forum?id=izrOLJov5y) |
| SUTLM | Toward Joint Language Modeling for Speech Units and Text | [Link](http://arxiv.org/abs/2310.08715) |
| tGSLM | Generative Spoken Language Model based on continuous word-sized audio tokens | [Link](https://arxiv.org/abs/2310.05224v1) |
| LauraGPT | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | [Link](https://arxiv.org/abs/2310.04673v4) |
| VoxtLM | VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks | [Link](https://ieeexplore.ieee.org/abstract/document/10447112/?casa_token=rNHOTa7BbZMAAAAA:3Dk4RlgUcRbvDIewE9uUk-wk5D_0f2zm1z4hGgG1DSMkiH-KZwk7AVs5Z8PVMetvCKxFdV1C9o0) |
| VITA | VITA: Towards Open-Source Interactive Omni Multimodal LLM | [Link](https://arxiv.org/abs/2408.05211) |
| FunAudioLLM | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | [Link](https://arxiv.org/abs/2407.04051) |
| Voicebox | Voicebox: Text-guided multilingual universal speech generation at scale | [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/2d8911db9ecedf866015091b28946e15-Abstract-Conference.html) |
| LLaMA-Omni | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | [Link](https://arxiv.org/abs/2409.06666) |
| Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | [Link](https://arxiv.org/abs/2408.16725) |
| TWIST | Textually pretrained speech language models | [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/c859b99b5d717c9035e79d43dfd69435-Abstract-Conference.html) |
| GPST | Generative pre-trained speech language model with efficient hierarchical transformer | [Link](https://aclanthology.org/2024.acl-long.97) |
| AudioPaLM | AudioPaLM: A Large Language Model That Can Speak and Listen | [Link](http://arxiv.org/abs/2306.12925) |
| VioLA | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | [Link](http://arxiv.org/abs/2305.16107) |
| SpeechGPT | Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities | [Link](https://arxiv.org/abs/2305.11000) |
| dGSLM | Generative spoken dialogue language modeling | [Link](https://direct.mit.edu/tacl/article-abstract/doi/10.1162/tacl_a_00545/115240) |
| pGSLM | Text-Free Prosody-Aware Generative Spoken Language Modeling | [Link](http://arxiv.org/abs/2109.03264) |
| GSLM | On generative spoken language modeling from raw audio | [Link](https://direct.mit.edu/tacl/article-abstract/doi/10.1162/tacl_a_00430/108611) |

## SpeechLM Tokenizers

### Semantic Tokenizers
| Name | Title | Url |
|---|---|---|
| Whisper | Robust Speech Recognition via Large-Scale Weak Supervision | [Link](https://arxiv.org/abs/2212.04356) |
| CosyVoice | CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | [Link](https://arxiv.org/abs/2407.05407) |
| Google USM | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | [Link](https://arxiv.org/abs/2303.01037) |
| WavLM | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | [Link](https://arxiv.org/abs/2110.13900) |
| HuBERT | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | [Link](https://arxiv.org/abs/2106.07447) |
| W2v-bert | W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | [Link](https://arxiv.org/abs/2108.06209) |
| Wav2vec 2.0 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | [Link](https://arxiv.org/abs/2006.11477) |

### Acoustic Tokenizers

| Name | Title | Url |
|---|---|---|
| WavTokenizer | WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | [Link](https://arxiv.org/abs/2408.16532) |
| SNAC | SNAC: Multi-Scale Neural Audio Codec | [Link](https://arxiv.org/abs/2410.14411) |
| Encodec | High Fidelity Neural Audio Compression | [Link](https://arxiv.org/abs/2210.13438) |
| SoundStream | SoundStream: An End-to-End Neural Audio Codec | [Link](https://arxiv.org/abs/2107.03312) |

### Mixed Tokenizers

| Name | Title | Url |
|---|---|---|
| SpeechTokenizer | SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | [Link](https://arxiv.org/abs/2308.16692) |
| Mimi | Moshi: a speech-text foundation model for real-time dialogue | [Link](https://arxiv.org/abs/2410.00037) |
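The acoustic and mixed tokenizers listed above are neural codecs built around residual vector quantization (RVQ): each quantizer stage encodes the residual left by the previous stage, so one frame of encoder output becomes a small stack of codebook indices. The toy numpy sketch below shows only that quantization step, with random codebooks and made-up sizes; real codecs such as SoundStream or EnCodec learn the codebooks jointly with an encoder/decoder and reconstruction/adversarial losses.

```python
# Toy residual vector quantization (RVQ): purely illustrative, random codebooks.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_quantizers = 8, 16, 4
codebooks = rng.normal(size=(n_quantizers, codebook_size, dim))

def rvq_encode(frame):
    """Return one codebook index per quantizer stage for a single latent frame."""
    indices, residual = [], frame.copy()
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code to the residual
        indices.append(idx)
        residual = residual - cb[idx]                              # quantize what is left over
    return indices

def rvq_decode(indices):
    return sum(cb[i] for cb, i in zip(codebooks, indices))         # sum the selected codes

frame = rng.normal(size=dim)          # one latent frame from a (hypothetical) codec encoder
codes = rvq_encode(frame)             # these indices are the "acoustic tokens" for this frame
print(codes, round(float(np.linalg.norm(frame - rvq_decode(codes))), 3))
```

With learned codebooks the reconstruction error shrinks as more quantizer stages are added, which is how these codecs trade bitrate for fidelity.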
## Popular Training Datasets

| Dataset | Type | Phase | Hours | Year |
|---|---|---|---|---|
| [LibriSpeech](https://www.openslr.org/12) | ASR | Pre-Training | 1k | 2015 |
| [Multilingual LibriSpeech](https://www.openslr.org/94/) | ASR | Pre-Training | 50.5k | 2020 |
| [LibriLight](https://github.com/facebookresearch/libri-light) | ASR | Pre-Training | 60k | 2019 |
| [People dataset](https://github.com/mlcommons/peoples-speech) | ASR | Pre-Training | 30k | 2021 |
| [VoxPopuli](https://github.com/facebookresearch/voxpopuli) | ASR | Pre-Training | 1.6k | 2021 |
| [Gigaspeech](https://github.com/SpeechColab/GigaSpeech) | ASR | Pre-Training | 40k | 2021 |
| [Common Voice](https://commonvoice.mozilla.org/zh-CN) | ASR | Pre-Training | 2.5k | 2019 |
| [VCTK](https://paperswithcode.com/dataset/voice-bank-demand) | ASR | Pre-Training | 0.3k | 2017 |
| [WenetSpeech](https://wenet.org.cn/WenetSpeech/) | ASR | Pre-Training | 22k | 2022 |
| [LibriTTS](https://www.openslr.org/60/) | TTS | Pre-Training | 0.6k | 2019 |
| [CoVoST2](https://github.com/facebookresearch/covost) | S2TT | Pre-Training | 2.8k | 2020 |
| [CVSS](https://github.com/google-research-datasets/cvss) | S2ST | Pre-Training | 1.9k | 2022 |
| [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) | Speaker Identification | Pre-Training | 0.4k | 2017 |
| [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) | Speaker Identification | Pre-Training | 2.4k | 2018 |
| [Spotify Podcasts](https://podcastsdataset.byspotify.com/) | Podcast | Pre-Training | 47k | 2020 |
| [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) | Telephone conversation | Pre-Training | 2k | 2004 |
| [SpeechInstruct](https://huggingface.co/datasets/fnlp/SpeechInstruct) | Instruction-following | Instruction-Tuning | – | 2023 |
| InstructS2S-200K | Instruction-following | Instruction-Tuning | – | 2024 |
| [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K) | Instruction-following | Instruction-Tuning | – | 2024 |

## Evaluation Benchmarks
| Name | Eval Type | # Tasks | Audio Type | I/O |
|---|---|---|---|---|
| ABX | Representation | 1 | Speech | A→− |
| sWUGGY | Linguistic | 1 | Speech | A→− |
| sBLIMP | Linguistic | 1 | Speech | A→− |
| [sStoryCloze](https://github.com/slp-rl/SpokenStoryCloze) | Linguistic | 1 | Speech | A/T→− |
| [STSP](https://github.com/facebookresearch/spiritlm/blob/main/spiritlm/eval/README.md) | Paralinguistic | 1 | Speech | A/T→A/T |
| [MMAU](https://github.com/apple/axlearn/tree/main/docs/research/mmau) | Downstream | 27 | Speech, Sound, Music | A→T |
| [Audiobench](https://github.com/AudioLLMs/AudioBench) | Downstream | 8 | Speech, Sound | A→T |
| [AIR-Bench](https://github.com/OFA-Sys/AIR-Bench) | Downstream | 20 | Speech, Sound, Music | A→T |
| [SD-Eval](https://github.com/amphionspace/SD-Eval) | Downstream | 4 | Speech | A→T |
| [SUPERB](https://huggingface.co/datasets/s3prl/superb) | Downstream | 10 | Speech | A→T |
| [Dynamic-SUPERB](https://github.com/dynamic-superb/dynamic-superb) | Downstream | 180 | Speech, Sound, Music | A→T |
| [SALMON](https://huggingface.co/datasets/slprl/SALMon) | Downstream | 8 | Speech | A→− |
| [VoiceBench](https://github.com/matthewcym/voicebench) | Downstream | 8 | Speech | A→A |
| [VoxEval](https://github.com/dreamtheater123/VoxEval) | Downstream | 56 | Speech | A→A |

## 🔱 Speech/Audio Language Models

| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-11 | — | Building a Taiwanese Mandarin Spoken Language Model: A First Attempt | [Paper](https://arxiv.org/abs/2411.07111) |
| 2024-11 | Ultravox | Ultravox: An open-weight alternative to GPT-4o Realtime | [Blog](https://www.ultravox.ai/blog/ultravox-an-open-weight-alternative-to-gpt-4o-realtime) |
| 2024-11 | hertz-dev | [blog](https://si.inc/hertz-dev/) | [GitHub](https://github.com/Standard-Intelligence/hertz-dev) |
| 2024-11 | Freeze-Omni | Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | [paper](https://arxiv.org/abs/2411.00774) |
| 2024-11 | Align-SLM | Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback | [paper](https://arxiv.org/pdf/2411.01834) |
| 2024-10 | Ichigo | Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | [paper](https://arxiv.org/abs/2410.15316), [code](https://github.com/homebrewltd/ichigo) |
| 2024-10 | OmniFlatten | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | [paper](https://arxiv.org/abs/2410.17799v1) |
| 2024-10 | GPT-4o | GPT-4o System Card | [paper](https://arxiv.org/pdf/2410.21276) |
| 2024-10 | Baichuan-OMNI | Baichuan-Omni Technical Report | [paper](https://arxiv.org/abs/2410.08565) |
| 2024-10 | GLM-4-Voice | GLM-4-Voice | [GitHub](https://github.com/THUDM/GLM-4-Voice) |
| 2024-10 | — | Roadmap towards Superhuman Speech Understanding using Large Language Models | [paper](https://arxiv.org/abs/2410.13268) |
| 2024-10 | SALMONN-OMNI | SALMONN-OMNI: A SPEECH UNDERSTANDING AND GENERATION LLM IN A CODEC-FREE FULL-DUPLEX FRAMEWORK | [paper](https://openreview.net/attachment?id=eJpI20hzWf&name=pdf) |
| 2024-10 | Mini-Omni 2 | Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | [paper](https://arxiv.org/abs/2410.11190) |
| 2024-10 | HALL-E | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | [paper](https://openreview.net/forum?id=868masI331) |
| 2024-10 | SyllableLM | SyllableLM: Learning Coarse Semantic Units for Speech Language Models | [paper](https://arxiv.org/html/2410.04029v1) |
| 2024-09 | Moshi | Moshi: a speech-text foundation model for real-time dialogue | [paper](https://kyutai.org/Moshi.pdf) |
| 2024-09 | Takin AudioLLM | Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models | [paper](https://arxiv.org/abs/2409.12139) |
| 2024-09 | FireRedTTS | FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications | [paper](https://arxiv.org/html/2409.03283v1) |
| 2024-09 | LLaMA-Omni | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | [paper](https://arxiv.org/abs/2409.06666) |
| 2024-09 | MaskGCT | MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | [paper](https://arxiv.org/abs/2409.00750v1) |
| 2024-09 | SSR-Speech | SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | [paper](https://arxiv.org/abs/2409.07556) |
| 2024-09 | MoWE-Audio | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | [paper](https://arxiv.org/pdf/2409.06635) |
| 2024-08 | Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | [paper](https://arxiv.org/abs/2408.16725) |
| 2024-08 | Make-A-Voice 2 | Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner | [paper](https://aclanthology.org/2024.acl-long.589/) |
| 2024-08 | LSLM | Language Model Can Listen While Speaking | [paper](https://arxiv.org/abs/2408.02622) |
| 2024-06 | SimpleSpeech | SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models | [paper](https://arxiv.org/abs/2406.02328) |
| 2024-06 | UniAudio 1.5 | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | [paper](https://arxiv.org/abs/2406.10056) |
| 2024-06 | VALL-E R | VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | [paper](https://arxiv.org/abs/2406.07855) |
| 2024-06 | VALL-E 2 | VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | [paper](https://arxiv.org/abs/2406.05370) |
| 2024-06 | GPST | Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer | [paper](https://arxiv.org/abs/2406.00976) |
| 2024-04 | CLaM-TTS | CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | [paper](https://arxiv.org/abs/2404.02781) |
| 2024-04 | RALL-E | RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | [paper](https://arxiv.org/abs/2404.03204) |
| 2024-04 | WavLLM | WavLLM: Towards Robust and Adaptive Speech Large Language Model | [paper](https://arxiv.org/abs/2404.00656) |
| 2024-02 | MobileSpeech | MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | [paper](https://arxiv.org/abs/2402.09378) |
| 2024-02 | SLAM-ASR | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | [paper](https://arxiv.org/abs/2402.08846) |
| 2024-02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | [paper](https://arxiv.org/abs/2402.12226) |
| 2024-02 | SpiRit-LM | SpiRit-LM: Interleaved Spoken and Written Language Model | [paper](https://arxiv.org/abs/2402.05755) |
| 2024-02 | USDM | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | [paper](https://arxiv.org/abs/2402.05706) |
| 2024-02 | BAT | BAT: Learning to Reason about Spatial Sounds with Large Language Models | [paper](https://arxiv.org/abs/2402.01591) |
| 2024-02 | Audio Flamingo | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | [paper](https://arxiv.org/abs/2402.01831) |
| 2024-02 | Text Description to speech | Natural language guidance of high-fidelity text-to-speech with synthetic annotations | [paper](https://arxiv.org/abs/2402.01912) |
| 2024-02 | GenTranslate | GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | [paper](https://arxiv.org/abs/2402.06894) |
| 2024-02 | Base-TTS | BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | [paper](https://arxiv.org/abs/2402.08093) |
| 2024-02 | — | It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | [paper](https://arxiv.org/abs/2402.05457) |
| 2024-01 | — | Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | [paper](https://arxiv.org/abs/2401.10446) |
| 2024-01 | ELLA-V | ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering | [paper](https://arxiv.org/abs/2401.07333) |
| 2023-12 | Seamless | Seamless: Multilingual Expressive and Streaming Speech Translation | [paper](https://arxiv.org/abs/2312.05187) |
| 2023-11 | Qwen-Audio | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | [paper](https://arxiv.org/abs/2311.07919) |
| 2023-10 | LauraGPT | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | [paper](https://arxiv.org/abs/2310.04673) |
| 2023-10 | SALMONN | SALMONN: Towards Generic Hearing Abilities for Large Language Models | [paper](https://arxiv.org/abs/2310.13289) |
| 2023-10 | UniAudio | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | [paper](https://arxiv.org/abs/2310.00704) |
| 2023-10 | Whispering LLaMA | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition | [paper](https://arxiv.org/abs/2310.06434) |
| 2023-09 | VoxtLM | Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks | [paper](https://arxiv.org/abs/2309.07937) |
| 2023-09 | LTU-AS | Joint Audio and Speech Understanding | [paper](https://arxiv.org/abs/2309.14405) |
| 2023-09 | SLM | SLM: Bridge the thin gap between speech and text foundation models | [paper](https://arxiv.org/abs/2310.00230) |
| 2023-09 | — | Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting | [paper](https://arxiv.org/abs/2309.15649) |
| 2023-08 | SpeechGen | SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts | [paper](https://arxiv.org/abs/2306.02207) |
| 2023-08 | SpeechX | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | [paper](https://arxiv.org/abs/2308.06873) |
| 2023-08 | LLaSM | Large Language and Speech Model | [paper](https://arxiv.org/abs/2308.15930) |
| 2023-08 | SeamlessM4T | Massively Multilingual & Multimodal Machine Translation | [paper](https://arxiv.org/abs/2308.11596) |
| 2023-07 | Speech-LLaMA | On decoder-only architecture for speech-to-text and large language model integration | [paper](https://arxiv.org/abs/2307.03917) |
| 2023-07 | LLM-ASR(temp.) | Prompting Large Language Models with Speech Recognition Abilities | [paper](https://arxiv.org/abs/2307.11795) |
| 2023-06 | AudioPaLM | AudioPaLM: A Large Language Model That Can Speak and Listen | [paper](https://arxiv.org/abs/2306.12925) |
| 2023-05 | Make-A-Voice | Make-A-Voice: Unified Voice Synthesis With Discrete Representation | [paper](https://arxiv.org/abs/2305.19269) |
| 2023-05 | Spectron | Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | [paper](https://arxiv.org/abs/2305.15255) |
| 2023-05 | TWIST | Textually Pretrained Speech Language Models | [paper](https://arxiv.org/abs/2305.13009) |
| 2023-05 | Pengi | Pengi: An Audio Language Model for Audio Tasks | [paper](https://arxiv.org/abs/2305.11834) |
| 2023-05 | SoundStorm | Efficient Parallel Audio Generation | [paper](https://arxiv.org/abs/2305.09636) |
| 2023-05 | LTU | Joint Audio and Speech Understanding | [paper](https://arxiv.org/abs/2305.10790) |
| 2023-05 | SpeechGPT | Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | [paper](https://arxiv.org/abs/2305.11000) |
| 2023-05 | VioLA | Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | [paper](https://arxiv.org/abs/2305.16107) |
| 2023-05 | X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | [paper](https://arxiv.org/abs/2305.04160) |
| 2023-03 | Google USM | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | [paper](https://arxiv.org/abs/2303.01037) |
| 2023-03 | VALL-E X | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | [paper](https://arxiv.org/abs/2303.03926) |
| 2023-02 | SPEAR-TTS | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | [paper](https://arxiv.org/abs/2302.03540) |
| 2023-01 | VALL-E | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | [paper](https://arxiv.org/abs/2301.02111) |
| 2022-12 | Whisper | Robust Speech Recognition via Large-Scale Weak Supervision | [paper](https://arxiv.org/abs/2212.04356) |
| 2022-10 | AudioGen | AudioGen: Textually Guided Audio Generation | [paper](https://arxiv.org/abs/2209.15352) |
| 2022-09 | AudioLM | AudioLM: a Language Modeling Approach to Audio Generation | [paper](https://arxiv.org/abs/2209.03143) |
| 2022-05 | Wav2Seq | Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages | [paper](https://arxiv.org/abs/2205.01086) |
| 2022-04 | Unit mBART | Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation | [paper](https://arxiv.org/abs/2204.02967) |
| 2022-03 | d-GSLM | Generative Spoken Dialogue Language Modeling | [paper](https://arxiv.org/abs/2203.16502) |
| 2021-10 | SLAM | SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training | [paper](https://arxiv.org/abs/2110.10329) |
| 2021-09 | p-GSLM | Text-Free Prosody-Aware Generative Spoken Language Modeling | [paper](https://arxiv.org/abs/2109.03264) |
| 2021-02 | GSLM | Generative Spoken Language Modeling from Raw Audio | [paper](https://arxiv.org/abs/2102.01192) |

## 🔱 Speech/Audio Codec Models

| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-11 | PyramidCodec | PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain | [paper](https://aclanthology.org/2024.findings-emnlp.246.pdf) |
| 2024-11 | UniCodec | Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations | [paper](https://ieeexplore.ieee.org/abstract/document/10738376?casa_token=eWtmSXEr4AEAAAAA:FzYuQIESJ2LXwl9smJQe3RakpDUFuJ-AS0d39ZDlhsI0tBVX_8P7hu4a59yZezz7hpYd3VomUDo) |
| 2024-11 | SimVQ | Addressing Representation Collapse in Vector Quantized Models with One Linear Layer | [paper](https://arxiv.org/pdf/2411.02038) |
| 2024-11 | MDCTCodec | MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios | [paper](https://arxiv.org/pdf/2411.00464) |
| 2024-10 | APCodec+ | APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm | [paper](https://arxiv.org/pdf/2410.22807) |
| 2024-10 | – | A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation | [paper](https://arxiv.org/pdf/2410.22448) |
| 2024-10 | SNAC | SNAC: Multi-Scale Neural Audio Codec | [paper](https://arxiv.org/pdf/2410.14411) |
| 2024-10 | LSCodec | LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec | [paper](https://arxiv.org/abs/2410.15764) |
| 2024-10 | Co-design for codec and codec-LM | TOWARDS CODEC-LM CO-DESIGN FOR NEURAL CODEC LANGUAGE MODELS | [paper](https://openreview.net/pdf?id=KCVv3tICvp) |
| 2024-10 | VChangeCodec | VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | [paper](https://openreview.net/forum?id=qDSfOQBrOD) |
| 2024-10 | DC-Spin | DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models | [paper](https://openreview.net/forum?id=OW332Wh9S5) |
| 2024-10 | TAAE | Scaling Transformers for Low-Bitrate High-Quality Speech Coding | [paper](https://openreview.net/pdf?id=4YpMrGfldX) |
| 2024-10 | DM-Codec | DM-Codec: Distilling Multimodal Representations for Speech Tokenization | [paper](https://openreview.net/forum?id=UFwefiypla) |
| 2024-09 | Mimi | Moshi: a speech-text foundation model for real-time dialogue | [paper](https://kyutai.org/Moshi.pdf) |
| 2024-09 | NDVQ | NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization | [paper](https://arxiv.org/pdf/2409.12717) |
| 2024-09 | SoCodec | SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis | [paper](https://arxiv.org/pdf/2409.00933) |
| 2024-09 | BigCodec | BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec | [paper](https://arxiv.org/abs/2409.05377) |
| 2024-08 | X-Codec | Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | [paper](https://arxiv.org/pdf/2408.17175) |
| 2024-08 | WavTokenizer | WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | [paper](https://arxiv.org/abs/2408.16532) |
| 2024-07 | Super-Codec | SuperCodec: A Neural Speech Codec with Selective Back-Projection Network | [paper](https://arxiv.org/abs/2407.20530) |
| 2024-07 | dMel | dMel: Speech Tokenization made Simple | [paper](https://arxiv.org/abs/2407.15835) |
| 2024-06 | CodecFake | CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems | [paper](https://arxiv.org/abs/2406.07237) |
| 2024-06 | Single-Codec | Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation | [paper](https://www.arxiv.org/abs/2406.07422) |
| 2024-06 | SQ-Codec | SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models | [paper](https://arxiv.org/abs/2406.02328) |
| 2024-06 | PQ-VAE | Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder | [paper](https://arxiv.org/abs/2406.02940) |
| 2024-06 | LLM-Codec | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | [paper](https://arxiv.org/abs/2406.10056) |
| 2024-05 | HILCodec | HILCodec: High Fidelity and Lightweight Neural Audio Codec | [paper](https://arxiv.org/abs/2405.04752) |
| 2024-04 | SemantiCodec | SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound | [paper](https://arxiv.org/abs/2405.00233) |
| 2024-04 | PromptCodec | PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders | [paper](https://arxiv.org/abs/2404.02702) |
| 2024-04 | ESC | ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers | [paper](https://arxiv.org/abs/2404.19441) |
| 2024-03 | FACodec | NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | [paper](https://arxiv.org/abs/2403.03100) |
| 2024-02 | AP-Codec | APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding | [paper](https://arxiv.org/abs/2402.10533) |
| 2024-02 | Language-Codec | Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models | [paper](https://arxiv.org/abs/2402.12208) |
| 2024-01 | ScoreDec | ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter | [paper](https://arxiv.org/abs/2401.12160) |
| 2023-11 | HierSpeech++ | HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis | [paper](https://arxiv.org/abs/2311.12454) |
| 2023-10 | TiCodec | FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES | [paper](https://arxiv.org/pdf/2310.00014) |
| 2023-09 | RepCodec | RepCodec: A Speech Representation Codec for Speech Tokenization | [paper](https://arxiv.org/abs/2309.00169) |
| 2023-09 | FunCodec | FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | [paper](https://arxiv.org/abs/2309.07405) |
| 2023-08 | SpeechTokenizer | Speechtokenizer: Unified speech tokenizer for speech large language models | [paper](https://arxiv.org/abs/2308.16692) |
| 2023-06 | VOCOS | VOCOS: CLOSING THE GAP BETWEEN TIME-DOMAIN AND FOURIER-BASED NEURAL VOCODERS FOR HIGH-QUALITY AUDIO SYNTHESIS | [paper](https://arxiv.org/pdf/2306.00814) |
| 2023-06 | Descript-audio-codec | High-Fidelity Audio Compression with Improved RVQGAN | [paper](https://arxiv.org/abs/2306.06546) |
| 2023-05 | AudioDec | Audiodec: An open-source streaming highfidelity neural audio codec | [paper](https://arxiv.org/abs/2305.16608) |
| 2023-05 | HiFi-Codec | Hifi-codec: Group-residual vector quantization for high fidelity audio codec | [paper](https://arxiv.org/abs/2305.02765) |
| 2023-03 | LMCodec | LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models | [paper](https://arxiv.org/abs/2303.12984) |
| 2022-11 | Disen-TF-Codec | Disentangled Feature Learning for Real-Time Neural Speech Coding | [paper](https://arxiv.org/abs/2211.11960) |
| 2022-10 | EnCodec | High fidelity neural audio compression | [paper](https://arxiv.org/abs/2210.13438) |
| 2022-07 | S-TFNet | Cross-Scale Vector Quantization for Scalable Neural Speech Coding | [paper](https://arxiv.org/abs/2207.03067) |
| 2022-01 | TFNet | End-to-End Neural Speech Coding for Real-Time Communications | [paper](https://arxiv.org/abs/2201.09429) |
| 2021-07 | SoundStream | SoundStream: An End-to-End Neural Audio Codec | [paper](https://arxiv.org/abs/2107.03312) |

## Speech/Audio Representation Models
| Date | Model Name | Paper Title | Link |
|---|---|---|---|
| 2024-09 | NEST-RQ | NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | [paper](https://arxiv.org/pdf/2409.08680) |
| 2024-01 | EAT | Self-Supervised Pre-Training with Efficient Audio Transformer | [paper](https://arxiv.org/abs/2401.03497) |
| 2023-10 | MR-HuBERT | Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | [paper](https://arxiv.org/abs/2310.02720) |
| 2023-10 | SpeechFlow | Generative Pre-training for Speech with Flow Matching | [paper](https://arxiv.org/abs/2310.16338) |
| 2023-09 | WavLabLM | Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning | [paper](https://arxiv.org/abs/2309.15317) |
| 2023-08 | W2v-BERT 2.0 | Massively Multilingual & Multimodal Machine Translation | [paper](https://arxiv.org/abs/2308.11596) |
| 2023-07 | Whisper-AT | Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers | [paper](https://arxiv.org/abs/2307.03183) |
| 2023-06 | ATST | Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | [paper](https://arxiv.org/abs/2306.04186) |
| 2023-05 | SPIN | Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering | [paper](https://arxiv.org/abs/2305.11072) |
| 2023-05 | DinoSR | Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning | [paper](https://arxiv.org/abs/2305.10005) |
| 2023-05 | NFA | Self-supervised neural factor analysis for disentangling utterance-level speech representations | [paper](https://arxiv.org/abs/2305.08099) |
| 2022-12 | Data2vec 2.0 | Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | [paper](https://arxiv.org/abs/2212.07525) |
| 2022-12 | BEATs | Audio Pre-Training with Acoustic Tokenizers | [paper](https://arxiv.org/abs/2212.09058) |
| 2022-11 | MT4SSL | MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | [paper](https://arxiv.org/abs/2211.07321) |
| 2022-08 | DINO | Non-contrastive self-supervised learning of utterance-level speech representations | [paper](https://arxiv.org/abs/2208.05413) |
| 2022-07 | Audio-MAE | Masked Autoencoders that Listen | [paper](https://arxiv.org/abs/2207.06405) |
| 2022-04 | MAESTRO | Matched Speech Text Representations through Modality Matching | [paper](https://arxiv.org/abs/2204.03409) |
| 2022-03 | MAE-AST | Masked Autoencoding Audio Spectrogram Transformer | [paper](https://arxiv.org/abs/2203.16691) |
href=\"https:\/\/arxiv.org\/abs\/2203.16691\">paper<\/a><\/td><\/tr><tr><td>2022-03<\/td><td>LightHuBERT<\/td><td>Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2203.15610\">paper<\/a><\/td><\/tr><tr><td>2022-02<\/td><td>Data2vec<\/td><td>A General Framework for Self-supervised Learning in Speech, Vision and Language<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2202.03555\">paper<\/a><\/td><\/tr><tr><td>2021-10<\/td><td>WavLM<\/td><td>WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2110.13900\">paper<\/a><\/td><\/tr><tr><td>2021-08<\/td><td>W2v-BERT<\/td><td>Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2108.06209\">paper<\/a><\/td><\/tr><tr><td>2021-07<\/td><td>mHuBERT<\/td><td>Direct speech-to-speech translation with discrete units<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2107.05604\">paper<\/a><\/td><\/tr><tr><td>2021-06<\/td><td>HuBERT<\/td><td>Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2106.07447\">paper<\/a><\/td><\/tr><tr><td>2021-03<\/td><td>BYOL-A<\/td><td>Self-Supervised Learning for General-Purpose Audio Representation<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2103.06695\">paper<\/a><\/td><\/tr><tr><td>2020-12<\/td><td>DeCoAR2.0<\/td><td>DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2012.06659\">paper<\/a><\/td><\/tr><tr><td>2020-07<\/td><td>TERA<\/td><td>TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2007.06028\">paper<\/a><\/td><\/tr><tr><td>2020-06<\/td><td>Wav2vec2.0<\/td><td>wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2006.11477\">paper<\/a><\/td><\/tr><tr><td>2019-10<\/td><td>APC<\/td><td>Generative Pre-Training for Speech with Autoregressive Predictive Coding<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/1910.12607\">paper<\/a><\/td><\/tr><tr><td>2018-07<\/td><td>CPC<\/td><td>Representation Learning with Contrastive Predictive Coding<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/1807.03748\">paper<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2>\ud83d\udd31 Related Repository<a href=\"https:\/\/github.com\/ga642381\/speech-trident\/tree\/master?tab=readme-ov-file#trident-related-repository\"><\/a><\/h2>\n\n\n\n<ul><li><a href=\"https:\/\/github.com\/liusongxiang\/Large-Audio-Models\"><strong><em>https:\/\/github.com\/liusongxiang\/Large-Audio-Models<\/em><\/strong><\/a><\/li><li><strong><em><a href=\"https:\/\/github.com\/kuan2jiu99\/Awesome-Speech-Generation\">https:\/\/github.com\/kuan2jiu99\/Awesome-Speech-Generation<\/a><\/em><\/strong><\/li><li><strong><em><a href=\"https:\/\/github.com\/ga642381\/Speech-Prompts-Adapters\">https:\/\/github.com\/ga642381\/Speech-Prompts-Adapters<\/a><\/em><\/strong><\/li><li><strong><em><a href=\"https:\/\/github.com\/voidful\/Codec-SUPERB\">https:\/\/github.com\/voidful\/Codec-SUPERB<\/a><\/em><\/strong><\/li><li><strong><em><a href=\"https:\/\/github.com\/huckiyang\/awesome-neural-reprogramming-prompting\">https:\/\/github.com\/huckiyang\/awesome-neural-reprogramming-prompting<\/a><\/em><\/strong><\/li><li><strong><em><a 
href=\"https:\/\/github.com\/dreamtheater123\/Awesome-SpeechLM-Survey\">https:\/\/github.com\/dreamtheater123\/Awesome-SpeechLM-Survey<\/a><\/em><\/strong><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>GitHub\uff1ahttps:\/\/github.com\/ga642381\/speech-trident\/tree\/ &hellip; <a href=\"http:\/\/139.9.1.231\/index.php\/2024\/11\/25\/awesome-speech-lm\/\" class=\"more-link\">\u7ee7\u7eed\u9605\u8bfb<span class=\"screen-reader-text\">Awesome Speech LM Survey-\u8bed\u97f3\u5927\u6a21\u578b\u7efc\u8ff0<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[21,40,4,9,38,34],"tags":[],"_links":{"self":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/22161"}],"collection":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/comments?post=22161"}],"version-history":[{"count":26,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/22161\/revisions"}],"predecessor-version":[{"id":27002,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/22161\/revisions\/27002"}],"wp:attachment":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/media?parent=22161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/categories?post=22161"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/tags?post=22161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}