Awesome Speech LM Survey: A Survey of Speech Large Language Models

GitHub: https://github.com/ga642381/speech-trident/tree/master
GitHub: https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

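This repository surveys three key areas that together drive the development of speech/audio large language models: (1) representation learning, (2) neural codecs, and (3) language models.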

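⚡ Speech Representation Models: These models focus on learning structured speech representations, which are then quantized into discrete speech tokens, often referred to as semantic tokens.
⚡ Speech Neural Codec Models: These models are designed to learn discrete tokens for speech and audio, often referred to as acoustic tokens, while maintaining good reconstruction quality and a low bitrate.
⚡ Speech Large Language Models: These models are trained on top of speech and acoustic tokens with a language modeling approach, and show strong capability on speech understanding and speech generation tasks.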
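
As a rough illustration of the two ideas above (quantizing continuous representations into discrete tokens, and the bitrate those tokens imply), here is a minimal sketch in plain Python. It is not taken from any of the listed models: the function names `quantize` and `token_bitrate`, the toy 2-D features, and the hand-written codebook are all hypothetical. Real systems typically learn the codebook (for example with k-means over encoder features, or with residual vector quantization inside a codec), and the bitrate formula assumes fixed-size codebooks with uniformly coded indices.

```python
import math


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook centroid.

    Nearest-centroid assignment with squared Euclidean distance; the
    codebook here is given by hand, not learned (illustrative assumption).
    """
    tokens = []
    for frame in frames:
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def token_bitrate(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second of a token stream, assuming every frame emits one
    index per codebook and each index costs log2(codebook_size) bits."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Toy 2-D "features" and a 3-entry codebook (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
print(tokens)

# Example setting: 50 frames/s, 8 codebooks, 1024 entries each.
print(token_bitrate(50, 8, 1024))
```

The discrete indices returned by `quantize` are what a language model would consume as its vocabulary; lowering the frame rate or the number of codebooks reduces the bitrate at the cost of reconstruction detail.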

1 Existing SpeechLMs

2 SpeechLM Tokenizers

2.1 Semantic Tokenizers

2.2 Acoustic Tokenizers

2.3 Mixed Tokenizers

3 Popular Training Datasets

4 Evaluation Benchmarks

5 🔱 Speech/Audio Language Models

6 🔱 Speech/Audio Codec Models

7 Speech/Audio Representation Models

8 🔱 Related Repository

8.1 Related Articles

Existing SpeechLMs

Model	Title	Url
OpenAI Advanced Voice Mode	OpenAI Advanced Voice Mode	Link
Claude Voice Mode	Claude Voice Mode	Link
MindGPT-4o-Audio	Lixiang Tongxue MindGPT-4o-Audio Real-Time Spoken Dialogue Large Model Released	Link
VITA-Audio	VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model	Link
Voila	Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play	Link
Kimi-Audio	Kimi-Audio Technical Report	Link
Lyra	Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition	Link
Flow-Omni	Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners	Link
NTPP	NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction	Link
Qwen2.5-Omni	Qwen2.5-Omni Technical Report	Link
CSM	Conversational Speech Generation Model	Link
MinMo	MinMo: A Multimodal Large Language Model for Seamless Voice Interaction	Link
Slamming	Slamming: Training a Speech Language Model on One GPU in a Day	Link
VITA-1.5	VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction	Link
Baichuan-Audio	Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction	Link
Step-Audio	Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction	Link
MiniCPM-o	A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone	Link
SyncLLM	Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents	Link
OmniFlatten	OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation	Link
SLAM-Omni	SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training	Link
GLM-4-Voice	GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot	Link
–	Scaling Speech-Text Pre-training with Synthetic Interleaved Data	Link
SALMONN-omni	SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation	Link
Mini-Omni2	Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities	Link
UniAudio	UniAudio: An Audio Foundation Model Toward Universal Audio Generation	Link
Parrot	Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers	Link
Moshi	Moshi: a speech-text foundation model for real-time dialogue	Link
Freeze-Omni	Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM	Link
EMOVA	EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions	Link
IntrinsicVoice	IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities	Link
LSLM	Language Model Can Listen While Speaking	Link
SpiRit-LM	SpiRit-LM: Interleaved Spoken and Written Language Model	Link
SpeechGPT-Gen	SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation	Link
Spectron	Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM	Link
SUTLM	Toward Joint Language Modeling for Speech Units and Text	Link
tGSLM	Generative Spoken Language Model based on continuous word-sized audio tokens	Link
LauraGPT	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Link
VoxtLM	VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks	Link
VITA	VITA: Towards Open-Source Interactive Omni Multimodal LLM	Link
FunAudioLLM	FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs	Link
Voicebox	Voicebox: Text-guided multilingual universal speech generation at scale	Link
LLaMA-Omni	LLaMA-Omni: Seamless Speech Interaction with Large Language Models	Link
Mini-Omni	Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	Link
TWIST	Textually pretrained speech language models	Link
GPST	Generative pre-trained speech language model with efficient hierarchical transformer	Link
AudioPaLM	AudioPaLM: A Large Language Model That Can Speak and Listen	Link
VioLA	VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	Link
SpeechGPT	SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	Link
dGSLM	Generative spoken dialogue language modeling	Link
pGSLM	Text-Free Prosody-Aware Generative Spoken Language Modeling	Link
GSLM	On generative spoken language modeling from raw audio	Link

SpeechLM Tokenizers

Semantic Tokenizers

Name	Title	Url
Whisper	Robust Speech Recognition via Large-Scale Weak Supervision	Link
CosyVoice	CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Link
Google USM	Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	Link
WavLM	WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing	Link
HuBERT	HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units	Link
W2v-BERT	W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training	Link
Wav2vec 2.0	wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	Link

Acoustic Tokenizers

Name	Title	Url
WavTokenizer	WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling	Link
SNAC	SNAC: Multi-Scale Neural Audio Codec	Link
Encodec	High Fidelity Neural Audio Compression	Link
SoundStream	SoundStream: An End-to-End Neural Audio Codec	Link

Mixed Tokenizers

Name	Title	Url
SpeechTokenizer	SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models	Link
Mimi	Moshi: a speech-text foundation model for real-time dialogue	Link

Popular Training Datasets

Dataset	Type	Phase	Hours	Year
LibriSpeech	ASR	Pre-Training	1k	2015
Multilingual LibriSpeech	ASR	Pre-Training	50.5k	2020
LibriLight	ASR	Pre-Training	60k	2019
People's Speech	ASR	Pre-Training	30k	2021
VoxPopuli	ASR	Pre-Training	1.6k	2021
Gigaspeech	ASR	Pre-Training	40k	2021
Common Voice	ASR	Pre-Training	2.5k	2019
VCTK	ASR	Pre-Training	0.3k	2017
WenetSpeech	ASR	Pre-Training	22k	2022
LibriTTS	TTS	Pre-Training	0.6k	2019
CoVoST2	S2TT	Pre-Training	2.8k	2020
CVSS	S2ST	Pre-Training	1.9k	2022
VoxCeleb	Speaker Identification	Pre-Training	0.4k	2017
VoxCeleb2	Speaker Identification	Pre-Training	2.4k	2018
Spotify Podcasts	Podcast	Pre-Training	47k	2020
Fisher	Telephone conversation	Pre-Training	2k	2004
SpeechInstruct	Instruction-following	Instruction-Tuning	–	2023
InstructS2S-200K	Instruction-following	Instruction-Tuning	–	2024
VoiceAssistant-400K	Instruction-following	Instruction-Tuning	–	2024

Evaluation Benchmarks

Name	Eval Type	# Tasks	Audio Type	I/O
ABX	Representation	1	Speech	A→−
sWUGGY	Linguistic	1	Speech	A→−
sBLIMP	Linguistic	1	Speech	A→−
sStoryCloze	Linguistic	1	Speech	A/T→−
STSP	Paralinguistic	1	Speech	A/T→A/T
MMAU	Downstream	27	Speech, Sound, Music	A→T
Audiobench	Downstream	8	Speech, Sound	A→T
AIR-Bench	Downstream	20	Speech, Sound, Music	A→T
SD-Eval	Downstream	4	Speech	A→T
SUPERB	Downstream	10	Speech	A→T
Dynamic-SUPERB	Downstream	180	Speech, Sound, Music	A→T
SALMON	Downstream	8	Speech	A→−
VoiceBench	Downstream	8	Speech	A→A
VoxEval	Downstream	56	Speech	A→A

🔱 Speech/Audio Language Models

Date	Model Name	Paper Title	Link
2024-11	—	Building a Taiwanese Mandarin Spoken Language Model: A First Attempt	Paper
2024-11	Ultravox	Ultravox: An open-weight alternative to GPT-4o Realtime	Blog
2024-11	hertz-dev	blog	GitHub
2024-11	Freeze-Omni	Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM	paper
2024-11	Align-SLM	Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback	paper
2024-10	Ichigo	Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant	paper, code
2024-10	OmniFlatten	OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation	paper
2024-10	GPT-4o	GPT-4o System Card	paper
2024-10	Baichuan-OMNI	Baichuan-Omni Technical Report	paper
2024-10	GLM-4-Voice	GLM-4-Voice	GitHub
2024-10	—	Roadmap towards Superhuman Speech Understanding using Large Language Models	paper
2024-10	SALMONN-omni	SALMONN-omni: A Speech Understanding and Generation LLM in a Codec-free Full-duplex Framework	paper
2024-10	Mini-Omni 2	Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities	paper
2024-10	HALL-E	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	paper
2024-10	SyllableLM	SyllableLM: Learning Coarse Semantic Units for Speech Language Models	paper
2024-09	Moshi	Moshi: a speech-text foundation model for real-time dialogue	paper
2024-09	Takin AudioLLM	Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models	paper
2024-09	FireRedTTS	FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications	paper
2024-09	LLaMA-Omni	LLaMA-Omni: Seamless Speech Interaction with Large Language Models	paper
2024-09	MaskGCT	MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer	paper
2024-09	SSR-Speech	SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	paper
2024-09	MoWE-Audio	MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders	paper
2024-08	Mini-Omni	Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	paper
2024-08	Make-A-Voice 2	Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner	paper
2024-08	LSLM	Language Model Can Listen While Speaking	paper
2024-06	SimpleSpeech	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	paper
2024-06	UniAudio 1.5	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	paper
2024-06	VALL-E R	VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	paper
2024-06	VALL-E 2	VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	paper
2024-06	GPST	Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer	paper
2024-04	CLaM-TTS	CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech	paper
2024-04	RALL-E	RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	paper
2024-04	WavLLM	WavLLM: Towards Robust and Adaptive Speech Large Language Model	paper
2024-02	MobileSpeech	MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech	paper
2024-02	SLAM-ASR	An Embarrassingly Simple Approach for LLM with Strong ASR Capacity	paper
2024-02	AnyGPT	AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	paper
2024-02	SpiRit-LM	SpiRit-LM: Interleaved Spoken and Written Language Model	paper
2024-02	USDM	Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation	paper
2024-02	BAT	BAT: Learning to Reason about Spatial Sounds with Large Language Models	paper
2024-02	Audio Flamingo	Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities	paper
2024-02	Text Description to speech	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	paper
2024-02	GenTranslate	GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	paper
2024-02	Base-TTS	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	paper
2024-02	—	It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition	paper
2024-01	—	Large Language Models are Efficient Learners of Noise-Robust Speech Recognition	paper
2024-01	ELLA-V	ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering	paper
2023-12	Seamless	Seamless: Multilingual Expressive and Streaming Speech Translation	paper
2023-11	Qwen-Audio	Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models	paper
2023-10	LauraGPT	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	paper
2023-10	SALMONN	SALMONN: Towards Generic Hearing Abilities for Large Language Models	paper
2023-10	UniAudio	UniAudio: An Audio Foundation Model Toward Universal Audio Generation	paper
2023-10	Whispering LLaMA	Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition	paper
2023-09	VoxtLM	Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks	paper
2023-09	LTU-AS	Joint Audio and Speech Understanding	paper
2023-09	SLM	SLM: Bridge the thin gap between speech and text foundation models	paper
2023-09	—	Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting	paper
2023-08	SpeechGen	SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts	paper
2023-08	SpeechX	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	paper
2023-08	LLaSM	Large Language and Speech Model	paper
2023-08	SeamlessM4T	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Speech-LLaMA	On decoder-only architecture for speech-to-text and large language model integration	paper
2023-07	LLM-ASR(temp.)	Prompting Large Language Models with Speech Recognition Abilities	paper
2023-06	AudioPaLM	AudioPaLM: A Large Language Model That Can Speak and Listen	paper
2023-05	Make-A-Voice	Make-A-Voice: Unified Voice Synthesis With Discrete Representation	paper
2023-05	Spectron	Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM	paper
2023-05	TWIST	Textually Pretrained Speech Language Models	paper
2023-05	Pengi	Pengi: An Audio Language Model for Audio Tasks	paper
2023-05	SoundStorm	Efficient Parallel Audio Generation	paper
2023-05	LTU	Joint Audio and Speech Understanding	paper
2023-05	SpeechGPT	Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	paper
2023-05	VioLA	Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	paper
2023-05	X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	paper
2023-03	Google USM	Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	paper
2023-03	VALL-E X	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	paper
2023-02	SPEAR-TTS	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	paper
2023-01	VALL-E	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	paper
2022-12	Whisper	Robust Speech Recognition via Large-Scale Weak Supervision	paper
2022-10	AudioGen	AudioGen: Textually Guided Audio Generation	paper
2022-09	AudioLM	AudioLM: a Language Modeling Approach to Audio Generation	paper
2022-05	Wav2Seq	Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages	paper
2022-04	Unit mBART	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	paper
2022-03	d-GSLM	Generative Spoken Dialogue Language Modeling	paper
2021-10	SLAM	SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training	paper
2021-09	p-GSLM	Text-Free Prosody-Aware Generative Spoken Language Modeling	paper
2021-02	GSLM	Generative Spoken Language Modeling from Raw Audio	paper

🔱 Speech/Audio Codec Models

Date	Model Name	Paper Title	Link
2024-11	PyramidCodec	PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain	paper
2024-11	UniCodec	Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations	paper
2024-11	SimVQ	Addressing Representation Collapse in Vector Quantized Models with One Linear Layer	paper
2024-11	MDCTCodec	MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios	paper
2024-10	APCodec+	APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm	paper
2024-10	–	A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation	paper
2024-10	SNAC	SNAC: Multi-Scale Neural Audio Codec	paper
2024-10	LSCodec	LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec	paper
2024-10	Co-design for codec and codec-LM	Towards Codec-LM Co-design for Neural Codec Language Models	paper
2024-10	VChangeCodec	VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication	paper
2024-10	DC-Spin	DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models	paper
2024-10	TAAE	Scaling Transformers for Low-Bitrate High-Quality Speech Coding	paper
2024-10	DM-Codec	DM-Codec: Distilling Multimodal Representations for Speech Tokenization	paper
2024-09	Mimi	Moshi: a speech-text foundation model for real-time dialogue	paper
2024-09	NDVQ	NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization	paper
2024-09	SoCodec	SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis	paper
2024-09	BigCodec	BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec	paper
2024-08	X-Codec	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	paper
2024-08	WavTokenizer	WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling	paper
2024-07	Super-Codec	SuperCodec: A Neural Speech Codec with Selective Back-Projection Network	paper
2024-07	dMel	dMel: Speech Tokenization made Simple	paper
2024-06	CodecFake	CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems	paper
2024-06	Single-Codec	Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation	paper
2024-06	SQ-Codec	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	paper
2024-06	PQ-VAE	Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder	paper
2024-06	LLM-Codec	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	paper
2024-05	HILCodec	HILCodec: High Fidelity and Lightweight Neural Audio Codec	paper
2024-04	SemantiCodec	SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound	paper
2024-04	PromptCodec	PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders	paper
2024-04	ESC	ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers	paper
2024-03	FACodec	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	paper
2024-02	AP-Codec	APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding	paper
2024-02	Language-Codec	Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models	paper
2024-01	ScoreDec	ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter	paper
2023-11	HierSpeech++	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	paper
2023-10	TiCodec	Fewer-token Neural Speech Codec with Time-invariant Codes	paper
2023-09	RepCodec	RepCodec: A Speech Representation Codec for Speech Tokenization	paper
2023-09	FunCodec	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	paper
2023-08	SpeechTokenizer	Speechtokenizer: Unified speech tokenizer for speech large language models	paper
2023-06	Vocos	Vocos: Closing the Gap between Time-domain and Fourier-based Neural Vocoders for High-quality Audio Synthesis	paper
2023-06	Descript-audio-codec	High-Fidelity Audio Compression with Improved RVQGAN	paper
2023-05	AudioDec	Audiodec: An open-source streaming highfidelity neural audio codec	paper
2023-05	HiFi-Codec	Hifi-codec: Group-residual vector quantization for high fidelity audio codec	paper
2023-03	LMCodec	LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models	paper
2022-11	Disen-TF-Codec	Disentangled Feature Learning for Real-Time Neural Speech Coding	paper
2022-10	EnCodec	High fidelity neural audio compression	paper
2022-07	S-TFNet	Cross-Scale Vector Quantization for Scalable Neural Speech Coding	paper
2022-01	TFNet	End-to-End Neural Speech Coding for Real-Time Communications	paper
2021-07	SoundStream	SoundStream: An End-to-End Neural Audio Codec	paper

Speech/Audio Representation Models

Date	Model Name	Paper Title	Link
2024-09	NEST-RQ	NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training	paper
2024-01	EAT	Self-Supervised Pre-Training with Efficient Audio Transformer	paper
2023-10	MR-HuBERT	Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction	paper
2023-10	SpeechFlow	Generative Pre-training for Speech with Flow Matching	paper
2023-09	WavLabLM	Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning	paper
2023-08	W2v-BERT 2.0	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Whisper-AT	Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers	paper
2023-06	ATST	Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks	paper
2023-05	SPIN	Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering	paper
2023-05	DinoSR	Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning	paper
2023-05	NFA	Self-supervised neural factor analysis for disentangling utterance-level speech representations	paper
2022-12	Data2vec 2.0	Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language	paper
2022-12	BEATs	Audio Pre-Training with Acoustic Tokenizers	paper
2022-11	MT4SSL	MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets	paper
2022-08	DINO	Non-contrastive self-supervised learning of utterance-level speech representations	paper
2022-07	Audio-MAE	Masked Autoencoders that Listen	paper
2022-04	MAESTRO	Matched Speech Text Representations through Modality Matching	paper
2022-03	MAE-AST	Masked Autoencoding Audio Spectrogram Transformer	paper
2022-03	LightHuBERT	Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT	paper
2022-02	Data2vec	A General Framework for Self-supervised Learning in Speech, Vision and Language	paper
2021-10	WavLM	WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing	paper
2021-08	W2v-BERT	Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training	paper
2021-07	mHuBERT	Direct speech-to-speech translation with discrete units	paper
2021-06	HuBERT	Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units	paper
2021-03	BYOL-A	Self-Supervised Learning for General-Purpose Audio Representation	paper
2020-12	DeCoAR2.0	DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization	paper
2020-07	TERA	TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech	paper
2020-06	Wav2vec2.0	wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	paper
2019-10	APC	Generative Pre-Training for Speech with Autoregressive Predictive Coding	paper
2018-07	CPC	Representation Learning with Contrastive Predictive Coding	paper

🔱 Related Repository
