Awesome Speech LM Survey-语音大模型综述

在这个代码库中,我们研究了以下三个关键领域:(1) 表征学习,(2) 神经编解码器,以及 (3) 语言模型,这些领域共同推动了语音/音频大语言模型的发展。

  1. 语音表征模型:这些模型专注于学习语音的结构化表征,随后将其量化为离散的语音标记,通常被称为语义tokens
  2. 语音神经编解码模型:这些模型旨在学习语音和音频的离散标记,通常被称为声学tokens,同时保持良好的重构能力和低比特率。
  3. 语音大语言模型这些模型基于语音和声学token,采用语言建模方法进行训练,在语音理解和语音生成任务中展现出较高的能力。

Existing SpeechLMs

Model Title Url
OpenAI Advanced Voice Mode OpenAI Advanced Voice Mode Link
Claude Voice Mode Claude Voice Mode Link
MindGPT-4o-Audio 理想同学MindGPT-4o-Audio实时语音对话大模型发布 Link
VITA-Audio VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Link
Voila Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Link
Kimi-Audio Kimi-Audio Technical Report Link
Lyra Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition Link
Flow-Omni Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners Link
NTPP NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction Link
Qwen2.5-Omni Qwen2.5-Omni Technical Report Link
CSM Conversational Speech Generation Model Link
Minmo MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Link
Slamming Slamming: Training a Speech Language Model on One GPU in a Day Link
VITA-1.5 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Link
Baichuan-Audio Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction Link
Step-Audio Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction Link
MiniCPM-o A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone Link
SyncLLM Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents Link
OmniFlatten OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation Link
SLAM-Omni SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training Link
GLM-4-Voice GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot Link
Scaling Speech-Text Pre-training with Synthetic Interleaved Data Link
SALMONN-omni SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation Link
Mini-Omni2 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities Link
Uniaudio Uniaudio: An audio foundation model toward universal audio generation Link
Parrot Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers Link
Moshi Moshi: a speech-text foundation model for real-time dialogue Link
Freeze-Omni Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM Link
EMOVA EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions Link
IntrinsicVoice IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities Link
LSLM Language Model Can Listen While Speaking Link
SpiRit-LM SpiRit-LM: Interleaved Spoken and Written Language Model Link
SpeechGPT-Gen SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation Link
Spectron Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM Link
SUTLM Toward Joint Language Modeling for Speech Units and Text Link
tGSLM Generative Spoken Language Model based on continuous word-sized audio tokens Link
LauraGPT LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT Link
VoxtLM VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks Link
VITA VITA: Towards Open-Source Interactive Omni Multimodal LLM Link
FunAudioLLM FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Link
Voicebox Voicebox: Text-guided multilingual universal speech generation at scale Link
LLaMA-Omni LLaMA-Omni: Seamless Speech Interaction with Large Language Models Link
Mini-Omni Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming Link
TWIST Textually pretrained speech language models Link
GPST Generative pre-trained speech language model with efficient hierarchical transformer Link
AudioPaLM AudioPaLM: A Large Language Model That Can Speak and Listen Link
VioLA VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation Link
SpeechGPT Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities Link
dGSLM Generative spoken dialogue language modeling Link
pGSLM Text-Free Prosody-Aware Generative Spoken Language Modeling Link
GSLM On generative spoken language modeling from raw audio Link

SpeechLM Tokenizers

Semantic Tokenizers

Name Title Url
Whisper Robust Speech Recognition via Large-Scale Weak Supervision Link
CosyVoice CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens Link
Google USM Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages Link
WavLM WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing Link
HuBERT HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units Link
W2v-bert W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training Link
Wav2vec 2.0 wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations Link

Acoustic Tokenizers

Name Title Url
WavTokenizer WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Link
SNAC SNAC: Multi-Scale Neural Audio Codec Link
Encodec High Fidelity Neural Audio Compression Link
SoundStream SoundStream: An End-to-End Neural Audio Codec Link

Mixed Tokenizers

Name Title Url
SpeechTokenizer SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models Link
Mimi Moshi: a speech-text foundation model for real-time dialogue Link

Popular Training Datasets

Dataset Type Phase Hours Year
LibriSpeech ASR Pre-Training 1k 2015
Multilingual LibriSpeech ASR Pre-Training 50.5k 2020
LibriLight ASR Pre-Training 60k 2019
People dataset ASR Pre-Training 30k 2021
VoxPopuli ASR Pre-Training 1.6k 2021
Gigaspeech ASR Pre-Training 40k 2021
Common Voice ASR Pre-Training 2.5k 2019
VCTK ASR Pre-Training 0.3k 2017
WenetSpeech ASR Pre-Training 22k 2022
LibriTTS TTS Pre-Training 0.6k 2019
CoVoST2 S2TT Pre-Training 2.8k 2020
CVSS S2ST Pre-Training 1.9k 2022
VoxCeleb Speaker Identification Pre-Training 0.4k 2017
VoxCeleb2 Speaker Identification Pre-Training 2.4k 2018
Spotify Podcasts Podcast Pre-Training 47k 2020
Fisher Telephone conversation Pre-Training 2k 2004
SpeechInstruct Instruction-following Instruction-Tuning 2023
InstructS2S-200K Instruction-following Instruction-Tuning 2024
VoiceAssistant-400K Instruction-following Instruction-Tuning 2024

Evaluation Benchmarks

Name Eval Type # Tasks Audio Type I/O
ABX Representation 1 Speech A→−
sWUGGY Linguistic 1 Speech A→−
sBLIMP Linguistic 1 Speech A→−
sStoryCloze Linguistic 1 Speech A/T→−
STSP Paralinguistic 1 Speech A/T→A/T
MMAU Downstream 27 Speech, Sound, Music A→T
Audiobench Downstream 8 Speech, Sound A→T
AIR-Bench Downstream 20 Speech, Sound, Music A→T
SD-Eval Downstream 4 Speech A→T
SUPERB Downstream 10 Speech A→T
Dynamic-SUPERB Downstream 180 Speech, Sound, Music A→T
SALMON Downstream 8 Speech A→−
VoiceBench Downstream 8 Speech A→A
VoxEval Downstream 56 Speech A→A

🔱 Speech/Audio Language Models

Date Model Name Paper Title Link
2024-11 Building a Taiwanese Mandarin Spoken Language Model: A First Attempt Paper
2024-11 Ultravox Ultravox: An open-weight alternative to GPT-4o Realtime Blog
2024-11 hertz-dev blog GitHub
2024-11 Freeze-Omni Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM paper
2024-11 Align-SLM Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback paper
2024-10 Ichigo Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant papercode
2024-10 OmniFlatten OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation paper
2024-10 GPT-4o GPT-4o System Card paper
2024-10 Baichuan-OMNI Baichuan-Omni Technical Report paper
2024-10 GLM-4-Voice GLM-4-Voice GitHub
2024-10 Roadmap towards Superhuman Speech Understanding using Large Language Models paper
2024-10 SALMONN-OMNI SALMONN-OMNI: A SPEECH UNDERSTANDING AND GENERATION LLM IN A CODEC-FREE FULL-DUPLEX FRAMEWORK paper
2024-10 Mini-Omni 2 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities paper
2024-10 HALL-E HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis paper
2024-10 SyllableLM SyllableLM: Learning Coarse Semantic Units for Speech Language Models paper
2024-09 Moshi Moshi: a speech-text foundation model for real-time dialogue paper
2024-09 Takin AudioLLM Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models paper
2024-09 FireRedTTS FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications paper
2024-09 LLaMA-Omni LLaMA-Omni: Seamless Speech Interaction with Large Language Models paper
2024-09 MaskGCT MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer paper
2024-09 SSR-Speech SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis paper
2024-09 MoWE-Audio MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders paper
2024-08 Mini-Omni Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming paper
2024-08 Make-A-Voice 2 Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner paper
2024-08 LSLM Language Model Can Listen While Speaking paper
2024-06 SimpleSpeech SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models paper
2024-06 UniAudio 1.5 UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner paper
2024-06 VALL-E R VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment paper
2024-06 VALL-E 2 VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers paper
2024-06 GPST Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer paper
2024-04 CLaM-TTS CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech paper
2024-04 RALL-E RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis paper
2024-04 WavLLM WavLLM: Towards Robust and Adaptive Speech Large Language Model paper
2024-02 MobileSpeech MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech paper
2024-02 SLAM-ASR An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper
2024-02 AnyGPT AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling paper
2024-02 SpiRit-LM SpiRit-LM: Interleaved Spoken and Written Language Model paper
2024-02 USDM Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation paper
2024-02 BAT BAT: Learning to Reason about Spatial Sounds with Large Language Models paper
2024-02 Audio Flamingo Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities paper
2024-02 Text Description to speech Natural language guidance of high-fidelity text-to-speech with synthetic annotations paper
2024-02 GenTranslate GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators paper
2024-02 Base-TTS BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data paper
2024-02 It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition paper
2024-01 Large Language Models are Efficient Learners of Noise-Robust Speech Recognition paper
2024-01 ELLA-V ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering paper
2023-12 Seamless Seamless: Multilingual Expressive and Streaming Speech Translation paper
2023-11 Qwen-Audio Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models paper
2023-10 LauraGPT LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT paper
2023-10 SALMONN SALMONN: Towards Generic Hearing Abilities for Large Language Models paper
2023-10 UniAudio UniAudio: An Audio Foundation Model Toward Universal Audio Generation paper
2023-10 Whispering LLaMA Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition paper
2023-09 VoxtLM Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks paper
2023-09 LTU-AS Joint Audio and Speech Understanding paper
2023-09 SLM SLM: Bridge the thin gap between speech and text foundation models paper
2023-09 Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting paper
2023-08 SpeechGen SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts paper
2023-08 SpeechX SpeechX: Neural Codec Language Model as a Versatile Speech Transformer paper
2023-08 LLaSM Large Language and Speech Model paper
2023-08 SeamlessM4T Massively Multilingual & Multimodal Machine Translation paper
2023-07 Speech-LLaMA On decoder-only architecture for speech-to-text and large language model integration paper
2023-07 LLM-ASR(temp.) Prompting Large Language Models with Speech Recognition Abilities paper
2023-06 AudioPaLM AudioPaLM: A Large Language Model That Can Speak and Listen paper
2023-05 Make-A-Voice Make-A-Voice: Unified Voice Synthesis With Discrete Representation paper
2023-05 Spectron Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM paper
2023-05 TWIST Textually Pretrained Speech Language Models paper
2023-05 Pengi Pengi: An Audio Language Model for Audio Tasks paper
2023-05 SoundStorm Efficient Parallel Audio Generation paper
2023-05 LTU Joint Audio and Speech Understanding paper
2023-05 SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities paper
2023-05 VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation paper
2023-05 X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages paper
2023-03 Google USM Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages paper
2023-03 VALL-E X Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling paper
2023-02 SPEAR-TTS Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision paper
2023-01 VALL-E Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers paper
2022-12 Whisper Robust Speech Recognition via Large-Scale Weak Supervision paper
2022-10 AudioGen AudioGen: Textually Guided Audio Generation paper
2022-09 AudioLM AudioLM: a Language Modeling Approach to Audio Generation paper
2022-05 Wav2Seq Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages paper
2022-04 Unit mBART Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation paper
2022-03 d-GSLM Generative Spoken Dialogue Language Modeling paper
2021-10 SLAM SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training paper
2021-09 p-GSLM Text-Free Prosody-Aware Generative Spoken Language Modeling paper
2021-02 GSLM Generative Spoken Language Modeling from Raw Audio paper

🔱 Speech/Audio Codec Models

Date Model Name Paper Title Link
2024-11 PyramidCodec PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain paper
2024-11 UniCodec Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations paper
2024-11 SimVQ Addressing Representation Collapse in Vector Quantized Models with One Linear Layer paper
2024-11 MDCTCodec MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios paper
2024-10 APCodec+ APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm paper
2024-10 A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation paper
2024-10 SNAC SNAC: Multi-Scale Neural Audio Codec paper
2024-10 LSCodec LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec paper
2024-10 Co-design for codec and codec-LM TOWARDS CODEC-LM CO-DESIGN FOR NEURAL CODEC LANGUAGE MODELS paper
2024-10 VChangeCodec VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication paper
2024-10 DC-Spin DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models paper
2024-10 TAAE Scaling Transformers for Low-Bitrate High-Quality Speech Coding paper
2024-10 DM-Codec DM-Codec: Distilling Multimodal Representations for Speech Tokenization paper
2024-09 Mimi Moshi: a speech-text foundation model for real-time dialogue paper
2024-09 NDVQ NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization paper
2024-09 SoCodec SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis paper
2024-09 BigCodec BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec paper
2024-08 X-Codec Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model paper
2024-08 WavTokenizer WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling paper
2024-07 Super-Codec SuperCodec: A Neural Speech Codec with Selective Back-Projection Network paper
2024-07 dMel dMel: Speech Tokenization made Simple paper
2024-06 CodecFake CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems paper
2024-06 Single-Codec Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation paper
2024-06 SQ-Codec SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models paper
2024-06 PQ-VAE Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder paper
2024-06 LLM-Codec UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner paper
2024-05 HILCodec HILCodec: High Fidelity and Lightweight Neural Audio Codec paper
2024-04 SemantiCodec SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound paper
2024-04 PromptCodec PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders paper
2024-04 ESC ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers paper
2024-03 FACodec NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models paper
2024-02 AP-Codec APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding paper
2024-02 Language-Codec Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models paper
2024-01 ScoreDec ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter paper
2023-11 HierSpeech++ HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis paper
2023-10 TiCodec FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES paper
2023-09 RepCodec RepCodec: A Speech Representation Codec for Speech Tokenization paper
2023-09 FunCodec FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec paper
2023-08 SpeechTokenizer Speechtokenizer: Unified speech tokenizer for speech large language models paper
2023-06 VOCOS VOCOS: CLOSING THE GAP BETWEEN TIME-DOMAIN AND FOURIER-BASED NEURAL VOCODERS FOR HIGH-QUALITY AUDIO SYNTHESIS paper
2023-06 Descript-audio-codec High-Fidelity Audio Compression with Improved RVQGAN paper
2023-05 AudioDec Audiodec: An open-source streaming highfidelity neural audio codec paper
2023-05 HiFi-Codec Hifi-codec: Group-residual vector quantization for high fidelity audio codec paper
2023-03 LMCodec LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models paper
2022-11 Disen-TF-Codec Disentangled Feature Learning for Real-Time Neural Speech Coding paper
2022-10 EnCodec High fidelity neural audio compression paper
2022-07 S-TFNet Cross-Scale Vector Quantization for Scalable Neural Speech Coding paper
2022-01 TFNet End-to-End Neural Speech Coding for Real-Time Communications paper
2021-07 SoundStream SoundStream: An End-to-End Neural Audio Codec paper

Speech/Audio Representation Models

Date Model Name Paper Title Link
2024-09 NEST-RQ NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training paper
2024-01 EAT Self-Supervised Pre-Training with Efficient Audio Transformer paper
2023-10 MR-HuBERT Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction paper
2023-10 SpeechFlow Generative Pre-training for Speech with Flow Matching paper
2023-09 WavLabLM Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning paper
2023-08 W2v-BERT 2.0 Massively Multilingual & Multimodal Machine Translation paper
2023-07 Whisper-AT Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers paper
2023-06 ATST Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks paper
2023-05 SPIN Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering paper
2023-05 DinoSR Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning paper
2023-05 NFA Self-supervised neural factor analysis for disentangling utterance-level speech representations paper
2022-12 Data2vec 2.0 Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language paper
2022-12 BEATs Audio Pre-Training with Acoustic Tokenizers paper
2022-11 MT4SSL MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets paper
2022-08 DINO Non-contrastive self-supervised learning of utterance-level speech representations paper
2022-07 Audio-MAE Masked Autoencoders that Listen paper
2022-04 MAESTRO Matched Speech Text Representations through Modality Matching paper
2022-03 MAE-AST Masked Autoencoding Audio Spectrogram Transformer paper
2022-03 LightHuBERT Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT paper
2022-02 Data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language paper
2021-10 WavLM WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing paper
2021-08 W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training paper
2021-07 mHuBERT Direct speech-to-speech translation with discrete units paper
2021-06 HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units paper
2021-03 BYOL-A Self-Supervised Learning for General-Purpose Audio Representation paper
2020-12 DeCoAR2.0 DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization paper
2020-07 TERA TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech paper
2020-06 Wav2vec2.0 wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations paper
2019-10 APC Generative Pre-Training for Speech with Autoregressive Predictive Coding paper
2018-07 CPC Representation Learning with Contrastive Predictive Coding paper

🔱 Related Repository

发表评论

您的电子邮箱地址不会被公开。 必填项已用*标注