Semantic language models: 基于semantic token的自回归模型,常外接一个unit-vocoder来恢复语音,比如SpeechGPT。这类模型虽然可以完成一些语音内容相关的任务,但是它们产生的音质比较一般,并且无法完成一些副语言学相关的任务,比如音色转换等。
Acoustic language models: 基于acoustic token的speech LM,比如VALL-E。这类模型产生的语音音质比较好,并且可以较好地完成一些比如zero-shot TTS的任务,但是会存在内容不准确的问题。
Hierarchical speech language models: 这类模型由Semantic language models和Acoustic language models 级联而成,既可以产生比较精确的内容,也可以产生较好的音质,比如AudioPaLM。但是这类模型,建模阶段太多,较为复杂,需要两种tokenizer的参与;而且在semantic token和acoustic token之间其实存在有很大的信息冗余,会带来一些不必要的建模难度。
文章提出了一个又快又准的并行transformer模型,可以克服上面提到的两个挑战。首先,不像前面的基于CTC的工作,作者提出了使用基于CIF【continuous integrate-and-fire】的predictor网络评估目标长度并产生隐变量。对于第二个挑战,作者设计了基于GLM【glancing language mode】的sampler模块增强非自回归解码器对输出上下文的建模能力。这个工作受到了机器翻译工作的启发。作者另外设计了一个包含负例的策略,利用MWER损失指导模型学习提升模型性能。
直接训练完全并行生成来学习目标语句中词之间的依赖关系对模型并不友好。一种更为简单有效的依赖关系学习方式是根据部分输入词预测其余目标词。但是这种学习方式需要部分目标词作为输入,不符合非自回归模型并行生成的要求。作者观察到随着模型自身更好地学习到词之间的依赖关系,模型对于依赖关系的学习可以逐渐摆脱使用目标语句部分词作为输入的需求。基于以上观察,Glancing Transformer(GLAT)利用了一种 glancing language model 的方法,通过渐进学习的方式进行词之间依赖关系的建模。在渐进学习的过程中,模型会先学习并行输出一些较为简单的语句片段,然后逐渐学习整句话的单步并行生成。
[18] L. Dong and B. Xu, “CIF: Continuous integrate-and-fire for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE Interna-tional Conference on Acoustics, Speech and Signal Processing(ICASSP). IEEE, 2020, pp. 6079–6083. [19] L. Qian, H. Zhou, Y. Bao, M. Wang, L. Qiu, W. Zhang, Y. Yu,and L. Li, “Glancing transformer for non-autoregressive neural machine translation,” arXiv preprint arXiv:2008.07905, 2020. [20] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4839–4843
###################################
# Sample a speaker from Gaussian.
rand_spk = chat.sample_random_speaker()
print(rand_spk) # save it for later timbre recovery
params_infer_code = ChatTTS.Chat.InferCodeParams(
spk_emb = rand_spk, # add sampled speaker
temperature = .3, # using custom temperature
top_P = 0.7, # top P decode
top_K = 20, # top K decode
)
###################################
# For sentence level manual control.
# use oral_(0-9), laugh_(0-2), break_(0-7)
# to generate special token in text to synthesize.
params_refine_text = ChatTTS.Chat.RefineTextParams(
prompt='[oral_2][laugh_0][break_6]',
)
wavs = chat.infer(
texts,
params_refine_text=params_refine_text,
params_infer_code=params_infer_code,
)
###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)
torchaudio.save("output2.wav", torch.from_numpy(wavs[0]), 24000)
S and E denote the start and end of sequence, respectively.T is “turn of speech” tokens. 𝐯 is a speaker embedding vector extracted from the speech X with a pre-trained voice-print model2. The text encodings Y¯={𝐲¯u}u∈[1:U] is obtained by passing the text through a Byte Pair Encoded (BPE) tokenizer and text encoder:
在语音识别中,我们的数据集是音频文件和其对应的文本,不幸的是,音频文件和文本很难在单词的单位上对齐。除了语言识别,在OCR,机器翻译中,都存在类似的Sequence to Sequence结构,同样也需要在预处理操作时进行对齐,但是这种对齐有时候是非常困难的。如果不使用对齐而直接训练模型时,由于人的语速的不同,或者字符间距离的不同,导致模型很难收敛。
当我们主要关注文本和语音模态时,GPT-4o其实就是一个语音语言模型(speech language model, SLM)。该SLM同时具备语音理解能力和语音合成能力,输入端和输出端均支持文本和语音的混合多模态。那么,这一SLM应该如何实现呢?在大语言模型(large language model, LLM)滥觞的今日,不难想到这样一种方法:将连续的语音数据离散化成如同单词(或者称token,词元)一样的表示,并入到LLM的词表中,再走一遍训练LLM的老路。
audio & text tokenizer的实现应该是语音离散化部分所用的技术,例如SoundStream、Encodec、SpeechTokenizer,或者是MEL+VQ最后配合声码器来解码;参考zero-shot TTS、AudioLM/AudioPaLM、SpeechGPT-Gen等工作的结果,LLM中语音token的解码应该是要走层次化或者多步的方法,先解码语义特征,再解码声学特征,或者是先解码MEL,再加一个HIFIGAN这样的声码器。另外,如果做audio/speech/music这样的通用声合成的话,可能也能通过prompt来控制。AudioLDM2虽然做了这方面的工作,但audio/music和speech的参数其实是不一样的,说到底还不是同一个模型。
[1] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in neural information processing systems, 2020, 33: 12449-12460.
[2] Hsu W N, Bolte B, Tsai Y H H, et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[3] Chung Y A, Zhang Y, Han W, et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training[C]//2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021: 244-250.
[4] Van Den Oord A, Vinyals O. Neural discrete representation learning[J]. Advances in neural information processing systems, 2017, 30.
[5] Zeghidour N, Luebs A, Omran A, et al. Soundstream: An end-to-end neural audio codec[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 495-507.
[6] Défossez A, Copet J, Synnaeve G, et al. High fidelity neural audio compression[J]. arXiv preprint arXiv:2210.13438, 2022.
[7] Zhang X, Zhang D, Li S, et al. Speechtokenizer: Unified speech tokenizer for speech large language models[J]. arXiv preprint arXiv:2308.16692, 2023.
[8] Borsos Z, Marinier R, Vincent D, et al. Audiolm: a language modeling approach to audio generation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[9] Rubenstein P K, Asawaroengchai C, Nguyen D D, et al. Audiopalm: A large language model that can speak and listen[J]. arXiv preprint arXiv:2306.12925, 2023.
[10] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang. SALMONN: Towards Generic Hearing Abilities for Large Language Models
[11] Zhang D, Li S, Zhang X, et al. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities[J]. arXiv preprint arXiv:2305.11000, 2023.
[16] Wang C, Chen S, Wu Y, et al. Neural codec language models are zero-shot text to speech synthesizers[J]. arXiv preprint arXiv:2301.02111, 2023.
[17] Anil R, Dai A M, Firat O, et al. Palm 2 technical report[J]. arXiv preprint arXiv:2305.10403, 2023.
[18] Lee Y, Yeon I, Nam J, et al. VoiceLDM: Text-to-Speech with Environmental Context[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024: 12566-12571.
[19] Lyth D, King S. Natural language guidance of high-fidelity text-to-speech with synthetic annotations[J]. arXiv preprint arXiv:2402.01912, 2024.
[20] Betker J. Better speech synthesis through scaling[J]. arXiv preprint arXiv:2305.07243, 2023.
[21] Xin D, Tan X, Shen K, et al. RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis[J]. arXiv preprint arXiv:2404.03204, 2024.
[22] Wang C, Zeng C, Zhang B, et al. HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling[J]. arXiv preprint arXiv:2403.05989, 2024.
[23] Ren Y, Hu C, Tan X, et al. Fastspeech 2: Fast and high-quality end-to-end text to speech[J]. arXiv preprint arXiv:2006.04558, 2020.
[24] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.
[25] Shen K, Ju Z, Tan X, et al. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers[J]. arXiv preprint arXiv:2304.09116, 2023.
[26] Ju Z, Wang Y, Shen K, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models[J]. arXiv preprint arXiv:2403.03100, 2024.
[27] Liu H, Tian Q, Yuan Y, et al. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining[J]. arXiv preprint arXiv:2308.05734, 2023.
[28] Jiang Z, Ren Y, Ye Z, et al. Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias[J]. arXiv preprint arXiv:2306.03509, 2023.
[29] Jiang Z, Liu J, Ren Y, et al. Mega-tts 2: Zero-shot text-to-speech with arbitrary length speech prompts[J]. arXiv preprint arXiv:2307.07218, 2023.
[30] Łajszczak M, Cámbara G, Li Y, et al. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data[J]. arXiv preprint arXiv:2402.08093, 2024.
[31] Li Y A, Han C, Mesgarani N. Styletts: A style-based generative model for natural and diverse text-to-speech synthesis[J]. arXiv preprint arXiv:2205.15439, 2022.
[32] Li Y A, Han C, Raghavan V, et al. Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models[J]. Advances in Neural Information Processing Systems, 2024, 36.
[33] Guo Z, Leng Y, Wu Y, et al. Prompttts: Controllable text-to-speech with text descriptions[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5.
[34] Yang D, Liu S, Huang R, et al. Instructtts: Modelling expressive TTS in discrete latent space with natural language style prompt[J]. arXiv preprint arXiv:2301.13662, 2023.
[35] Vyas A, Shi B, Le M, et al. Audiobox: Unified audio generation with natural language prompts[J]. arXiv preprint arXiv:2312.15821, 2023.
[36] Lee S H, Choi H Y, Kim S B, et al. HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis[J]. arXiv preprint arXiv:2311.12454, 2023.
[37] Yang D, Tian J, Tan X, et al. Uniaudio: An audio foundation model toward universal audio generation[J]. arXiv preprint arXiv:2310.00704, 2023.
[38] Huang R, Zhang C, Wang Y, et al. Make-a-voice: Unified voice synthesis with discrete representation[J]. arXiv preprint arXiv:2305.19269, 2023.
Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.
Recent breakthroughs in generative modeling of images have been predicated on the availability of high-quality and large-scale datasebts such as MNIST, CIFAR and ImageNet. We recognized the need for an audio dataset that was as approachable as those in the image domain.
Audio signals found in the wild contain multi-scale dependencies that prove particularly difficult to model, leading many previous efforts at data-driven audio synthesis to focus on more constrained domains such as texture synthesis or training small parametric models.
We encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning. We also view NSynth as a building block for future datasets and envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.
Description
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI pian o (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
Some instruments are not capable of producing all 88 pitches in this range, resulting in an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.
We also annotated each of the notes with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:
Source: The method of sound production for the note’s instrument. This can be one of acoustic or electronic for instruments that were recorded from acoustic or electronic instruments, respectively, or synthetic for synthesized instruments. See their frequencies below.
Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. See the complete list and their frequencies below.
Qualities: Sonic qualities of the note. See the quality descriptions and their co-occurrences below. Each note is annotated with zero or more qualities.
Format
Files
The NSynth dataset can be download in two formats:
Train [tfrecord | json/wav]: A training set with 289,205 examples. Instruments do not overlap with valid or test.
Valid [tfrecord | json/wav]: A validation set with 12,678 examples. Instruments do not overlap with train.
Test [tfrecord | json/wav]: A test set with 4,096 examples. Instruments do not overlap with train.
Below we detail how the note features are encoded in the Example protocol buffers and JSON files.
Example Features
Each Example contains the following features.
Feature
Type
Description
note
int64
A unique integer identifier for the note.
note_str
bytes
A unique string identifier for the note in the format <instrument_str>-<pitch>-<velocity>.
instrument
int64
A unique, sequential identifier for the instrument the note was synthesized from.
instrument_str
bytes
A unique string identifier for the instrument this note was synthesized from in the format <instrument_family_str>-<instrument_production_str>-<instrument_name>.
pitch
int64
The 0-based MIDI pitch in the range [0, 127].
velocity
int64
The 0-based MIDI velocity in the range [0, 127].
sample_rate
int64
The samples per second for the audio feature.
audio*
[float]
A list of audio samples represented as floating point values in the range [-1,1].
qualities
[int64]
A binary vector representing which sonic qualities are present in this note.
qualities_str
[bytes]
A list IDs of which qualities are present in this note selected from the sonic qualities list.