Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.
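As a small, hedged illustration of the torchaudio I/O and transform APIs (the file name below is a placeholder), loading a waveform and computing a mel spectrogram looks roughly like this:

```python
import torchaudio
import torchaudio.transforms as T

# Load a waveform and its sample rate from disk ("note.wav" is a placeholder path).
waveform, sample_rate = torchaudio.load("note.wav")

# Resample to 16 kHz and compute a mel spectrogram with torchaudio transforms.
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
mel = T.MelSpectrogram(sample_rate=16000, n_mels=64)
features = mel(resampler(waveform))
print(features.shape)  # (channels, n_mels, time_frames)
```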
Recent breakthroughs in generative modeling of images have been predicated on the availability of high-quality and large-scale datasets such as MNIST, CIFAR and ImageNet. We recognized the need for an audio dataset that was as approachable as those in the image domain.
Audio signals found in the wild contain multi-scale dependencies that prove particularly difficult to model, leading many previous efforts at data-driven audio synthesis to focus on more constrained domains such as texture synthesis or training small parametric models.
We encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning. We also view NSynth as a building block for future datasets and envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.
Description
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four-second, monophonic, 16 kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
Some instruments are not capable of producing all 88 pitches in this range, resulting in an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.
We also annotated each of the notes with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:
Source: The method of sound production for the note’s instrument. This can be one of acoustic or electronic for instruments that were recorded from acoustic or electronic instruments, respectively, or synthetic for synthesized instruments. See their frequencies below.
Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. See the complete list and their frequencies below.
Qualities: Sonic qualities of the note. See the quality descriptions and their co-occurrences below. Each note is annotated with zero or more qualities.
Format
Files
The NSynth dataset can be downloaded in two formats:
Train [tfrecord | json/wav]: A training set with 289,205 examples. Instruments do not overlap with valid or test.
Valid [tfrecord | json/wav]: A validation set with 12,678 examples. Instruments do not overlap with train.
Test [tfrecord | json/wav]: A test set with 4,096 examples. Instruments do not overlap with train.
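The following sketch shows one way to work with the json/wav download. It assumes the archive unpacks into an examples.json file of per-note metadata plus an audio/ directory of <note_str>.wav files; verify the layout against your copy, since the directory and key names here are assumptions.

```python
import json
import os

import torchaudio

NSYNTH_DIR = "nsynth-valid"  # assumed extraction directory for the valid split

# examples.json maps each note_str to its feature dictionary (pitch, velocity, qualities, ...).
with open(os.path.join(NSYNTH_DIR, "examples.json")) as f:
    examples = json.load(f)

note_str, features = next(iter(examples.items()))
print(note_str, features["pitch"], features["velocity"])

# Each note's audio is stored as a 16 kHz wav file named after its note_str.
waveform, sample_rate = torchaudio.load(os.path.join(NSYNTH_DIR, "audio", note_str + ".wav"))
print(waveform.shape, sample_rate)  # roughly (1, 64000), 16000
```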
Below we detail how the note features are encoded in the Example protocol buffers and JSON files.
Example Features
Each Example contains the following features.
Feature | Type | Description
--- | --- | ---
note | int64 | A unique integer identifier for the note.
note_str | bytes | A unique string identifier for the note in the format <instrument_str>-<pitch>-<velocity>.
instrument | int64 | A unique, sequential identifier for the instrument the note was synthesized from.
instrument_str | bytes | A unique string identifier for the instrument this note was synthesized from in the format <instrument_family_str>-<instrument_production_str>-<instrument_name>.
pitch | int64 | The 0-based MIDI pitch in the range [0, 127].
velocity | int64 | The 0-based MIDI velocity in the range [0, 127].
sample_rate | int64 | The samples per second for the audio feature.
audio* | [float] | A list of audio samples represented as floating point values in the range [-1, 1].
qualities | [int64] | A binary vector representing which sonic qualities are present in this note.
qualities_str | [bytes] | A list of IDs of the qualities present in this note, selected from the sonic qualities list.
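For the tfrecord download, these features can be parsed from the serialized Example protos with TensorFlow. The feature specification below is a minimal sketch derived from the table; the file name is a placeholder, and the fixed audio length of 64,000 samples is an assumption based on four seconds at 16 kHz.

```python
import tensorflow as tf

# Feature spec mirroring the table above; audio length assumes 4 s at 16 kHz = 64,000 samples.
feature_spec = {
    "note": tf.io.FixedLenFeature([], tf.int64),
    "note_str": tf.io.FixedLenFeature([], tf.string),
    "instrument": tf.io.FixedLenFeature([], tf.int64),
    "instrument_str": tf.io.FixedLenFeature([], tf.string),
    "pitch": tf.io.FixedLenFeature([], tf.int64),
    "velocity": tf.io.FixedLenFeature([], tf.int64),
    "sample_rate": tf.io.FixedLenFeature([], tf.int64),
    "audio": tf.io.FixedLenFeature([64000], tf.float32),
    "qualities": tf.io.VarLenFeature(tf.int64),
    "qualities_str": tf.io.VarLenFeature(tf.string),
}

dataset = tf.data.TFRecordDataset("nsynth-valid.tfrecord")  # placeholder path
for raw in dataset.take(1):
    example = tf.io.parse_single_example(raw, feature_spec)
    print(example["note_str"].numpy(), example["pitch"].numpy(), example["audio"].shape)
```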
In vector quantization coding, the key issues are codebook construction and the codeword search algorithm. For a general overview of vector quantization, the "Fundamentals of Vector Quantization" chapter of the Handbook of Image and Video Processing is highly recommended. Below we briefly introduce two representative families of vector quantization methods: multi-stage vector quantization, and product quantization together with its improvements.
As shown in the figure above, a vector x to be quantized first passes through the first-stage quantizer (quantizer1), leaving a quantization residual r1 = x − C1·b1, where C1 is the codebook of the first-stage quantizer and b1 is the code assignment of x produced by that quantizer. The first-stage residual r1 is then fed into the second-stage quantizer, and the subsequent stages proceed in the same way. With this cascaded quantization scheme, as the number of quantizer stages grows to infinity, x can be represented exactly by the combined codebooks. The right-hand subfigure gives an intuitive picture of x being progressively approximated by the successive codebooks.
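A minimal sketch of this cascade in NumPy: each stage quantizes the residual left by the previous stage, and the reconstruction is the sum of the selected codewords. The codebooks here are random placeholders purely for illustration; in practice each stage's codebook would be learned, e.g. with k-means on that stage's residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(residual, codebook):
    """Return the index of the nearest codeword and the remaining residual."""
    distances = np.linalg.norm(codebook - residual, axis=1)
    idx = int(np.argmin(distances))
    return idx, residual - codebook[idx]

dim, codebook_size, num_stages = 8, 16, 3
# Placeholder codebooks C_1..C_k; real ones would be trained on the data/residuals.
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

x = rng.normal(size=dim)
residual, codes = x, []
for C in codebooks:
    idx, residual = quantize(residual, C)
    codes.append(idx)

# Reconstruction: sum of the codewords chosen at each stage; with trained codebooks
# the leftover residual shrinks as more stages are added.
x_hat = sum(C[i] for C, i in zip(codebooks, codes))
print(codes, np.linalg.norm(x - x_hat))
```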
AI-generated audio is actually quite common: the voice assistants we use in everyday life rely on natural language processing to produce speech, and the Jukebox music system developed by OpenAI was also impressive. In the past, however, generating audio with AI mostly required transcribed, labeled, text-based training data prepared in advance, which costs enormous time and effort. Google, in its official blog post, states: "AudioLM is a pure audio language model, trained without any text; it learns from raw audio alone."
Compared with previous similar systems, the audio generated by AudioLM exhibits long-term consistency and high fidelity in aspects such as speech syntax and musical melody. On September 7, the related paper was posted on arXiv under the title "AudioLM: a Language Modeling Approach to Audio Generation". Just as music builds complex musical phrases out of individual notes, generating realistic audio requires modeling information represented at different scales, and creating well-structured, coherent audio sequences across all of these scales is a challenge. Reportedly, the audio language model AudioLM draws on advances in text-to-image models to generate audio.
Figure (from the Whisper paper): Overview of the approach. A sequence-to-sequence Transformer model is trained on many different speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
The current mainstream approach to speech recognition is to first perform large-scale unsupervised pre-training (e.g., Wav2Vec 2.0). For example, Wav2Vec collected 1,000,000 hours of unlabeled training data and used it to pre-train an encoder (with contrastive learning or self-training). This encoder learns a good encoding of speech, and for downstream tasks it can then be fine-tuned on a standard training set (a few dozen hours of data suffice), which works much better than training on the standard dataset alone.
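For reference, this is roughly what the pre-train-then-fine-tune paradigm looks like in practice; a hedged sketch using the wav2vec 2.0 ASR bundle shipped with torchaudio (the audio path is a placeholder), with a simple greedy CTC decode:

```python
import torch
import torchaudio

# A wav2vec 2.0 encoder already fine-tuned for ASR on 960 h of LibriSpeech.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()

# "speech.wav" is a placeholder; resample to the bundle's expected rate if needed.
waveform, sample_rate = torchaudio.load("speech.wav")
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)

# Greedy CTC decoding: best label per frame, collapse repeats, drop the blank token "-".
indices = torch.argmax(emissions[0], dim=-1).tolist()
tokens, prev = [], None
for i in indices:
    if i != prev and labels[i] != "-":
        tokens.append(labels[i])
    prev = i
print("".join(tokens).replace("|", " ").strip())
```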
These pre-trained speech encoders learn a high-quality representation of speech, but an encoder trained without supervision still needs a decoder, and fine-tuning it requires labeled data. Fine-tuning is a complicated process, and it would be better if it were not needed at all; that is exactly what this paper sets out to do. Moreover, previous work lacked a good decoder, which is a major shortcoming: a speech recognition system should work "out of the box", that is, be usable as-is.
The data is the core contribution of this paper. Because there is enough data and the model is strong enough, the model predicts the raw text directly, without any standardization, so its output is the final recognition result and no inverse text normalization is needed as post-processing. Text normalization includes operations such as lowercasing all words, expanding all abbreviations, and removing all punctuation; inverse text normalization is the reverse of these operations. Whisper needs none of this, because the data is large enough to cover all the cases.
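A hedged usage sketch of this "out of the box" behavior with the openai-whisper package (the model size and audio path below are placeholders): transcription is a single call, and the returned text is already the final, unnormalized output.

```python
import whisper

# Load a pre-trained Whisper checkpoint ("base" is just one of the released sizes).
model = whisper.load_model("base")

# Transcribe a local file ("speech.mp3" is a placeholder path); no fine-tuning and no
# inverse text normalization step is needed -- result["text"] is the final output,
# with casing and punctuation already in place.
result = model.transcribe("speech.mp3")
print(result["text"])
```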