A large-scale and high-quality dataset of annotated musical notes.
Download: https://magenta.tensorflow.org/datasets/nsynth#files
Motivation
Recent breakthroughs in generative modeling of images have been predicated on the availability of high-quality and large-scale datasets such as MNIST, CIFAR and ImageNet. We recognized the need for an audio dataset that was as approachable as those in the image domain.
Audio signals found in the wild contain multi-scale dependencies that prove particularly difficult to model, leading many previous efforts at data-driven audio synthesis to focus on more constrained domains such as texture synthesis or training small parametric models.
We encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning. We also view NSynth as a building block for future datasets and envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.
Description
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four-second, monophonic 16 kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
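As a rough illustration of the sampling grid described above, the following sketch enumerates the pitch and velocity combinations and the number of samples in one note. The constant and function names are hypothetical and chosen only for this example; the numbers come from the description.

```python
from itertools import product

# Values taken from the description above; the names are illustrative only.
SAMPLE_RATE_HZ = 16_000               # 16 kHz audio
NOTE_SECONDS = 4                      # 3 s held + 1 s decay
PITCHES = range(21, 109)              # MIDI pitches 21..108 inclusive
VELOCITIES = (25, 50, 75, 100, 127)   # the five rendered velocities


def note_grid():
    """Return every (pitch, velocity) pair an instrument could be rendered at."""
    return list(product(PITCHES, VELOCITIES))


grid = note_grid()
samples_per_note = SAMPLE_RATE_HZ * NOTE_SECONDS

print(len(PITCHES))        # number of pitches in the piano range
print(len(grid))           # upper bound on notes per instrument
print(samples_per_note)    # audio samples in one four-second note
```

The grid size is only an upper bound per instrument, since not every instrument can produce every pitch.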
Some instruments are not capable of producing all 88 pitches in this range, resulting in an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.
We also annotated each of the notes with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:
- Source: The method of sound production for the note’s instrument. This can be one of “acoustic” or “electronic” for instruments that were recorded from acoustic or electronic instruments, respectively, or “synthetic” for synthesized instruments. See their frequencies below.
- Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. See the complete list and their frequencies below.
- Qualities: Sonic qualities of the note. See the quality descriptions and their co-occurrences below. Each note is annotated with zero or more qualities.
Format
Files
The NSynth dataset can be downloaded in two formats:
- TFRecord files of serialized TensorFlow Example protocol buffers with one Example proto per note.
- JSON files containing non-audio features alongside 16-bit PCM WAV audio files.
The full dataset is split into three sets:
- Train [tfrecord | json/wav]: A training set with 289,205 examples. Instruments do not overlap with valid or test.
- Valid [tfrecord | json/wav]: A validation set with 12,678 examples. Instruments do not overlap with train.
- Test [tfrecord | json/wav]: A test set with 4,096 examples. Instruments do not overlap with train.
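The split sizes listed above can be checked against the total note count with a few lines of arithmetic. This is a minimal sketch; the dictionary and function names are hypothetical.

```python
# Split sizes and total as stated in the text; names are illustrative only.
SPLIT_SIZES = {"train": 289_205, "valid": 12_678, "test": 4_096}
TOTAL_NOTES = 305_979


def split_fractions(sizes):
    """Return each split's share of all examples as a fraction of 1."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


fractions = split_fractions(SPLIT_SIZES)

# The three splits together account for every note in the dataset.
print(sum(SPLIT_SIZES.values()) == TOTAL_NOTES)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```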
Below we detail how the note features are encoded in the Example protocol buffers and JSON files.
Example Features
Each Example contains the following features.
Feature | Type | Description |
---|---|---|
note | int64 | A unique integer identifier for the note. |
note_str | bytes | A unique string identifier for the note in the format <instrument_str>-<pitch>-<velocity> . |
instrument | int64 | A unique, sequential identifier for the instrument the note was synthesized from. |
instrument_str | bytes | A unique string identifier for the instrument this note was synthesized from in the format <instrument_family_str>-<instrument_production_str>-<instrument_name> . |
pitch | int64 | The 0-based MIDI pitch in the range [0, 127]. |
velocity | int64 | The 0-based MIDI velocity in the range [0, 127]. |
sample_rate | int64 | The samples per second for the audio feature. |
audio* | [float] | A list of audio samples represented as floating point values in the range [-1,1]. |
qualities | [int64] | A binary vector representing which sonic qualities are present in this note. |
qualities_str | [bytes] | A list of IDs of the qualities present in this note, selected from the sonic qualities list. |
instrument_family | int64 | The index of the instrument family this instrument is a member of. |
instrument_family_str | bytes | The ID of the instrument family this instrument is a member of. |
instrument_source | int64 | The index of the sonic source for this instrument. |
instrument_source_str | bytes | The ID of the sonic source for this instrument. |
* Note: the “audio” feature is omitted from the JSON-encoded examples since the audio data is stored separately in WAV files keyed by the “note_str”.
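For the JSON/WAV format, the sketch below shows one way to split a “note_str” into its parts and to load the matching WAV file as floating point samples, using only the Python standard library. Several things here are assumptions rather than facts from the description: the helper names, the file layout `<wav_dir>/<note_str>.wav`, single-channel files, and scaling 16-bit PCM by 32768 to reach the [-1, 1] range.

```python
import array
import sys
import wave
from pathlib import Path


def parse_note_str(note_str):
    """Split '<instrument_str>-<pitch>-<velocity>' into its three parts.

    Splitting from the right keeps any hyphens inside instrument_str intact.
    """
    instrument_str, pitch, velocity = note_str.rsplit("-", 2)
    return instrument_str, int(pitch), int(velocity)


def load_note_audio(wav_dir, note_str):
    """Load one note's audio as (sample_rate, list of floats in [-1, 1]).

    Assumption: the file is stored as <wav_dir>/<note_str>.wav.
    """
    path = Path(wav_dir) / f"{note_str}.wav"
    with wave.open(str(path), "rb") as wav:
        # Assumption: 16-bit PCM with a single channel.
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit single-channel PCM audio")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    pcm = array.array("h")      # signed 16-bit integers
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()          # WAV data is little-endian

    # Assumption: divide by 2**15 to map int16 values into [-1, 1).
    return sample_rate, [sample / 32768.0 for sample in pcm]
```

A call such as `load_note_audio("audio", some_note_str)` would return the sample rate and the decoded samples, where `"audio"` stands for whatever directory holds the extracted WAV files.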