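Source: https://github.com/RevoSpeechTech/speech-datasets-collection
Downloading from OpenSLR:
1) Switch to a mirror in mainland China
For example, the default run.sh for aishell points to www.openslr.org/resources/33. Change it to the mainland China mirror, http://openslr.magicdatatech.com/resources/33.
The other resources are listed at: http://openslr.magicdatatech.com/resources.php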
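The URL change can also be scripted. The following is a minimal sketch, assuming the recipe script contains the OpenSLR host literally (with or without a scheme or a `www.` prefix). The function names `to_mirror_url` and `patch_recipe` are hypothetical and not part of any recipe.

```shell
# Hypothetical helper: rewrite a single OpenSLR URL so that it points at the
# mirror. Only the host part changes; the resource path is kept as-is.
to_mirror_url() {
  printf '%s\n' "$1" |
    sed -E 's#^(https?://)?(www\.)?openslr\.org/#http://openslr.magicdatatech.com/#'
}

# Hypothetical helper: patch a recipe script in place.
# A backup of the original file is kept next to it with a .bak suffix.
patch_recipe() {
  sed -i.bak -E \
    's#(https?://)?(www\.)?openslr\.org/resources#http://openslr.magicdatatech.com/resources#g' \
    "$1"
}
```

For instance, `patch_recipe run.sh` would rewrite every matching URL in run.sh and leave the original in run.sh.bak. Whether the mirror carries a given resource is not checked here.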
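If downloads with wget are slow, the following options can help:
1. Use multiple connections
wget downloads over a single connection by default. A tool such as aria2 supports multi-connection downloads, which can speed things up considerably. Install aria2 with:
sudo apt install aria2 # For Ubuntu/Debian
brew install aria2 # For macOS
Then download the file with aria2:
aria2c -x 16 -s 16 <URL>
-x 16 allows up to 16 connections per server, and -s 16 splits the download into 16 parts that are fetched in parallel.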
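2. Limit the download rate with --limit-rate
This does not speed up the download directly, but when the speed is unstable, a reasonable rate limit can smooth out bandwidth fluctuations. Add the --limit-rate option to the command:
wget --limit-rate=1m <URL>
This caps the download speed at 1 MB per second.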
3. Resume interrupted downloads
If a download is interrupted, pass -c (or --continue) to resume from where it stopped:
wget -c <URL>
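The multi-connection and resume options can be combined in one small wrapper. This is a sketch under a few assumptions: the function name `download` and the `DRY_RUN` environment variable are hypothetical, and aria2c's own `-c` option is used for resuming. The wrapper uses aria2c when it is installed and falls back to `wget -c` otherwise.

```shell
# Hypothetical wrapper: download a URL with resume enabled.
# Usage: download <URL> [connections]   (connections defaults to 16)
# The body runs in a subshell so its variables do not leak to the caller.
download() (
  url="$1"
  conns="${2:-16}"
  if command -v aria2c >/dev/null 2>&1; then
    # aria2c is available: multiple connections plus resume (-c).
    set -- aria2c -c -x "$conns" -s "$conns" "$url"
  else
    # Fallback: single connection with resume.
    set -- wget -c "$url"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    # Dry run: print the command instead of executing it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
)
```

Running `DRY_RUN=1 download <URL>` prints the command that would be used, which is a cheap way to see which tool was picked before starting a large download.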
This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition).
Over 110 speech datasets are collected in this repository, and more than 70 datasets can be downloaded directly without further application or registration.
Notice:
- This repository does not list the license of each dataset. In general, these datasets can be used for research purposes. Check that the license is suitable before any commercial use.
- Some small-scale speech corpora are omitted here for brevity.
1. Data Overview
Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
---|---|---|---|---|
download directly | supervised | 199k + | 2110 + | 34k + |
download directly | unsupervised | 530k + | 1360 + | 68k + |
download directly | total | 729k + | 3470 + | 102k + |
need application | supervised | 53k + | 16740 + | 50k + |
need application | unsupervised | 60k + | 12400 + | 57k + |
need application | total | 113k + | 29140 + | 107k + |
total | supervised | 252k + | 18850 + | 84k + |
total | unsupervised | 590k + | 13760 + | 125k + |
total | total | 842k + | 32610 + | 209k + |
- Mandarin here includes Mandarin-English code-switching (CS) corpora.
- Sup means supervised speech corpus with high-quality transcription.
- Unsup means unsupervised or weakly-supervised speech corpus.
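The "total" rows and the "total" block of the overview table are plain sums of the rows they summarize. The sketch below checks a few of those cells with shell arithmetic. The helper name `check_sum` is hypothetical, and the figures are copied from the table as lower bounds (the table marks them with a trailing "+").

```shell
# Check one "total" cell of the overview table: does a + b equal the total?
check_sum() {
  # usage: check_sum <label> <a> <b> <expected_total>
  if [ "$(( $2 + $3 ))" -eq "$4" ]; then
    echo "$1: ok"
  else
    echo "$1: mismatch ($2 + $3 != $4)"
  fi
}

# Supervised + unsupervised = total, for directly downloadable data.
check_sum "direct, all languages (k hours)" 199 530 729
check_sum "direct, Mandarin (hours)" 2110 1360 3470
check_sum "direct, English (k hours)" 34 68 102

# Direct download + application = overall total.
check_sum "overall, all languages (k hours)" 729 113 842
check_sum "overall, Mandarin (hours)" 3470 29140 32610
check_sum "overall, English (k hours)" 102 107 209
```

The same helper can be pointed at the remaining cells when the table is updated.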
2. List of ASR corpora
a. Datasets that can be downloaded directly
id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k + |
6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
7 | ST-CMDS | Mandarin | Commands | – | [dataset] | 100 |
8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
11 | aishell-eval | Mandarin | Misc | – | [dataset] | 80 + |
12 | Primewords | Mandarin | Recording | – | [dataset] | 100 |
13 | aidatatang_200zh | Mandarin | Recording | – | [dataset] | 200 |
14 | MagicData | Mandarin | Recording | – | [dataset] | 755 |
15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
16 | Heavy Accent Corpus | Mandarin | Conversational | – | [dataset] | 58 + |
17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
20 | The People’s Speech | English | Misc | [paper] | [dataset] | 30k + |
21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760 + |
22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k) unsup(400k) |
23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k) unsup(5k) |
25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200) unsup(700) |
26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | – | [dataset] | weaksup(15.7k) |
28 | open_stt | Russian | Misc | – | [dataset] | 20k + |
29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10 + |
30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200 + |
31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000 + |
32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000 + |
33 | M-AILABS | Multilingual | Reading | – | [dataset] | 1000 |
34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100) unsup(1k) |
36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600 +) |
37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
38 | Voxforge | English | Recording | – | [dataset] | 130 |
39 | Tatoeba | English | Recording | – | [dataset] | 200 |
40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k +) |
41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
43 | RuLibrispeech | Russian | Reading | – | [dataset] | 98 |
44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
45 | MUCS 2021 task1 | Multilingual | Misc | – | [dataset] | 300 |
46 | MUCS 2021 task2 | Multilingual | Misc | – | [dataset] | 150 |
47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140 + |
48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150 + |
50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
54 | CLARIN Spoken Corpora | Czech | Recording | – | [dataset] | 1120 + |
55 | Czech Parliament Plenary | Czech | Recording | – | [dataset] | 444 |
56 | (Youtube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k + |
57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56 + |
58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130 + |
59 | Indonesian Unsup | Indonesian | Misc | – | [dataset] | unsup(3000 +) |
60 | Librivox-Spanish | Spanish | Recording | – | [dataset] | 120 |
61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100 + |
63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
65 | NST-Norwegian | Norwegian | Recording | – | [dataset] | 540 |
66 | NST-Danish | Danish | Recording | – | [dataset] | 500 + |
67 | NST-Swedish | Swedish | Recording | – | [dataset] | 300 + |
68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8 + |
70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100 + |
71 | UserLibri | English | Reading | [paper] | [dataset] | – |
72 | Ukrainian Speech | Ukrainian | Misc | – | [dataset] | 1300 + |
73 | UCLA-ASR-corpus | Multilingual | Misc | – | [dataset] | unsup(15k) sup(9k) |
74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610) unsup(1038) |
b. Datasets that can be downloaded after application
id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k) weaksup(2.4k) unsup(10k) |
3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
4 | aidatatang_1505zh | Mandarin | Recording | – | [dataset] | 1505 |
5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k) unsup(23k) |
7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
8 | AESRC 2020 | English (accented) | Misc | [paper] | [dataset] | 160 |
9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000 + |
10 | TAL_CSASR | Mandarin-English CS | Lectures | – | [dataset] | 587 |
11 | ASRU 2019 ASR | Mandarin-English CS | Reading | – | [dataset] | 700 + |
12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
13 | Fearless Steps | English | Misc | – | [dataset] | unsup(19k) |
14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800 + |
15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
17 | RTVE database | Spanish | TV | [paper] | [dataset] | 800 + |
18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000 + |
20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000 + |
21 | MyST Children’s Speech | English | Recording | – | [dataset] | 393 |
22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20 + |
23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332 + |
24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220 + |
25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470 + |
26 | LRS3-Lang | Multilingual | Audio-Visual | – | [dataset] | 1300 + |
27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000 + |
28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000 +) |
29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200 + |
30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310) unsup(600) |
31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73 + |
32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600) unsup(2000) |
33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30 + |
35 | Hindi-Tamil-English | Multilingual | Misc | – | [dataset] | 690 |
36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500 + |
37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |