Speech Datasets Collection: A Summary of Speech Datasets

Source: https://github.com/RevoSpeechTech/speech-datasets-collection

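Downloading from OpenSLR:

1) Switch to the mirror in mainland China

For example, for aishell the default run.sh points to www.openslr.org/resources/33; change it to the domestic mirror, http://openslr.magicdatatech.com/resources/33

The other resource directories are listed at: http://openslr.magicdatatech.com/resources.php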
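The URL change can also be scripted. The sketch below is only an illustration: the helper names `to_mirror_url` and `patch_recipe` are hypothetical, and it assumes the mirror keeps the same `/resources/<id>` layout as the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they are not part of any recipe or of OpenSLR itself.

# Print the mirror equivalent of an OpenSLR URL (scheme and "www." are optional).
to_mirror_url() {
  printf '%s\n' "$1" |
    sed -E 's#^(https?://)?(www\.)?openslr\.org/#http://openslr.magicdatatech.com/#'
}

# Rewrite every OpenSLR resource URL in a script in place,
# keeping the original file as <file>.bak.
patch_recipe() {
  sed -i.bak -E \
    's#(https?://)?(www\.)?openslr\.org/resources#http://openslr.magicdatatech.com/resources#g' \
    "$1"
}
```

For instance, `patch_recipe run.sh` would rewrite the aishell recipe, and `to_mirror_url www.openslr.org/resources/33` prints the mirror address quoted above.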

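If downloads with wget are slow, the following methods can help:

1. Use multiple connections

By default wget downloads over a single connection. A tool such as aria2 supports multi-connection downloads, which can speed things up significantly. aria2 can be installed with: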

sudo apt install aria2  # For Ubuntu/Debian
brew install aria2  # For macOS

Then download the file with aria2:

aria2c -x 16 -s 16 <URL>

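-x 16 allows up to 16 connections to the server, and -s 16 splits the download into 16 parts.

2. Use --limit-rate to cap the download speed

This does not speed up the download directly, but when the speed is unstable, a reasonable rate limit can keep bandwidth fluctuations from affecting it. Add the --limit-rate option to the command: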

wget --limit-rate=1m <URL>

This caps the download speed at 1 MB per second.

3. Enable resuming

If a download is interrupted, the -c (--continue) option resumes it from where it stopped:

wget -c <URL>
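The options above can be combined in one small wrapper. This is a minimal sketch, not a recommended setup: `fetch` is a hypothetical name, and it assumes at least one of aria2c or wget is installed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prefer aria2c when it is available, otherwise use wget.
fetch() {
  local url="$1"
  if command -v aria2c >/dev/null 2>&1; then
    # -c resumes a partial file; -x/-s use up to 16 connections and 16 parts.
    aria2c -c -x 16 -s 16 "$url"
  else
    # Single connection, but still resumable after an interruption.
    wget -c "$url"
  fi
}
```

Calling `fetch <URL>` again after an interruption continues the partial download with either tool.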

This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition).

Over 110 speech datasets are collected in this repository, and more than 70 datasets can be downloaded directly without further application or registration.

Notice:

  1. This repository does not list the license of each dataset. Using these datasets for research purposes is generally fine, but check that the license permits it before any commercial use.
  2. For brevity, some small-scale speech corpora are omitted.

1. Data Overview

| Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
| --- | --- | --- | --- | --- |
| download directly | supervised | 199k+ | 2110+ | 34k+ |
| download directly | unsupervised | 530k+ | 1360+ | 68k+ |
| download directly | total | 729k+ | 3470+ | 102k+ |
| need application | supervised | 53k+ | 16740+ | 50k+ |
| need application | unsupervised | 60k+ | 12400+ | 57k+ |
| need application | total | 113k+ | 29140+ | 107k+ |
| total | supervised | 252k+ | 18850+ | 84k+ |
| total | unsupervised | 590k+ | 13760+ | 125k+ |
| total | total | 842k+ | 32610+ | 209k+ |

  • Mandarin here includes Mandarin-English code-switching (CS) corpora.
  • Sup means supervised speech corpus with high-quality transcription.
  • Unsup means unsupervised or weakly-supervised speech corpus.
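The "total" rows in the overview are plain sums of the supervised and unsupervised rows. As a quick check, the sketch below recomputes the "All Languages" column, with the figures (in thousands of hours) copied from the table; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Hours in thousands, copied from the "All Languages (Hours)" column.
direct_sup=199; direct_unsup=530   # download directly
apply_sup=53;   apply_unsup=60     # need application

direct_total=$(( direct_sup + direct_unsup ))
apply_total=$(( apply_sup + apply_unsup ))
grand_total=$(( direct_total + apply_total ))

echo "download directly: ${direct_total}k+"
echo "need application:  ${apply_total}k+"
echo "total:             ${grand_total}k+"
```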

2. List of ASR corpora

a. Datasets that can be downloaded directly

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
| 2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
| 3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
| 4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
| 5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k+ |
| 6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
| 7 | ST-CMDS | Mandarin | Commands | | [dataset] | 100 |
| 8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
| 9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
| 10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 11 | aishell-eval | Mandarin | Misc | | [dataset] | 80+ |
| 12 | Primewords | Mandarin | Recording | | [dataset] | 100 |
| 13 | aidatatang_200zh | Mandarin | Recording | | [dataset] | 200 |
| 14 | MagicData | Mandarin | Recording | | [dataset] | 755 |
| 15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
| 16 | Heavy Accent Corpus | Mandarin | Conversational | | [dataset] | 58+ |
| 17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
| 19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
| 20 | The People’s Speech | English | Misc | [paper] | [dataset] | 30k+ |
| 21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760+ |
| 22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k), unsup(400k) |
| 23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
| 24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k), unsup(5k) |
| 25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200), unsup(700) |
| 26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
| 27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | | [dataset] | weaksup(15.7k) |
| 28 | open_stt | Russian | Misc | | [dataset] | 20k+ |
| 29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10+ |
| 30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200+ |
| 31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000+ |
| 32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000+ |
| 33 | M-AILABS | Multilingual | Reading | | [dataset] | 1000 |
| 34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
| 35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100), unsup(1k) |
| 36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600+) |
| 37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
| 38 | Voxforge | English | Recording | | [dataset] | 130 |
| 39 | Tatoeba | English | Recording | | [dataset] | 200 |
| 40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k+) |
| 41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
| 42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
| 43 | RuLibrispeech | Russian | Reading | | [dataset] | 98 |
| 44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
| 45 | MUCS 2021 task1 | Multilingual | Misc | | [dataset] | 300 |
| 46 | MUCS 2021 task2 | Multilingual | Misc | | [dataset] | 150 |
| 47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140+ |
| 48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
| 49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150+ |
| 50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
| 51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
| 52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
| 53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
| 54 | CLARIN Spoken Corpora | Czech | Recording | | [dataset] | 1120+ |
| 55 | Czech Parliament Plenary | Czech | Recording | | [dataset] | 444 |
| 56 | (YouTube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k+ |
| 57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56+ |
| 58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130+ |
| 59 | Indonesian Unsup | Indonesian | Misc | | [dataset] | unsup(3000+) |
| 60 | Librivox-Spanish | Spanish | Recording | | [dataset] | 120 |
| 61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
| 62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100+ |
| 63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
| 64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
| 65 | NST-Norwegian | Norwegian | Recording | | [dataset] | 540 |
| 66 | NST-Danish | Danish | Recording | | [dataset] | 500+ |
| 67 | NST-Swedish | Swedish | Recording | | [dataset] | 300+ |
| 68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
| 69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8+ |
| 70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100+ |
| 71 | UserLibri | English | Reading | [paper] | [dataset] | |
| 72 | Ukrainian Speech | Ukrainian | Misc | | [dataset] | 1300+ |
| 73 | UCLA-ASR-corpus | Multilingual | Misc | | [dataset] | unsup(15k), sup(9k) |
| 74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
| 75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610), unsup(1038) |

b. Datasets that can be downloaded after application

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
| 2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k), weaksup(2.4k), unsup(10k) |
| 3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
| 4 | aidatatang_1505zh | Mandarin | Recording | | [dataset] | 1505 |
| 5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
| 6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k), unsup(23k) |
| 7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
| 8 | AESRC 2020 | English (accented) | Misc | [paper] | [dataset] | 160 |
| 9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000+ |
| 10 | TAL_CSASR | Mandarin-English CS | Lectures | | [dataset] | 587 |
| 11 | ASRU 2019 ASR | Mandarin-English CS | Reading | | [dataset] | 700+ |
| 12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
| 13 | Fearless Steps | English | Misc | | [dataset] | unsup(19k) |
| 14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800+ |
| 15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
| 16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
| 17 | RVTE database | Spanish | TV | [paper] | [dataset] | 800+ |
| 18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
| 19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000+ |
| 20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000+ |
| 21 | MyST Children’s Speech | English | Recording | | [dataset] | 393 |
| 22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20+ |
| 23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332+ |
| 24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220+ |
| 25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470+ |
| 26 | LRS3-Lang | Multilingual | Audio-Visual | | [dataset] | 1300+ |
| 27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000+ |
| 28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000+) |
| 29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200+ |
| 30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310), unsup(600) |
| 31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73+ |
| 32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600), unsup(2000) |
| 33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
| 34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30+ |
| 35 | Hindi-Tamil-English | Multilingual | Misc | | [dataset] | 690 |
| 36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500+ |
| 37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |

