HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
F5-TTS是一款基于流匹配的全非自回归文本到语音转换模型,由上海交通大学(Shanghai Jiao Tong University)、剑桥大学(University of Cambridge)、以及极氪汽车研究院(Geely Automobile Research Institute (Ningbo) Company Ltd.)的研究团队联合开发的。具有以下特点:
|-- openemilia_all.tar.gz (all .JSONL files are gzipped with directory structure in this file)
|-- EN (114 batches)
| |-- EN_B00000.jsonl
| |-- EN_B00000 (= EN_B00000.tar.gz)
| | |-- EN_B00000_S00000
| | | `-- mp3
| | | |-- EN_B00000_S00000_W000000.mp3
| | | `-- EN_B00000_S00000_W000001.mp3
| | |-- ...
| |-- ...
| |-- EN_B00113.jsonl
| `-- EN_B00113
|-- ZH (92 batches)
|-- DE (9 batches)
|-- FR (10 batches)
|-- JA (7 batches)
|-- KO (4 batches)
JSONL 文件示例:
{"id": "EN_B00000_S00000_W000000", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000000.mp3", "text": " You can help my mother and you- No. You didn't leave a bad situation back home to get caught up in another one here. What happened to you, Los Angeles?", "duration": 6.264, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.2927}
{"id": "EN_B00000_S00000_W000001", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000001.mp3", "text": " Honda's gone, 20 squads done. X is gonna split us up and put us on different squads. The team's come and go, but 20 squad, can't believe it's ending.", "duration": 8.031, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.0442}