This is the official repository of the ACM Multimedia 2024 paper "SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description".
For details of the pipeline and dataset, please refer to our paper and demo page.
Language | Speech Corpus | #Duration | #Clips |
---|---|---|---|
ZH | Zhvoice | 799.68h | 1,020,427 |
ZH | AISHELL-3 | 63.70h | 63,011 |
EN | GigaSpeech-M | 739.91h | 670,070 |
EN | LibriTTS-R | 548.88h | 352,265 |
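
As a quick sanity check on the statistics above, the following minimal sketch derives the overall totals and the mean clip length per corpus. The figures are copied from the table; the names `CORPORA` and `mean_clip_seconds` are illustrative and not part of the released code.

```python
# Figures copied from the table above: (language, corpus, hours, clips).
# CORPORA and mean_clip_seconds are hypothetical names used only for this sketch.
CORPORA = [
    ("ZH", "Zhvoice", 799.68, 1_020_427),
    ("ZH", "AISHELL-3", 63.70, 63_011),
    ("EN", "GigaSpeech-M", 739.91, 670_070),
    ("EN", "LibriTTS-R", 548.88, 352_265),
]


def mean_clip_seconds(hours: float, clips: int) -> float:
    """Average clip length in seconds, given total hours and clip count."""
    return hours * 3600.0 / clips


total_hours = sum(hours for _, _, hours, _ in CORPORA)
total_clips = sum(clips for _, _, _, clips in CORPORA)

for lang, name, hours, clips in CORPORA:
    print(f"{lang} {name:<13} {mean_clip_seconds(hours, clips):.2f} s/clip")

print(
    f"Total: {total_hours:.2f} h, {total_clips:,} clips, "
    f"{mean_clip_seconds(total_hours, total_clips):.2f} s/clip on average"
)
```

The per-corpus averages are derived values, not figures reported by the authors, and may differ slightly from the true means because the durations in the table are rounded.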
Language | Description | Instruction |
---|---|---|
ZH | download | download |
EN | download | download |
Because we do not own the copyright of the original audio files, we provide access to our regenerated version, under certain conditions and terms, to researchers and educators who wish to use the audio files for non-commercial research and/or educational purposes. To apply for the AISHELL-3 and LibriTTS-R versions with fine-grained keyword emphasis, please fill out the EULA form at Emphasis-SpeechCraft-EULA.pdf and send the scanned form to jinzeyu23@mails.tsinghua.edu.cn. Once approved, you will be supplied with a download link.
Before applying, please review the emphasis examples provided here. We are actively working on improving methods for large-scale fine-grained data construction that align with human perception.
Language | Speech Corpus | #Duration | #Clips |
---|---|---|---|
ZH | AISHELL-3-stress | 50.59h | 63,243 |
EN | LibriTTS-R-stress | 148.78h | 74,496 |
To be released.
Please cite our paper if you find this work useful:
@inproceedings{jin2024speechcraft,
title={SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description},
author={Zeyu Jin and Jia Jia and Qixin Wang and Kehan Li and Shuoyi Zhou and Songtao Zhou and Xiaoyu Qin and Zhiyong Wu},
booktitle={ACM Multimedia 2024},
year={2024},
url={https://openreview.net/forum?id=rjAY1DGUWC}
}