Downstream datasets
CLUE (Chinese Language Understanding Evaluation) is a benchmark that contains classification, named entity recognition, and machine reading comprehension tasks. The CLUE datasets are distributed in JSON format. For the classification and named entity recognition datasets, we convert the JSON files to TSV so that TencentPretrain can load them directly. For machine reading comprehension, the original format is retained and the dataset pre-processing is included in the project.
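The JSON-to-TSV conversion mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual conversion script. It assumes the source file stores one JSON object per line. The field names `sentence` and `label` and the output header `label` / `text_a` are assumptions that should be checked against the dataset and the loader you use.

```python
import io
import json


def clean(value):
    # Collapse any whitespace run (including tabs and newlines) to a single
    # space so that a field can never break the one-row-per-line TSV layout.
    return " ".join(str(value).split())


def jsonl_to_tsv(lines, out, text_key="sentence", label_key="label"):
    """Write a header plus one TSV row per non-empty JSON line.

    text_key and label_key are assumed field names (hypothetical defaults).
    Returns the number of data rows written.
    """
    out.write("label\ttext_a\n")  # assumed column names
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        out.write(f"{clean(record[label_key])}\t{clean(record[text_key])}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the function.
    sample = [
        '{"label": "102", "sentence": "example headline one"}',
        '{"label": "104", "sentence": "example\\theadline two"}',
    ]
    buffer = io.StringIO()
    jsonl_to_tsv(sample, buffer)
    print(buffer.getvalue(), end="")
```

For sentence-pair tasks, the same approach would need a second text column. That extension is left out here to keep the sketch short.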
Classification:
Dataset | Link |
---|---|
TNEWS | https://share.weiyun.com/maExfIeO |
CSL | https://share.weiyun.com/LftIGlIT |
CMNLI | https://share.weiyun.com/hn3kTeKm |
OCNLI | https://share.weiyun.com/wkltwNwg |
AFQMC | https://share.weiyun.com/CdlEKMON |
IFLYTEK | https://share.weiyun.com/ldiLjnZJ |
CLUEWSC2020 | https://share.weiyun.com/RLL1ShBi |
Machine reading comprehension:
Dataset | Link |
---|---|
CMRC2018 | https://share.weiyun.com/KwAbnX60 |
C3 | https://share.weiyun.com/JDpgczdp |
ChID | https://share.weiyun.com/8KJE3NOz |
Named entity recognition:
Dataset | Link |
---|---|
CLUENER2020 | https://share.weiyun.com/smSMtLkn |
The first version of ERNIE provides five Chinese datasets, which are used to evaluate ERNIE's performance.
Dataset | Link |
---|---|
ChnSentiCorp | https://share.weiyun.com/BRujeOQT |
LCQMC | https://share.weiyun.com/5Fmf2SZ |
XNLI | https://share.weiyun.com/mcd8EApl |
MSRA-NER | https://share.weiyun.com/ua1Z5w2r |
NLPCC-DBQA | https://share.weiyun.com/5HJMbih |
Dataset | Link |
---|---|
SMP2020-EWECT | https://share.weiyun.com/uFGEhrWp |
SMP2019-ECISA | https://share.weiyun.com/MgHL8QSI |
CCF-BDCI2021-Corrupted_Short_Message_Reconstruction | https://share.weiyun.com/xHr6OkQw |
GLUE (General Language Understanding Evaluation) is an English benchmark that contains classification and regression tasks. We convert the GLUE datasets to TSV format so that TencentPretrain can load them directly.
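One way to sanity-check a downloaded TSV file before training is to read it by column name. The sketch below is a hedged illustration rather than TencentPretrain's own loader: it assumes the first row is a header, and the column names used in the example (`label`, `text_a`) are assumptions.

```python
import csv
import io


def read_tsv(handle):
    """Yield each data row as a dict keyed by the header row's column names."""
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"expected {len(header)} columns, got {len(row)}: {row}"
            )
        yield dict(zip(header, row))


if __name__ == "__main__":
    # Made-up in-memory sample standing in for a real file.
    sample = io.StringIO("label\ttext_a\n1\ta fine film\n0\ta dull film\n")
    for example in read_tsv(sample):
        print(example["label"], example["text_a"])
```

Raising on a column-count mismatch surfaces stray tabs in the text early, instead of letting a shifted column be silently read as a label.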
Dataset | Link |
---|---|
CIFAR10 | https://share.weiyun.com/s4oS4HWN |
CIFAR100 | https://share.weiyun.com/7UJfHbib |
This dataset is a 10-hour subset of LibriSpeech/train-clean-100. The original dataset can be downloaded here.
Dataset | Link |
---|---|
LibriSpeech_10h | https://share.weiyun.com/QRTYgFEK |