
Downstream datasets

zhezhaoa edited this page Oct 26, 2023 · 7 revisions

CLUE benchmark

CLUE (Chinese Language Understanding Evaluation) is a benchmark that contains classification, named entity recognition, and machine reading comprehension tasks. The datasets in CLUE are in JSON format. For the classification and named entity recognition datasets, we convert the JSON format to TSV format so that TencentPretrain can load them directly. For machine reading comprehension, the original format is retained and the dataset pre-processing is included in the project.
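The JSON-to-TSV conversion mentioned above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to produce the files linked below: it assumes the input is one JSON object per line, that the fields are named `label` and `sentence` (hypothetical defaults that vary by dataset), and that the target TSV has a header row with `label` and `text_a` columns.

```python
import json


def jsonl_to_tsv(jsonl_text, label_field="label", text_field="sentence"):
    """Convert JSON-lines records to a TSV string with a header row.

    The field names and the `label`/`text_a` header are assumptions
    made for this sketch; adjust them to the dataset at hand.
    """
    rows = ["label\ttext_a"]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Collapse tabs and newlines inside the text so they cannot
        # break the one-example-per-line, tab-separated layout.
        text = " ".join(str(record[text_field]).split())
        rows.append(f"{record[label_field]}\t{text}")
    return "\n".join(rows) + "\n"


# Two made-up records; the second contains a tab inside the text.
sample = (
    '{"label": "102", "sentence": "example one"}\n'
    '{"label": "103", "sentence": "example\\ttwo"}\n'
)
tsv = jsonl_to_tsv(sample)
print(tsv)
```

Sentence-pair datasets would need a second text column, which this sketch leaves out for brevity.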

Classification:

Dataset Link
TNEWS https://share.weiyun.com/maExfIeO
CSL https://share.weiyun.com/LftIGlIT
CMNLI https://share.weiyun.com/hn3kTeKm
OCNLI https://share.weiyun.com/wkltwNwg
AFQMC https://share.weiyun.com/CdlEKMON
IFLYTEK https://share.weiyun.com/ldiLjnZJ
CLUEWSC2020 https://share.weiyun.com/RLL1ShBi

Machine reading comprehension:

Dataset Link
CMRC2018 https://share.weiyun.com/KwAbnX60
C3 https://share.weiyun.com/JDpgczdp
ChID https://share.weiyun.com/8KJE3NOz

Named entity recognition:

Dataset Link
CLUENER2020 https://share.weiyun.com/smSMtLkn

Baidu ERNIE

The first version of ERNIE provides 5 Chinese datasets and uses them to evaluate the model's performance.

Dataset Link
ChnSentiCorp https://share.weiyun.com/BRujeOQT
LCQMC https://share.weiyun.com/5Fmf2SZ
XNLI https://share.weiyun.com/mcd8EApl
MSRA-NER https://share.weiyun.com/ua1Z5w2r
NLPCC-DBQA https://share.weiyun.com/5HJMbih

Competition dataset

Dataset Link
SMP2020-EWECT https://share.weiyun.com/uFGEhrWp
SMP2019-ECISA https://share.weiyun.com/MgHL8QSI
CCF-BDCI2021-Corrupted_Short_Message_Reconstruction https://share.weiyun.com/xHr6OkQw

GLUE benchmark

GLUE (General Language Understanding Evaluation) is an English benchmark that contains classification and regression tasks. We convert the datasets in GLUE to TSV format so that TencentPretrain can load them directly.
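To illustrate how a TSV file of this kind might be read for both task types, here is a small sketch. It is not TencentPretrain's own loading code: the column names `label`, `text_a`, and `text_b`, and the `regression` switch, are assumptions chosen for the example. The only difference between the two task types here is whether the label is parsed as an integer class id or as a floating-point score.

```python
import csv
import io


def read_tsv(tsv_text, regression=False):
    """Parse a header-first TSV string into (label, text_a, text_b) tuples.

    Column names are assumed for this sketch. `text_b` is None for
    single-sentence datasets that do not have that column.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    examples = []
    for row in reader:
        # Regression tasks carry a real-valued score, classification
        # tasks carry an integer class id.
        label = float(row["label"]) if regression else int(row["label"])
        examples.append((label, row["text_a"], row.get("text_b")))
    return examples


# Made-up rows: a sentence-pair regression example and a
# single-sentence classification example.
pair_tsv = "label\ttext_a\ttext_b\n4.2\tA man is playing.\tA man plays.\n"
single_tsv = "label\ttext_a\n1\tgood movie\n"

print(read_tsv(pair_tsv, regression=True))
print(read_tsv(single_tsv))
```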

Dataset Link
CoLA https://share.weiyun.com/n5kPUmsr
SST-2 https://share.weiyun.com/48noHt6Y
MRPC https://share.weiyun.com/7nXAjpYo
STS-B https://share.weiyun.com/8DJUM18K
QQP https://share.weiyun.com/1k6IGbfj
MNLI https://share.weiyun.com/tzMoGpIe
QNLI https://share.weiyun.com/J7LQKCYY
RTE https://share.weiyun.com/EnGVoElX
WNLI https://share.weiyun.com/752vzwjP

Vision dataset

Dataset Link
CIFAR10 https://share.weiyun.com/s4oS4HWN
CIFAR100 https://share.weiyun.com/7UJfHbib

Audio dataset

The dataset is a 10h subset of LibriSpeech/train-clean-100. The original dataset can be downloaded here.

Dataset Link
LibriSpeech_10h https://share.weiyun.com/QRTYgFEK