Downstream datasets

CLUE benchmark

CLUE (Chinese Language Understanding Evaluation) is a benchmark that contains classification, named entity recognition, and machine reading comprehension tasks. The CLUE datasets are distributed in JSON format. For the classification and named entity recognition datasets, we convert the JSON format to TSV format so that TencentPretrain can load them directly. For machine reading comprehension, the original format is retained, and the dataset pre-processing code is included in the project.
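
The JSON-to-TSV conversion described above can be illustrated with a minimal sketch. It is not the project's actual conversion script. The function name `jsonl_to_tsv`, the input keys `sentence` and `label`, the output column names `label` and `text_a`, and the remapping of labels to consecutive integers are all assumptions made for illustration, and the sample records are invented.

```python
import io
import json


def jsonl_to_tsv(src, dst, text_key="sentence", label_key="label"):
    """Convert JSON-lines records into a two-column TSV with a header row.

    Hypothetical helper: the key names and the output columns are assumptions.
    Returns the mapping from each original label to the integer id written out.
    """
    label_ids = {}
    dst.write("label\ttext_a\n")
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assign consecutive integer ids to labels in order of first appearance.
        label_id = label_ids.setdefault(record[label_key], len(label_ids))
        # Collapse whitespace runs (including tabs and newlines) into single
        # spaces so the text cannot break the TSV layout.
        text = " ".join(str(record[text_key]).split())
        dst.write(f"{label_id}\t{text}\n")
    return label_ids


# Invented sample records, one JSON object per line.
sample = io.StringIO(
    '{"label": "102", "sentence": "example one"}\n'
    '{"label": "104", "sentence": "example\\ttwo"}\n'
    '{"label": "102", "sentence": "example three"}\n'
)
out = io.StringIO()
mapping = jsonl_to_tsv(sample, out)
print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened with `encoding="utf-8"`. Sentence-pair tasks would need a third column, which this sketch does not cover.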

Classification:

Dataset	Link
TNEWS	https://share.weiyun.com/maExfIeO
CSL	https://share.weiyun.com/LftIGlIT
CMNLI	https://share.weiyun.com/hn3kTeKm
OCNLI	https://share.weiyun.com/wkltwNwg
AFQMC	https://share.weiyun.com/CdlEKMON
IFLYTEK	https://share.weiyun.com/ldiLjnZJ
CLUEWSC2020	https://share.weiyun.com/RLL1ShBi

Machine reading comprehension:

Dataset	Link
CMRC2018	https://share.weiyun.com/KwAbnX60
C3	https://share.weiyun.com/JDpgczdp
ChID	https://share.weiyun.com/8KJE3NOz

Named entity recognition:

Dataset	Link
CLUENER2020	https://share.weiyun.com/smSMtLkn

Baidu ERNIE

The first version of ERNIE provides 5 Chinese datasets and uses them to evaluate ERNIE's performance.

Dataset	Link
ChnSentiCorp	https://share.weiyun.com/BRujeOQT
LCQMC	https://share.weiyun.com/5Fmf2SZ
XNLI	https://share.weiyun.com/mcd8EApl
MSRA-NER	https://share.weiyun.com/ua1Z5w2r
NLPCC-DBQA	https://share.weiyun.com/5HJMbih

Competition datasets

Dataset	Link
SMP2020-EWECT	https://share.weiyun.com/uFGEhrWp
SMP2019-ECISA	https://share.weiyun.com/MgHL8QSI
CCF-BDCI2021-Corrupted_Short_Message_Reconstruction	https://share.weiyun.com/xHr6OkQw

GLUE benchmark

GLUE (General Language Understanding Evaluation) is an English benchmark that contains classification and regression tasks. We convert the GLUE datasets to TSV format so that TencentPretrain can load them directly.

Dataset	Link
CoLA	https://share.weiyun.com/n5kPUmsr
SST-2	https://share.weiyun.com/48noHt6Y
MRPC	https://share.weiyun.com/7nXAjpYo
STS-B	https://share.weiyun.com/8DJUM18K
QQP	https://share.weiyun.com/1k6IGbfj
MNLI	https://share.weiyun.com/tzMoGpIe
QNLI	https://share.weiyun.com/J7LQKCYY
RTE	https://share.weiyun.com/EnGVoElX
WNLI	https://share.weiyun.com/752vzwjP

Vision datasets

Dataset	Link
CIFAR10	https://share.weiyun.com/s4oS4HWN
CIFAR100	https://share.weiyun.com/7UJfHbib

Audio dataset

The dataset is a 10-hour subset of LibriSpeech/train-clean-100. The original dataset can be downloaded here.

Dataset	Link
LibriSpeech_10h	https://share.weiyun.com/QRTYgFEK
