Universal Information Extraction Dataset Collection (Viscacha)

This is the repository for the Viscacha project, which aims to build a unified collection of general-purpose information extraction datasets.

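You are welcome to send us any information extraction dataset (or its source) that we have not yet collected. We will convert it to the unified format and merge it into the unified dataset through the instructions we have constructed.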

Data Collection

Languages:

EN: English
CN: Chinese
ML: Multiple languages

Tasks:

NER: Named Entity Recognition
RE: Relation Extraction
EE: Event Extraction

Dataset	Domain	Size	Language	Task	Source
DuIE2.0	Humanities	191K	CN	RE	https://www.luge.ai/#/luge/dataDetail?id=5
DuEE1.0	News	17K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=6
DuEE-fin	Finance	11.7K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=7
IREE	Finance	50K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=72
SanWen	Chinese literature	21K	CN	RE	https://github.com/thunlp/Chinese_NRE/tree/master/data/SanWen
BosonNER	General	10K	CN	NER	https://github.com/HuHsinpang/BosonNER-Pretreatment/tree/master/boson/data
MSRANER	General	48K	CN	NER	https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
FinRe	Finance	18K	CN	RE	https://github.com/thunlp/Chinese_NRE/tree/master/data/FinRE
SemEval-2010 Task 8	General	10K	EN	RE	https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_semeval.sh
TACRED	General	106K	EN	NER, RE	https://github.com/yuhaozhang/tacred-relation/tree/master/dataset/tacred
NYT10	General	694K	EN	RE	https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_nyt10.sh
DocRED	General	UNK	EN	RE	https://drive.google.com/drive/folders/1c5-0YwnoJx8NS6CV2f-NoTHR__BdkNqw
CLUENER2020	General	12K	CN	NER	https://www.cluebenchmarks.com/introduce.html
Title2Event	News	42K	CN	EE	https://open-event-hub.github.io/title2event/
BioRED	Biomedical	UNK	EN	RE	https://github.com/ncbi/BioRED
文娱NER-Youku (Entertainment NER, Youku)	Entertainment	10K	CN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/youku
CONLL2003	News	284K	EN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/conll2003
电商NER-Taobao (E-commerce NER, Taobao)	E-commerce	8K	CN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/ecommerce
财经NER-新浪财经 (Finance NER, Sina Finance)	Finance	5K	CN	NER	https://github.com/jiesutd/LatticeLSTM/tree/master/data
人民日报-2014 (People's Daily 2014)	News	286K	CN	NER	https://github.com/zjy-ucas/ChineseNER/tree/master/data
人民日报-1998 (People's Daily 1998)	News	28K	CN	NER	https://github.com/zjy-ucas/ChineseNER/tree/master/data
智慧教育开放知识数据集-数据结构 (Smart Education Open Knowledge Dataset, Data Structures)	Education	176K	CN	RE	https://blog.csdn.net/qq_36426650/article/details/87719204
智慧教育开放知识数据集-初中数学 (Smart Education Open Knowledge Dataset, Junior High School Mathematics)	Education	6K	CN	NER	https://blog.csdn.net/qq_36426650/article/details/87719204
智慧教育开放知识数据集-高中数学 (Smart Education Open Knowledge Dataset, Senior High School Mathematics)	Education	2K	CN	NER	https://blog.csdn.net/qq_36426650/article/details/87719204
军事装备试验鉴定-NER (Military Equipment Test and Evaluation NER)	Military	0.8K	CN	NER	https://github.com/hy-struggle/ccks_ner/tree/master/militray/PreModel_Encoder_CRF/data
CMeEE	Medical	23K	CN	NER	https://tianchi.aliyun.com/dataset/95414
CMeIE	Medical	22K	CN	RE	https://tianchi.aliyun.com/dataset/95414
银行借贷2021-NER (Bank Lending 2021 NER)	Finance	10K	CN	NER	https://www.heywhale.com/mw/dataset/617969ec768f3b0017862990/file
SKE 2019	General	210K	CN	RE	https://toscode.gitee.com/yiweilu/Entity-Relation-Extraction/tree/master/raw_data
任务对话2018-NER (Task-Oriented Dialogue 2018 NER)	General	21K	CN	NER	http://tcci.ccf.org.cn/conference/2018/taskdata.php#
CoNLL04	News	9K	EN	RE	http://lavis.cs.hs-rm.de/storage/spert/public/datasets/conll04/
OntoNotes 4.0	News	50K	CN	NER	https://www.datafountain.cn/competitions/510/datasets
CCIR2021-NER	News	15K	CN	NER	https://www.datafountain.cn/competitions/510
firefly-train-1.1M	General	50K	CN	NER	https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
IE INSTRUCTIONS	General	UNK	EN	NER, RE, EE	https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view
CCKS2017-NER	Healthcare	2K	CN	NER	https://www.biendata.xyz/competition/CCKS2017_1/
CCKS2018-NER	Healthcare	0.8K	CN	NER	https://www.biendata.xyz/competition/CCKS2018_1/
CCKS2019-NER	Healthcare	1.4K	CN	NER	https://www.biendata.xyz/competition/ccks_2019_1/
CCKS2020-NER	Healthcare	1.4K	CN	NER	https://www.biendata.xyz/competition/ccks_2020_2_1/
WeiBo	General	1.8K	CN	NER	https://github.com/hltcoe/golden-horse
MMC	Healthcare	3.5K	CN	NER	https://tianchi.aliyun.com/dataset/88836
Resume	Humanities	4.8K	CN	NER	https://github.com/jiesutd/LatticeLSTM/tree/master/ResumeNER
SanWen-NER	Chinese literature	28K	CN	NER	https://github.com/thunlp/Chinese_NRE/tree/master/data/SanWen
WanChuang	Healthcare	1.2K	CN	NER	https://tianchi.aliyun.com/competition/entrance/531827/introduction
GAIIC2022_task2	E-commerce	40K	CN	NER	https://www.heywhale.com/home/competition/620b34ed28270b0017b823ad/content/2
IMCS21_task1	Healthcare	98K	CN	NER	http://www.fudan-disc.com/sharedtask/imcs21/index.html
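
The size column in the table uses abbreviated counts such as 191K and 0.8K, and UNK where the size is not given. For sorting or filtering the list, these values could be turned into integers. The following is a minimal sketch, not part of this repository: the function name `parse_count` is hypothetical, and it assumes that K means thousands and that UNK means unknown.

```python
from typing import Optional


def parse_count(text: str) -> Optional[int]:
    """Convert an abbreviated count such as '191K' into an integer.

    Assumptions: a trailing 'K' means thousands, and 'UNK' marks an
    unknown size, which is returned as None.
    """
    value = text.strip().upper()
    if value == "UNK":
        return None
    if value.endswith("K"):
        # round() absorbs small floating-point errors from the multiplication.
        return round(float(value[:-1]) * 1000)
    return int(value)


if __name__ == "__main__":
    for raw in ["191K", "11.7K", "0.8K", "UNK"]:
        print(raw, "->", parse_count(raw))
```

Returning `None` for UNK keeps unknown sizes distinguishable from a real count of zero.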

Data Format

All data in our collection has been converted to the same format. Each sample is structured as follows:

# NER
{
    "sentence": string,
    "entities": {
        "name": string,
        "type": string,
        "pos": [
          int,
          int
        ]
    }
}

# RE
{
    "sentence": string,
    "relations": [
        {
            "head": {
                "name": string,
                "type": string,
                "pos": [int, int]
            },
            "type": string,
            "tail": {
                "name": string,
                "type": string,
                "pos": [int, int]
            }
        }
    ]
}

# EE
{
    "sentence": string,
    "events": [
        {
            "trigger": string,
            "type": string,
            "pos": [
                int
            ],
            "arguments": [
                {
                    "name": string,
                    "role": string,
                    "pos": [
                        int
                    ]
                },
                {
                    "name": string,
                    "role": string,
                    "pos": [
                        int
                    ]
                }
            ]
        }
    ]
}
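
The three formats above can be checked mechanically before a sample is used. The following is a minimal sketch of such a shape check, not code from this repository: the names `check_sample`, `_is_pos`, `_has_strings`, and `_is_mention` are hypothetical. It only verifies keys and value types. It makes no assumption about what the integers in `pos` mean, and for NER it accepts `entities` either as the single object shown above or as a list of such objects, which is an assumption.

```python
from typing import Any, Dict, List


def _is_pos(value: Any) -> bool:
    """True if value is a non-empty list of integers (offset semantics are not assumed)."""
    return isinstance(value, list) and len(value) > 0 and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    )


def _has_strings(obj: Any, keys: List[str]) -> bool:
    """True if obj is a dict whose listed keys all map to strings."""
    return isinstance(obj, dict) and all(isinstance(obj.get(k), str) for k in keys)


def _is_mention(obj: Any) -> bool:
    """True for an entity-like object with string name and type plus a pos list."""
    return _has_strings(obj, ["name", "type"]) and _is_pos(obj.get("pos"))


def check_sample(sample: Dict[str, Any], task: str) -> bool:
    """Return True if sample has the shape documented for task (NER, RE, or EE)."""
    if not isinstance(sample.get("sentence"), str):
        return False
    if task == "NER":
        entities = sample.get("entities")
        # Assumption: accept the single object shown in the format, or a list of them.
        if isinstance(entities, dict):
            entities = [entities]
        return isinstance(entities, list) and all(_is_mention(e) for e in entities)
    if task == "RE":
        relations = sample.get("relations")
        return isinstance(relations, list) and all(
            _has_strings(r, ["type"])
            and _is_mention(r.get("head"))
            and _is_mention(r.get("tail"))
            for r in relations
        )
    if task == "EE":
        events = sample.get("events")
        return isinstance(events, list) and all(
            _has_strings(ev, ["trigger", "type"])
            and _is_pos(ev.get("pos"))
            and isinstance(ev.get("arguments"), list)
            and all(
                _has_strings(a, ["name", "role"]) and _is_pos(a.get("pos"))
                for a in ev["arguments"]
            )
            for ev in events
        )
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    # Invented sample for illustration only; the label "PER" is not taken from the datasets.
    ner = {
        "sentence": "Alice works at Acme.",
        "entities": {"name": "Alice", "type": "PER", "pos": [0, 5]},
    }
    print(check_sample(ner, "NER"))
```

A check like this catches missing keys and wrong value types early, but it does not confirm that `pos` actually points at `name` inside `sentence`, because the offset convention is not stated here.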

Download

You can download all of the data we have converted to the unified format here.
