hccngu/Viscacha

Universal Information Extraction Dataset Collection (Viscacha)

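This is the repository for the Viscacha project, which aims to build a unified collection of general-purpose information extraction datasets.

Contributions of any information extraction dataset (or its source) that we have not yet collected are welcome. We will convert each one to a common format and merge it into the unified dataset using the instructions we have built.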

Data Collection

Languages:

  • EN: English
  • CN: Chinese
  • ML: Multiple languages

Tasks:

  • NER: Named Entity Recognition
  • RE: Relation Extraction
  • EE: Event Extraction

| Dataset | Domain | Count | Language | Task | Source |
|---|---|---|---|---|---|
| DuIE2.0 | Humanities | 191K | CN | RE | https://www.luge.ai/#/luge/dataDetail?id=5 |
| DuEE1.0 | News | 17K | CN | EE | https://www.luge.ai/#/luge/dataDetail?id=6 |
| DuEE-fin | Finance | 11.7K | CN | EE | https://www.luge.ai/#/luge/dataDetail?id=7 |
| IREE | Finance | 50K | CN | EE | https://www.luge.ai/#/luge/dataDetail?id=72 |
| SanWen | Chinese literature | 21K | CN | RE | https://github.com/thunlp/Chinese_NRE/tree/master/data/SanWen |
| BosonNER | General | 10K | CN | NER | https://github.com/HuHsinpang/BosonNER-Pretreatment/tree/master/boson/data |
| MSRANER | General | 48K | CN | NER | https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA |
| FinRe | Finance | 18K | CN | RE | https://github.com/thunlp/Chinese_NRE/tree/master/data/FinRE |
| SemEval-2010 Task 8 | General | 10K | EN | RE | https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_semeval.sh |
| TACRED | General | 106K | EN | NER, RE | https://github.com/yuhaozhang/tacred-relation/tree/master/dataset/tacred |
| NYT10 | General | 694K | EN | RE | https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_nyt10.sh |
| DocRED | General | UNK | EN | RE | https://drive.google.com/drive/folders/1c5-0YwnoJx8NS6CV2f-NoTHR__BdkNqw |
| CLUENER2020 | General | 12K | CN | NER | https://www.cluebenchmarks.com/introduce.html |
| Title2Event | News | 42K | CN | EE | https://open-event-hub.github.io/title2event/ |
| BioRED | Biomedical | UNK | EN | RE | https://github.com/ncbi/BioRED |
| Youku Entertainment NER (文娱NER-Youku) | Entertainment | 10K | CN | NER | https://github.com/allanj/ner_incomplete_annotation/tree/master/data/youku |
| CoNLL2003 | News | 284K | EN | NER | https://github.com/allanj/ner_incomplete_annotation/tree/master/data/conll2003 |
| Taobao E-commerce NER (电商NER-Taobao) | E-commerce | 8K | CN | NER | https://github.com/allanj/ner_incomplete_annotation/tree/master/data/ecommerce |
| Sina Finance NER (财经NER-新浪财经) | Finance | 5K | CN | NER | https://github.com/jiesutd/LatticeLSTM/tree/master/data |
| People's Daily 2014 (人民日报-2014) | News | 286K | CN | NER | https://github.com/zjy-ucas/ChineseNER/tree/master/data |
| People's Daily 1998 (人民日报-1998) | News | 28K | CN | NER | https://github.com/zjy-ucas/ChineseNER/tree/master/data |
| Smart Education Open Knowledge Dataset, Data Structures (智慧教育开放知识数据集-数据结构) | Education | 176K | CN | RE | https://blog.csdn.net/qq_36426650/article/details/87719204 |
| Smart Education Open Knowledge Dataset, Junior High School Mathematics (智慧教育开放知识数据集-初中数学) | Education | 6K | CN | NER | https://blog.csdn.net/qq_36426650/article/details/87719204 |
| Smart Education Open Knowledge Dataset, Senior High School Mathematics (智慧教育开放知识数据集-高中数学) | Education | 2K | CN | NER | https://blog.csdn.net/qq_36426650/article/details/87719204 |
| Military Equipment Test and Evaluation NER (军事装备试验鉴定-NER) | Military | 0.8K | CN | NER | https://github.com/hy-struggle/ccks_ner/tree/master/militray/PreModel_Encoder_CRF/data |
| CMeEE | Medicine | 23K | CN | NER | https://tianchi.aliyun.com/dataset/95414 |
| CMeIE | Medicine | 22K | CN | RE | https://tianchi.aliyun.com/dataset/95414 |
| Bank Lending 2021 NER (银行借贷2021-NER) | Finance | 10K | CN | NER | https://www.heywhale.com/mw/dataset/617969ec768f3b0017862990/file |
| SKE 2019 | General | 210K | CN | RE | https://toscode.gitee.com/yiweilu/Entity-Relation-Extraction/tree/master/raw_data |
| Task Dialogue 2018 NER (任务对话2018-NER) | General | 21K | CN | NER | http://tcci.ccf.org.cn/conference/2018/taskdata.php# |
| CoNLL04 | News | 9K | EN | RE | http://lavis.cs.hs-rm.de/storage/spert/public/datasets/conll04/ |
| OntoNotes 4.0 | News | 50K | CN | NER | https://www.datafountain.cn/competitions/510/datasets |
| CCIR2021-NER | News | 15K | CN | NER | https://www.datafountain.cn/competitions/510 |
| firefly-train-1.1M | General | 50K | CN | NER | https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M |
| IE INSTRUCTIONS | General | UNK | EN | NER, RE, EE | https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view |
| CCKS2017-NER | Healthcare | 2K | CN | NER | https://www.biendata.xyz/competition/CCKS2017_1/ |
| CCKS2018-NER | Healthcare | 0.8K | CN | NER | https://www.biendata.xyz/competition/CCKS2018_1/ |
| CCKS2019-NER | Healthcare | 1.4K | CN | NER | https://www.biendata.xyz/competition/ccks_2019_1/ |
| CCKS2020-NER | Healthcare | 1.4K | CN | NER | https://www.biendata.xyz/competition/ccks_2020_2_1/ |
| WeiBo | General | 1.8K | CN | NER | https://github.com/hltcoe/golden-horse |
| MMC | Healthcare | 3.5K | CN | NER | https://tianchi.aliyun.com/dataset/88836 |
| Resume | Humanities | 4.8K | CN | NER | https://github.com/jiesutd/LatticeLSTM/tree/master/ResumeNER |
| SanWen-NER | Chinese literature | 28K | CN | NER | https://github.com/thunlp/Chinese_NRE/tree/master/data/SanWen |
| WanChuang | Healthcare | 1.2K | CN | NER | https://tianchi.aliyun.com/competition/entrance/531827/introduction |
| GAIIC2022_task2 | E-commerce | 40K | CN | NER | https://www.heywhale.com/home/competition/620b34ed28270b0017b823ad/content/2 |
| IMCS21_task1 | Healthcare | 98K | CN | NER | http://www.fudan-disc.com/sharedtask/imcs21/index.html |

Data Format

All data in our collection has been converted to the same format. Each sample has the following structure:

```
# NER
{
    "sentence": string,
    "entities": {
        "name": string,
        "type": string,
        "pos": [int, int]
    }
}
```
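As a rough illustration of the NER format, the following Python sketch checks a sample against the fields listed above. It rests on two assumptions that the schema does not state: that `pos` is a half-open `[start, end)` character span into `sentence`, and that `entities` may hold either a single object (as shown) or a list of such objects. The helper names `ner_entities` and `check_ner_sample`, the example sentence, and the `PER` label are hypothetical.

```python
from typing import Any


def ner_entities(sample: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the entities of an NER sample as a list."""
    entities = sample.get("entities", [])
    # The schema shows one object; a list of objects is also accepted (assumption).
    return [entities] if isinstance(entities, dict) else list(entities)


def check_ner_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the sample; empty means it passed."""
    sentence = sample.get("sentence")
    if not isinstance(sentence, str):
        return ["sentence must be a string"]
    problems = []
    for i, entity in enumerate(ner_entities(sample)):
        pos = entity.get("pos")
        if not (isinstance(pos, list) and len(pos) == 2
                and all(isinstance(p, int) for p in pos)):
            problems.append(f"entity {i}: pos must be [int, int]")
            continue
        start, end = pos
        name = entity.get("name")
        # Assumption: pos is a [start, end) character span into the sentence.
        in_range = 0 <= start <= end <= len(sentence)
        if not in_range or sentence[start:end] != name:
            problems.append(f"entity {i}: pos {pos} does not match name {name!r}")
    return problems


# Made-up sample for illustration only; it is not taken from the collection.
sample = {
    "sentence": "Alice works at Acme.",
    "entities": {"name": "Alice", "type": "PER", "pos": [0, 5]},
}
print(check_ner_sample(sample))  # prints []
```

If the released data uses a different offset convention (for example an inclusive end index), only the span comparison needs to change.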

```
# RE
{
    "sentence": string,
    "relations": [
        {
            "head": {
                "name": string,
                "type": string,
                "pos": [int, int]
            },
            "type": string,
            "tail": {
                "name": string,
                "type": string,
                "pos": [int, int]
            }
        }
    ]
}
```
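A sample in the RE format can be flattened into (head, relation, tail) triples, which is a common intermediate form for training and evaluation. The sketch below is only an illustration: the function name `relation_triples`, the example sentence, and the labels `PER`, `LOC`, and `born_in` are hypothetical and not taken from the collection.

```python
from typing import Any


def relation_triples(sample: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Flatten an RE sample into (head name, relation type, tail name) triples."""
    return [
        (relation["head"]["name"], relation["type"], relation["tail"]["name"])
        for relation in sample.get("relations", [])
    ]


# Made-up sample for illustration only.
sample = {
    "sentence": "Marie Curie was born in Warsaw.",
    "relations": [
        {
            "head": {"name": "Marie Curie", "type": "PER", "pos": [0, 11]},
            "type": "born_in",
            "tail": {"name": "Warsaw", "type": "LOC", "pos": [24, 30]},
        }
    ],
}
print(relation_triples(sample))  # prints [('Marie Curie', 'born_in', 'Warsaw')]
```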

```
# EE
{
    "sentence": string,
    "events": [
        {
            "trigger": string,
            "type": string,
            "pos": [int],
            "arguments": [
                {
                    "name": string,
                    "role": string,
                    "pos": [int]
                },
                {
                    "name": string,
                    "role": string,
                    "pos": [int]
                }
            ]
        }
    ]
}
```
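Because each event nests its arguments, it can be convenient to flatten an EE sample into one row per (event, argument) pair. The following sketch assumes only the field names shown above. It copies `pos` through without interpreting it, since the schema does not say what the integers mean. The function name `event_rows` and the example sentence, event type, and roles are hypothetical.

```python
from typing import Any


def event_rows(sample: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an EE sample into one row per (event, argument) pair."""
    rows = []
    for event in sample.get("events", []):
        for argument in event.get("arguments", []):
            rows.append({
                "event_type": event["type"],
                "trigger": event["trigger"],
                "role": argument["role"],
                "argument": argument["name"],
                # Copied as-is; the meaning of pos is not specified in the schema.
                "argument_pos": argument.get("pos"),
            })
    return rows


# Made-up sample for illustration only.
sample = {
    "sentence": "Acme acquired Widget Corp on Monday.",
    "events": [
        {
            "trigger": "acquired",
            "type": "Acquisition",
            "pos": [5],
            "arguments": [
                {"name": "Acme", "role": "buyer", "pos": [0]},
                {"name": "Widget Corp", "role": "target", "pos": [14]},
            ],
        }
    ],
}
for row in event_rows(sample):
    print(row["event_type"], row["trigger"], row["role"], row["argument"])
```

Events without arguments produce no rows in this sketch; a caller that needs them could emit a row with an empty role instead.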

Download

You can download all of the data that we have converted to the unified format here.
