FlaCGEC

FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation

The PDF version of the paper is available at the following link: FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation

If you find our work helpful for your research, please cite our paper:

@inproceedings{Hanyue_Du_CIKM23,
author = {Du, Hanyue and Zhao, Yike and Tian, Qingyuan and Wang, Jiani and Wang, Lei and Lan, Yunshi and Lu, Xuesong},
title = {FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-Grained Linguistic Annotation},
year = {2023},
isbn = {9798400701245},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3583780.3615119},
doi = {10.1145/3583780.3615119},
booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
pages = {5321–5325},
numpages = {5},
keywords = {Chinese grammatical error correction, deep learning, fine-grained linguistic annotation},
location = {Birmingham, United Kingdom},
series = {CIKM '23}
}

Dataset Description

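Chinese grammatical error correction (CGEC), which aims to detect and correct all grammatical errors in a sentence, has attracted growing attention from researchers. Although several CGEC datasets have been developed to support this research, they still cannot expose the deep linguistic topology of grammatical errors. To address this limitation, this repository provides a new CGEC dataset, FlaCGEC, with fine-grained linguistic annotation covering 78 instantiated grammar points and 3 edit types. The overall statistics of the data are shown in the table below.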

| Properties | Train | Dev | Test |
| --- | --- | --- | --- |
| Sentences | 10804 | 1334 | 1325 |
| Average source sentence length | 35.09 | 34.76 | 35.83 |
| Average target sentence length | 35.59 | 35.29 | 36.34 |
| Edits per sentence | 1.72 | 1.69 | 1.71 |
| Grammar points | 77 | 69 | 72 |

The dataset can be downloaded from the data folder of this repository: https://github.com/hyDududu/FlaCGEC/tree/main/data

Data Structure

The FlaCGEC dataset is stored as JSON files with the following structure:

[Image: FlaCGEC JSON data structure]
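Since the structure is documented as an image, one way to check it locally is to load a file and list the keys of its first record. The sketch below is only an illustration: the helper name `describe_json` is hypothetical, it assumes nothing about the field names, and it treats a non-list top level as a single record.

```python
import json
from pathlib import Path


def describe_json(path):
    """Summarise a JSON file without assuming any particular field names."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Treat a top-level list as the record collection; anything else is
    # wrapped so it can be inspected the same way.
    records = data if isinstance(data, list) else [data]
    first = records[0] if records else None
    keys = sorted(first.keys()) if isinstance(first, dict) else []
    return {
        "top_level": type(data).__name__,
        "count": len(records),
        "first_record_keys": keys,
    }
```

Calling `describe_json` on a file from the data folder (the exact file names are not listed in this README, so substitute the real path) returns the top-level type, the number of records, and the sorted keys of the first record.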

Some Examples

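The table below shows some examples from the FlaCGEC dataset. A sentence may contain multiple errors, and the errors can involve different components of the sentence. [S] is the source sentence, [T] the corrected target sentence, and [A] the annotation.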

[S] 节日期间,饭店纷纷推出特色餐饮特惠措施,吸引市民走进饭店.
Translation: During the festival, per hotel introduces special cuisines promotion activities, attracting citizens to walk in.
[T] 节日期间,各饭店纷纷推出特色餐饮和特惠措施,吸引市民走进饭店。
Translation: During the festival, every hotel introduces special cuisines and promotion activities, attracting citizens to walk in.
[A] 5 5|||S-Demonstrative pronouns指示代词|||各;16 16|||M-Prepositions for objects介词引出对象|||和
[S] 睡觉时,身体感觉到,人就容易梦到什么内容。
Translation: During sleeping, people easily dream the bodies feel.
[T] 睡觉时,身体感觉到什么,人就容易梦到什么内容。
Translation: During sleeping, people easily dream what the bodies feel.
[A] 9 9|||M-Non-interrogative use of interrogative pronouns疑问词的非疑问用法|||什么
[S] 他很不服气地说:“我尽力而为了已经!”
Translation: He listens and said disgruntledly: “I already have tried !”
[T] 他听了很不服气地说:“我已经尽力而为了!”
Translation: He listened and said disgruntledly: “I have already tried !”
[A] 2 2|||M-Aspect particle动态助词|||了;16 17|||W-Adverbs of time时间副词|||None
[S] 但有没受到老板的责备,而且他心里很失落。
Translation: But did he receive the blame from his boss, and he is upset.
[T] 虽然没有受到老板的责备,但是他心里很失落。
Translation: Even though he did not receive the blame from his boss, he is upset.
[A] 0 0|||S-Conjunctions for connecting clauses介词连接分句|||虽然;2 2|||W-Negative adverb否定副词|||没;11 12|||W-Conjunctions for connecting clauses介词连接分句|||但是
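The [A] lines in the examples above appear to follow a regular pattern: edits separated by a semicolon, and within each edit three fields separated by `|||` (a `start end` span, a label of the form `<edit type>-<grammar point>`, and a correction that may be the literal `None`). The following is a minimal parsing sketch under that assumption; the `Edit` class and `parse_annotation` function are hypothetical names, and the meaning of the offsets and of the `S`/`M`/`W` prefixes is not interpreted here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    start: int
    end: int
    edit_type: str             # prefix before the first "-", e.g. "S", "M", "W"
    grammar_point: str         # label text after the first "-"
    correction: Optional[str]  # None when the field is the literal "None"


def parse_annotation(line: str) -> List[Edit]:
    """Parse one annotation line as displayed in the examples (assumed format)."""
    line = line.strip()
    if line.startswith("[A]"):
        line = line[3:].strip()
    edits = []
    # Accept both the ASCII and the full-width semicolon as edit separators.
    for chunk in re.split(r"[;；]", line):
        chunk = chunk.strip()
        if not chunk:
            continue
        span, label, correction = chunk.split("|||")
        start, end = (int(x) for x in span.split())
        # Split on the first hyphen only, so labels such as
        # "Non-interrogative ..." keep their internal hyphens.
        edit_type, _, grammar_point = label.partition("-")
        edits.append(
            Edit(start, end, edit_type, grammar_point,
                 None if correction == "None" else correction)
        )
    return edits
```

For instance, parsing the annotation of the first example yields two `Edit` objects, with types `S` and `M` and corrections `各` and `和`.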

Some Grammar Points

The table below lists some of the instantiated grammar points together with corresponding examples.

| Grammar Points | Instantiations | Examples |
| --- | --- | --- |
| Adverbs of degree [程度副词] | 很 | 有的人很从容。 |
| | 有点儿 | 左边这瓶有点儿酸。 |
| Conjunctions for connecting clauses [介词连接分句] | 如果 | 如果没有标记,散落的片断将… |
| | 因此 | 因此,人们以乌龟指长寿。 |
| | 总之 | 总之,电视带给我们知识和娱乐。 |
| Modal verbs [能愿动词] | 需要 | 这项工程至少需要10年时间才能完工。 |
| | 得 | 妈妈生病了,我得马上回国去看她。 |
| Passive sentences [被动句] | 被 | 自行车被当做一种交通工具。 |
| | 被…所 | 快乐的人不会被痛苦所左右。 |
| Successive complex sentences [承接复句] | 于是 | 唐太宗很生气,于是召集群臣,当面训斥魏征。 |
| | 便 | 司马光受父亲影响,自幼便聪明好学。 |