Repository for Zero-Resource Cross-Lingual Named Entity Recognition
AAAI'20 paper.
- Dataset released
- The datasets for Finnish (fi) and Arabic (ar) have been updated. Please check here.
- The remaining datasets are from CoNLL: `en`, `es`, `de`, and `nl`. They are available from their respective sources.
Language | Three-letter code (CoNLL) | Standard code |
---|---|---|
English | eng | en |
Spanish | esp | es |
Dutch | ned | nl |
German | deu | de |
Arabic | arb | ar |
This repository uses the standard two-letter language codes. Note: CoNLL uses three-letter codes.
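As a minimal sketch of the mapping in the table above, the following helper converts a CoNLL three-letter code into the standard code used in this repository. The names `CONLL_TO_STANDARD` and `to_standard` are hypothetical and are not part of the released code.

```python
# Mapping taken from the table above: CoNLL three-letter code -> standard code.
# The dictionary and function names are illustrative assumptions.
CONLL_TO_STANDARD = {
    "eng": "en",
    "esp": "es",
    "ned": "nl",
    "deu": "de",
    "arb": "ar",
}


def to_standard(code: str) -> str:
    """Return the standard code for a CoNLL three-letter or standard code."""
    code = code.lower()
    if code in CONLL_TO_STANDARD.values():
        # Already a standard code, so return it unchanged.
        return code
    try:
        return CONLL_TO_STANDARD[code]
    except KeyError:
        raise ValueError(f"Unknown language code: {code!r}") from None
```

For example, `to_standard("ned")` gives `"nl"`, and a code that is already in standard form, such as `"de"`, is returned as-is.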
- English (en): CoNLL-2003 shared task.
- Spanish (es): CoNLL-2002 shared task.
- Dutch (nl): CoNLL-2002 shared task.
- German (de): CoNLL-2003 shared task.
- Finnish (fi): Adapted from this repository. However, the original data is not in a form that supports transfer learning experiments (from the English CoNLL NER dataset to the Finnish dataset). We refactored the original source and manually corrected some tags for standardization.
- Arabic (ar): Adapted from here. However, the original data does not come with a proper train, dev, and test split. The dataset consists of 28 manually annotated Wikipedia articles. To create the splits, we randomly select sentences from each article and add them to the train, dev, or test split. Split sizes: train (~90%), dev (~10%), test (~10%). A few tags and/or tokens were manually altered for standardization so that we can perform transfer learning experiments.
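The per-article random split described for the Arabic data can be sketched roughly as follows. This is not the script used to build the released splits: the function name `split_articles`, the default fractions, and the seed are assumptions for illustration, and train simply receives whatever is left after the dev and test sentences are drawn.

```python
import random


def split_articles(articles, dev_frac=0.1, test_frac=0.1, seed=13):
    """Randomly assign sentences from each article to train/dev/test.

    `articles` is a list of articles, each a list of sentences.
    Fractions and seed are illustrative assumptions, not the values
    used for the released dataset.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev, test = [], [], []
    for sentences in articles:
        shuffled = list(sentences)  # copy so the input is not modified
        rng.shuffle(shuffled)
        n_dev = round(len(shuffled) * dev_frac)
        n_test = round(len(shuffled) * test_frac)
        # Every article contributes to all three splits.
        dev.extend(shuffled[:n_dev])
        test.extend(shuffled[n_dev:n_dev + n_test])
        train.extend(shuffled[n_dev + n_test:])
    return train, dev, test
```

Sampling within each article, rather than assigning whole articles to one split, means every split contains sentences from all of the articles.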
If you are using the refined Finnish NER dataset, please cite the following papers:
@inproceedings{bari19,
Address = {New York, USA},
Author = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
Booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
Numpages = {},
Publisher = {AAAI},
Series = {AAAI '20},
pages = {xx--xx},
Title = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
Year = {2020},
url = {}
}
@article{Ruokolainen_2019,
title={A Finnish news corpus for named entity recognition},
ISSN={1574-0218},
url={http://dx.doi.org/10.1007/s10579-019-09471-7},
DOI={10.1007/s10579-019-09471-7},
journal={Language Resources and Evaluation},
publisher={Springer Science and Business Media LLC},
author={Ruokolainen, Teemu and Kauppinen, Pekka and Silfverberg, Miikka and Lindén, Krister},
year={2019},
month={Aug}
}
If you are using the refined Arabic NER dataset, please cite the following papers:
@inproceedings{bari19,
Address = {New York, USA},
Author = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
Booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
Numpages = {},
Publisher = {AAAI},
Series = {AAAI '20},
pages = {xx--xx},
Title = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
Year = {2020},
url = {}
}
@inproceedings{AQMAR,
author = {Mohit, Behrang and Schneider, Nathan and Bhowmick, Rishav and Oflazer, Kemal and Smith, Noah A.},
title = {Recall-oriented Learning of Named Entities in Arabic Wikipedia},
booktitle = {EACL},
series = {EACL '12},
year = {2012},
isbn = {978-1-937284-19-0},
location = {Avignon, France},
pages = {162--173},
numpages = {12},
url = {http://dl.acm.org/citation.cfm?id=2380816.2380839},
acmid = {2380839},
publisher = {ACL},
address = {Stroudsburg, PA, USA},
}