A repository for our AAAI-2020 Cross-lingual-NER paper. Code will be updated shortly.

ntunlp/Zero-Shot-Cross-Lingual-NER

Zero-Shot-Cross-Lingual-NER

Repository for the AAAI'20 paper "Zero-Resource Cross-Lingual Named Entity Recognition".

News

  • Dataset released
    • The datasets for Finnish (fi) and Arabic (ar) have been updated. Please check here.
    • The remaining datasets are the CoNLL datasets for en, es, de and nl. They are available from their respective sources.

Language short form

| Language | Three letter | Standard |
|----------|--------------|----------|
| English  | eng          | en       |
| Spanish  | esp          | es       |
| Dutch    | ned          | nl       |
| German   | deu          | de       |
| Arabic   | arb          | ar       |

This repository uses the standard two-letter short form for each language. Note: CoNLL uses the three-letter short form.
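The table can be expressed as a small lookup, which may help when renaming CoNLL files to this repository's convention. This is a minimal sketch: the names `CONLL_TO_STANDARD` and `to_standard_code` are hypothetical and not part of this repository, and only the five languages listed in the table are covered (Finnish is not in the table, so it is not in the mapping).

```python
# Mapping copied from the table above (CoNLL three-letter form -> standard form).
# The names CONLL_TO_STANDARD and to_standard_code are illustrative assumptions.
CONLL_TO_STANDARD = {
    "eng": "en",
    "esp": "es",
    "ned": "nl",
    "deu": "de",
    "arb": "ar",
}


def to_standard_code(code: str) -> str:
    """Return the standard short form for a CoNLL or standard language code."""
    code = code.strip().lower()
    if code in CONLL_TO_STANDARD.values():
        # Already in the standard short form.
        return code
    try:
        return CONLL_TO_STANDARD[code]
    except KeyError:
        raise ValueError(f"Unknown language code: {code!r}") from None
```

For example, `to_standard_code("ned")` returns `"nl"`, and a code that is already in the standard form is returned unchanged.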

Dataset

  1. English (en) : CoNLL-2003 shared task.

  2. Spanish (es) : CoNLL-2002 shared task.

  3. Dutch (nl) : CoNLL-2002 shared task.

  4. German (de) : CoNLL-2003 shared task.

  5. Finnish (fi) : Adapted from this repository. The original data is not in a form that supports transfer learning experiments (from the English CoNLL NER dataset to the Finnish dataset), so we refactored the original source and manually corrected some tags for standardization.

  6. Arabic (ar) : Adapted from here. The original data does not come with a proper train, dev, test split. The dataset consists of 28 manually annotated Wikipedia articles. To create the splits, we randomly select sentences from each article and add them to the train, dev, or test split. Split sizes: train (~90%), dev (~10%), test (~10%). A few tags and/or tokens were manually altered for standardization so that we can perform transfer learning experiments.
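The per-article random split described for the Arabic data can be sketched roughly as follows. This is not the script used to build the released splits: the function name `split_articles`, the `dev_frac` and `test_frac` parameters, their default values, and the fixed seed are all assumptions made for illustration, and each article is assumed to be already parsed into a list of sentences of `(token, tag)` pairs.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # one sentence as (token, tag) pairs


def split_articles(
    articles: Sequence[Sequence[Sentence]],
    dev_frac: float = 0.1,   # assumed value, not taken from the release script
    test_frac: float = 0.1,  # assumed value, not taken from the release script
    seed: int = 0,
) -> Dict[str, List[Sentence]]:
    """Randomly assign sentences from every article to train, dev and test."""
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")

    rng = random.Random(seed)  # local RNG so the split is reproducible
    splits: Dict[str, List[Sentence]] = {"train": [], "dev": [], "test": []}

    for sentences in articles:
        # Shuffle a copy so the caller's data is left untouched.
        shuffled = list(sentences)
        rng.shuffle(shuffled)

        n_dev = round(len(shuffled) * dev_frac)
        n_test = round(len(shuffled) * test_frac)

        # Every article contributes to all three splits; the rest goes to train.
        splits["dev"].extend(shuffled[:n_dev])
        splits["test"].extend(shuffled[n_dev:n_dev + n_test])
        splits["train"].extend(shuffled[n_dev + n_test:])

    return splits
```

Sampling inside each article, rather than assigning whole articles to one split, keeps every article represented in train, dev and test, which matches the description above.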

If you use the refined Finnish NER dataset, please cite the following papers:

@inproceedings{bari19,
	Address     = {New York, USA},
	Author      = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
	Booktitle   = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
	Numpages    = {},
	Publisher   = {AAAI},
	Series      = {AAAI '20},
	pages       = {xx--xx},
	Title       = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
	Year        = {2020},
	url         = {}
}
@article{Ruokolainen_2019,
   title={A Finnish news corpus for named entity recognition},
   ISSN={1574-0218},
   url={http://dx.doi.org/10.1007/s10579-019-09471-7},
   DOI={10.1007/s10579-019-09471-7},
   journal={Language Resources and Evaluation},
   publisher={Springer Science and Business Media LLC},
   author={Ruokolainen, Teemu and Kauppinen, Pekka and Silfverberg, Miikka and Lindén, Krister},
   year={2019},
   month={Aug}
}

If you use the refined Arabic NER dataset, please cite the following papers:

@inproceedings{bari19,
	Address     = {New York, USA},
	Author      = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
	Booktitle   = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
	Numpages    = {},
	Publisher   = {AAAI},
	Series      = {AAAI '20},
	pages       = {xx--xx},
	Title       = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
	Year        = {2020},
	url         = {}
}
@inproceedings{AQMAR,
 author = {Mohit, Behrang and Schneider, Nathan and Bhowmick, Rishav and Oflazer, Kemal and Smith, Noah A.},
 title = {Recall-oriented Learning of Named Entities in Arabic Wikipedia},
 booktitle = {EACL},
 series = {EACL '12},
 year = {2012},
 isbn = {978-1-937284-19-0},
 location = {Avignon, France},
 pages = {162--173},
 numpages = {12},
 url = {http://dl.acm.org/citation.cfm?id=2380816.2380839},
 acmid = {2380839},
 publisher = {ACL},
 address = {Stroudsburg, PA, USA},
} 
