Language Identification of Kurdish & Zaza-Gorani Languages

Language identification (or language detection) is the task of determining the language in which a sentence is written. This repository provides models for language identification of Kurdish and Zaza-Gorani languages in their Kurdified Perso-Arabic and Latin scripts. Our models can predict the following languages and scripts:

Northern Kurdish / کورمانجی (Kurmanji, kmr): both scripts, with kuarab and kulatn labels
Central Kurdish / سۆرانی (Sorani, ckb): both scripts, with ckbarab and ckblatn labels
Southern Kurdish / کوردیی خوارین (sdh)
Gorani / گۆرانی (Hawrami, hac)
Zazaki / Zazakî (zza): both scripts, with zza for Bedirxan and zzawiki for the script used on Zazaki Wikipedia
Arabic / اَلْعَرَبِيَّةُ (ar)
Persian / فارسی (fa)
Turkish / Türkçe (tr)

How to use?

Our models are trained using fastText. You can run them in Python or on the command line after installing the fastText library as described at https://fasttext.cc/docs/en/support.html.

Two models are provided:

models/KLID_model.ftz: use this if you do not need to detect the script. It predicts language codes only.
models/KLID_model_scr.ftz: use this if you want the script label in addition to the language code. It predicts both language and script.

Here is an example in Python:

>>> import fasttext
>>> model = fasttext.load_model("models/KLID_model.ftz")

# Central Kurdish
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات") 
(('__label__ckb',), array([1.00002003]))
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات", k=5)
(('__label__ckb', '__label__ku'), array([1.00002003e+00, 1.00000989e-05]))
>>> model.predict("باڵیۆزی عێراق")
(('__label__ckb',), array([1.00001979]))

# Southern Kurdish
>>> model.predict("چەس ئمڕوو چە قوومیاس؟!!") 
(('__label__sdh',), array([1.00003743]))

# Gorani
>>> model.predict("داستانێ فرەتەر و درێژتەرەنه و دەسی سەر پەی") 
(('__label__hac',), array([0.99998134]))

# Northern Kurdish (Kurmanji) in Perso-Arabic script
>>> model.predict("ئەگەر بێژم ئەز فەرهادم") 
(('__label__ku',), array([0.93445575]))

# Zazaki
>>> model.predict("Seba naye zî ganî ma rayîr û metodanê xo xurtêr bikerê.") 
(('__label__zza',), array([1.00003004]))

# Northern Kurdish (Kurmanji) in Latin script
>>> model.predict("Amerîkayîyan di sala 2004 de zîndana Ebû Xerîb girtin.") 
(('__label__ku',), array([0.99766862]))

# Central Kurdish in Latin script
>>> model.predict("Emin filsêkim le kitêban dest nekewtbû bełam") 
(('__label__ckb',), array([1.00001991]))

# Central Kurdish (both sentences are misclassified here, as ku and sdh)
>>> model.predict("گەرەکمە پێی بێژم نامگەرەکە") 
(('__label__ku',), array([0.99485904])) 
>>> model.predict("جا ئەتوو وەرە دەگەڵ وی ڕێک کەوە")
(('__label__sdh',), array([0.84034669])) 

# English (not among the supported languages, so the model still returns one of its own labels)
>>> model.predict("To be, or not to be") 
(('__label__zza',), array([1.00003004]))
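
The raw output above pairs `__label__`-prefixed labels with scores that can slightly exceed 1 (for example 1.00002003). The following is a minimal sketch of a post-processing helper, not part of this repository or of fastText: the names `clean_prediction` and `identify` are hypothetical. It strips the label prefix, clips the scores to the range 0 to 1, and removes newlines before predicting, since fastText's Python `predict` processes one line at a time.

```python
def clean_prediction(labels, scores, prefix="__label__"):
    """Turn fastText's (labels, scores) output into (code, probability) pairs."""
    pairs = []
    for label, score in zip(labels, scores):
        code = label[len(prefix):] if label.startswith(prefix) else label
        # Scores such as 1.00002003 slightly exceed 1, so clip to [0, 1].
        pairs.append((code, min(max(float(score), 0.0), 1.0)))
    return pairs


def identify(model, text, k=1):
    """Predict the k most likely languages for one sentence.

    `model` is assumed to be an object returned by fasttext.load_model(...).
    """
    # Collapse all whitespace runs, including newlines, into single spaces,
    # because predict() handles one line at a time.
    single_line = " ".join(text.split())
    labels, scores = model.predict(single_line, k=k)
    return clean_prediction(labels, scores)


# Example with the values printed in the session above (no model needed):
print(clean_prediction(("__label__ckb", "__label__ku"), [1.00002003, 1.00000989e-05]))
```

With a loaded model, `identify(model, "باڵیۆزی عێراق", k=2)` would return a list of `(code, probability)` pairs. Clipping only tidies the numbers; it does not make unsupported input detectable, as the English sentence above still receives a score of about 1.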

If you would like to train your own models, you can use the datasets provided in the datasets folder. All the datasets are merged into two files, train and train_scr, which contain the instances tagged without and with their scripts, respectively.
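
As a rough sketch of what training on these files could look like with the fastText Python API: the code below assumes that train is in fastText's supervised format (one instance per line, starting with a `__label__<code>` tag), which this README does not state explicitly. The path `datasets/train`, the output file name, the helper names, and the hyperparameter values are all illustrative assumptions, not the settings used for the released models.

```python
def to_fasttext_line(label, sentence, prefix="__label__"):
    """Format one instance in fastText's supervised format (hypothetical helper)."""
    return f"{prefix}{label} {' '.join(sentence.split())}"


def train_and_save(train_path="datasets/train", out_path="my_KLID_model.ftz"):
    """Train, quantize, and save a classifier; paths and settings are assumptions."""
    import fasttext  # imported here so the formatting helper works without it

    model = fasttext.train_supervised(
        input=train_path,
        epoch=25,  # illustrative value
        minn=2,    # character n-grams of length 2 to 4 (illustrative)
        maxn=4,
    )
    # Quantization compresses the model; .ftz is the usual extension for it.
    model.quantize(input=train_path, retrain=True)
    model.save_model(out_path)
    return model


# Formatting a single training instance (no fastText installation needed):
print(to_fasttext_line("ckb", "باڵیۆزی عێراق"))
```

Pointing `train_path` at train_scr instead would, under the same format assumption, produce a model that predicts language and script together.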

Cite this corpus

If you use the models, please cite the project along with the following paper:

@inproceedings{ahmadi2023fieldmatters,
  title = "Approaches to Corpus Creation for Low-Resource Language Technology: the Case of {Southern Kurdish and Laki}",
  author = "Ahmadi, Sina and Azin, Zahra and Belelli, Sara and Anastasopoulos, Antonios",
  booktitle = "Proceedings of the second workshop on NLP applications to field linguistics",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "The 17th Conference of the European Chapter of the Association for Computational Linguistics"
}

License

MIT
