Compact Language Detector v3 (CLD3)

Model

CLD3 is a neural network model for language identification. This package contains the inference code and a trained model. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then the extracted trigrams are "ban", "ana", "nan", and "ana", so one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
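As a rough illustration of the counting step described above, the following sketch computes ngram fractions and maps an ngram to an id in a small range. The names NgramFractions and NgramId are hypothetical and are not part of the CLD3 code. The sketch also assumes byte-level ngrams (adequate for ASCII input such as "banana") and uses std::hash as a stand-in for whatever hash function the package actually uses.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper (not the CLD3 API): returns, for each distinct ngram of
// length n, the fraction of all ngram occurrences in `text` that it accounts
// for. Assumption: ngrams are taken over bytes, which matches characters only
// for ASCII text.
std::map<std::string, double> NgramFractions(const std::string& text,
                                             std::size_t n) {
  std::map<std::string, double> fractions;
  if (n == 0 || text.size() < n) return fractions;

  const std::size_t total = text.size() - n + 1;  // number of ngram positions
  for (std::size_t i = 0; i < total; ++i) {
    fractions[text.substr(i, n)] += 1.0;  // raw count for now
  }
  for (auto& entry : fractions) {
    entry.second /= static_cast<double>(total);  // count -> fraction
  }
  return fractions;
}

// Hypothetical helper: hashes an ngram down to an id in [0, id_range).
// Assumption: std::hash stands in for the real hash function, and id_range
// is greater than zero.
std::size_t NgramId(const std::string& ngram, std::size_t id_range) {
  return std::hash<std::string>{}(ngram) % id_range;
}
```

Calling NgramFractions("banana", 3) yields three distinct trigrams, with "ana" mapped to 2/4 and "ban" and "nan" each mapped to 1/4.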

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer. The remaining components of the network are a hidden layer with rectified linear (ReLU) activation and a softmax layer.

To get a language prediction for the input text, we simply perform a forward pass through the network.
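The forward pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the type aliases, function names, and row-major weight layout are all hypothetical, and the real model's dimensions and parameters are not shown.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // assumed layout: Matrix[row][col]

// Weighted average of embedding rows: sum over (id, fraction) pairs of
// fraction * table[id]. The fractions for one ngram type sum to 1.
Vector AverageEmbedding(
    const Matrix& table,
    const std::vector<std::pair<std::size_t, double>>& id_fractions) {
  Vector out(table.empty() ? 0 : table[0].size(), 0.0);
  for (const auto& item : id_fractions) {
    for (std::size_t d = 0; d < out.size(); ++d) {
      out[d] += item.second * table[item.first][d];
    }
  }
  return out;
}

// Concatenates the per-ngram-type averages into one embedding layer.
Vector Concatenate(const std::vector<Vector>& parts) {
  Vector out;
  for (const Vector& part : parts) {
    out.insert(out.end(), part.begin(), part.end());
  }
  return out;
}

// Computes weights * input + bias, one output per weight row.
Vector Affine(const Matrix& weights, const Vector& bias, const Vector& input) {
  Vector out(bias);
  for (std::size_t r = 0; r < weights.size(); ++r) {
    for (std::size_t c = 0; c < input.size(); ++c) {
      out[r] += weights[r][c] * input[c];
    }
  }
  return out;
}

// Rectified linear activation, applied element-wise.
Vector Relu(Vector v) {
  for (double& x : v) x = std::max(0.0, x);
  return v;
}

// Softmax with the maximum logit subtracted for numerical stability.
Vector Softmax(Vector v) {
  if (v.empty()) return v;
  const double max_logit = *std::max_element(v.begin(), v.end());
  double sum = 0.0;
  for (double& x : v) {
    x = std::exp(x - max_logit);
    sum += x;
  }
  for (double& x : v) x /= sum;
  return v;
}

// Embedding layer -> hidden ReLU layer -> softmax over languages.
Vector Forward(const Vector& embedding_layer, const Matrix& hidden_weights,
               const Vector& hidden_bias, const Matrix& softmax_weights,
               const Vector& softmax_bias) {
  const Vector hidden =
      Relu(Affine(hidden_weights, hidden_bias, embedding_layer));
  return Softmax(Affine(softmax_weights, softmax_bias, hidden));
}
```

The index of the largest value in the vector returned by Forward would correspond to the predicted language, under the assumption that each output position is tied to one language code.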

Figure

Supported Languages

The model outputs BCP-47-style language codes, shown in the table below. For some languages, output is differentiated by script. Language and script names are taken from Unicode CLDR.

Output Code Language Name Script Name
af Afrikaans Latin
am Amharic Ethiopic
ar Arabic Arabic
bg Bulgarian Cyrillic
bg-Latn Bulgarian Latin
bn Bangla Bangla
bs Bosnian Latin
ca Catalan Latin
ceb Cebuano Latin
co Corsican Latin
cs Czech Latin
cy Welsh Latin
da Danish Latin
de German Latin
el Greek Greek
el-Latn Greek Latin
en English Latin
eo Esperanto Latin
es Spanish Latin
et Estonian Latin
eu Basque Latin
fa Persian Arabic
fi Finnish Latin
fil Filipino Latin
fr French Latin
fy Western Frisian Latin
ga Irish Latin
gd Scottish Gaelic Latin
gl Galician Latin
gu Gujarati Gujarati
ha Hausa Latin
haw Hawaiian Latin
hi Hindi Devanagari
hi-Latn Hindi Latin
hmn Hmong Latin
hr Croatian Latin
ht Haitian Creole Latin
hu Hungarian Latin
hy Armenian Armenian
id Indonesian Latin
ig Igbo Latin
is Icelandic Latin
it Italian Latin
iw Hebrew Hebrew
ja Japanese Japanese
ja-Latn Japanese Latin
jv Javanese Latin
ka Georgian Georgian
kk Kazakh Cyrillic
km Khmer Khmer
kn Kannada Kannada
ko Korean Korean
ku Kurdish Latin
ky Kyrgyz Cyrillic
la Latin Latin
lb Luxembourgish Latin
lo Lao Lao
lt Lithuanian Latin
lv Latvian Latin
mg Malagasy Latin
mi Maori Latin
mk Macedonian Cyrillic
ml Malayalam Malayalam
mn Mongolian Cyrillic
mr Marathi Devanagari
ms Malay Latin
mt Maltese Latin
my Burmese Myanmar
ne Nepali Devanagari
nl Dutch Latin
no Norwegian Latin
ny Nyanja Latin
pa Punjabi Gurmukhi
pl Polish Latin
ps Pashto Arabic
pt Portuguese Latin
ro Romanian Latin
ru Russian Cyrillic
ru-Latn Russian Latin
sd Sindhi Arabic
si Sinhala Sinhala
sk Slovak Latin
sl Slovenian Latin
sm Samoan Latin
sn Shona Latin
so Somali Latin
sq Albanian Latin
sr Serbian Cyrillic
st Southern Sotho Latin
su Sundanese Latin
sv Swedish Latin
sw Swahili Latin
ta Tamil Tamil
te Telugu Telugu
tg Tajik Cyrillic
th Thai Thai
tr Turkish Latin
uk Ukrainian Cyrillic
ur Urdu Arabic
uz Uzbek Latin
vi Vietnamese Latin
xh Xhosa Latin
yi Yiddish Hebrew
yo Yoruba Latin
zh Chinese Han (including Simplified and Traditional)
zh-Latn Chinese Latin
zu Zulu Latin
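Because some output codes carry a script suffix (for example bg-Latn), a caller may want to separate the language from the script. The sketch below shows one way to do that. SplitOutputCode is a hypothetical helper, not part of the CLD3 code, and it assumes the code contains at most one hyphen, as in the table above.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not the CLD3 API): splits an output code such as
// "bg-Latn" into its language part ("bg") and script part ("Latn").
// Codes without a script suffix, such as "af", yield an empty script.
std::pair<std::string, std::string> SplitOutputCode(const std::string& code) {
  const std::string::size_type dash = code.find('-');
  if (dash == std::string::npos) {
    return std::make_pair(code, std::string());
  }
  return std::make_pair(code.substr(0, dash), code.substr(dash + 1));
}
```

For instance, SplitOutputCode("zh-Latn") gives the pair ("zh", "Latn"), while SplitOutputCode("zh") gives ("zh", "").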

Installation

CLD3 is designed to run in the Chrome browser, so it relies on code in Chromium. The steps for building and running the demo of the language detection model are:

  • Check out the Chromium repository.
  • Copy the code from this repository to //third_party/cld_3.
  • Uncomment the language_identifier_main executable in src/BUILD.gn.
  • Build and run the model using the commands:
gn gen out/Default
ninja -C out/Default third_party/cld_3/src/src:language_identifier_main
out/Default/language_identifier_main

Bugs and Feature Requests

Open a GitHub issue for this repository to file bugs and feature requests.

Announcements and Discussion

For announcements of major updates and for general discussion, please subscribe to the mailing list: cld3-users@googlegroups.com

Credits

Original authors of the code in this package include (in alphabetical order):

  • Alex Salcianu
  • Andy Golding
  • Anton Bakalov
  • Chris Alberti
  • Daniel Andor
  • David Weiss
  • Emily Pitler
  • Greg Coppola
  • Jason Riesa
  • Kuzman Ganchev
  • Michael Ringgaard
  • Nan Hua
  • Ryan McDonald
  • Slav Petrov
  • Stefan Istrate
  • Terry Koo
