Code for training and applying a language identification model. The model was trained on the WiLI-2018 Wikipedia Language Identification database. The classifier is an XGBoost model and achieves an accuracy of 85.97% on the WiLI test set across 235 languages.
- Python 3.7+
- xgboost==1.3.3
- scikit-learn==0.23.1
- gensim, pandas, joblib
python src/language_identifier.py [INPUT_FILE.txt]
The input file can be any multi-line text document.
The model is built with XGBoost, a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework (Chen and He, 2021). XGBoost has been used for multi-class text classification by Avishek Nag; to the best of my knowledge, however, it has not been applied to language identification. The data is preprocessed and piped through scikit-learn's TfidfVectorizer. Training used the XGBoost objective 'multi:softmax' for categorical targets and took 33.5 minutes on an AMD Ryzen 7 3700X (8 cores, 16 threads). The fitted TfidfVectorizer (scikit-learn==0.23.1) is needed to transform new data before the classifier predicts its language label, so it is provided as the joblib archive tfidf_transformer.joblib.dat. The XGBoost model (xgboost==1.3.3) is provided as the joblib archive xgboost_model.joblib.dat.
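For orientation, here is a minimal sketch of that pipeline, covering both the training and the prediction side. The file names match the archives above, but the toy data, vectorizer settings, and label encoding are assumptions for illustration, not the exact training configuration.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Toy stand-ins for the WiLI paragraphs; the vectorizer settings used
# for the actual training run are not reproduced here.
texts = ["this is english text", "dies ist deutscher text", "ceci est du texte"]
labels = ["eng", "deu", "fra"]

# TF-IDF features, as in the training pipeline described above.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# Integer-encode the language labels for the 'multi:softmax' objective.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

clf = XGBClassifier(objective="multi:softmax")
clf.fit(X, y)

# Persist both artifacts, mirroring the archives shipped with this repo.
joblib.dump(tfidf, "tfidf_transformer.joblib.dat")
joblib.dump(clf, "xgboost_model.joblib.dat")

# Prediction side: reload the artifacts and classify new text.
tfidf = joblib.load("tfidf_transformer.joblib.dat")
clf = joblib.load("xgboost_model.joblib.dat")
pred = clf.predict(tfidf.transform(["ein weiterer deutscher satz"]))
print(encoder.inverse_transform(pred))
```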
The WiLI-2018 Wikipedia Language Identification database contains 235,000 paragraphs covering 235 languages (1,000 per language). Language labels for identification and prediction, as well as links to the train and test data, are provided in the data folders. The languages range from Achinese, Afrikaans, and Alemannic German to Yiddish, Yoruba, and Zeeuws.
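The following sketch loads the training split; it assumes the standard WiLI-2018 file layout (x_train.txt with one paragraph per line and y_train.txt with the matching label per line), so adjust the paths to whatever the data folders actually contain.

```python
# Assumed WiLI-2018 layout: one paragraph per line in x_train.txt and
# the matching language label per line in y_train.txt.
with open("data/x_train.txt", encoding="utf-8") as f:
    texts = [line.rstrip("\n") for line in f]
with open("data/y_train.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f]

assert len(texts) == len(labels)
print(len(texts), "paragraphs,", len(set(labels)), "languages")
```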
The first attempt at language identification was a simple n-gram based approach, one of the most common approaches to this problem. Code for this attempt is stored in the src/naive_bayes directory. While this approach is known to give fairly good results, the implemented naive Bayes classifier is included for the sake of completeness. The XGBoost model demonstrates an attempt at applying a state-of-the-art machine learning model to language classification across 235 languages without any n-gram parsing.
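For comparison, a character n-gram naive Bayes baseline can be put together in a few lines of scikit-learn; this is a sketch of one common variant, not necessarily the exact configuration in src/naive_bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration; in practice, fit on the WiLI training split.
texts = ["this is english text", "dies ist deutscher text", "ceci est du texte"]
labels = ["eng", "deu", "fra"]

# Character n-grams (orders 1-3 here) are the classic feature set for
# language identification; their counts feed a multinomial naive Bayes.
baseline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
baseline.fit(texts, labels)
print(baseline.predict(["ein kurzer deutscher satz"]))
```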