Though Arabic is the lingua franca of most of the MENA region, different dialects of Arabic are used in different countries. While some dialects differ significantly from each other, others, particularly those in close geographic proximity, may be more difficult to tease apart. Automatically distinguishing between dialectal variations is valuable for many downstream applications such as machine translation, POS tagging, geo-locating users, and author profiling.
Four approaches were compared:
- TF-IDF + Linear SVM
- Keras Embedding Layer + LSTM
- Arabert Transformer Embedding + LSTM
- Fine-tuning Arabert Transformer
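As a rough illustration of the first approach, a TF-IDF + Linear SVM classifier could be sketched with scikit-learn as below. This is a minimal sketch, not the notebook's actual code: the character n-gram settings, the toy sentences, and the two country labels are all assumptions chosen for illustration, and none of the text comes from QADI.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up examples (NOT taken from QADI); labels are country codes.
train_texts = [
    "ازيك عامل ايه النهارده",
    "انا عايز اروح دلوقتي",
    "كيفك شو عم تعمل هلق",
    "بدي روح هلق عالبيت",
]
train_labels = ["EG", "EG", "LB", "LB"]

# Assumption: character n-grams within word boundaries; the original
# project may use word-level features or different n-gram ranges.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

# Classify an unseen sentence; the result is one of the training labels.
prediction = model.predict(["شو عم تعمل"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the two steps together, so the same TF-IDF vocabulary is applied at training and at prediction time.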
The QADI dataset was used for this task. In the QADI paper, the authors automatically collected a dataset of tweets covering a wide range of country-level Arabic dialects from 18 countries in the Middle East and North Africa region. Building the dataset relied on multiple filters that identify the country of each user from their account description and eliminate tweets that are either written in Modern Standard Arabic or contain inappropriate language. In an intrinsic evaluation, the labels of a randomly selected set of tweets were 91.5% accurate.
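The filtering idea described above can be pictured with a small sketch. It is only an illustration of the general shape of such a pipeline, not the QADI authors' procedure: the keyword lists, the Modern Standard Arabic (MSA) marker words, the blocklist, and every function name here are hypothetical.

```python
# Hypothetical hints mapping account-description text to a country code.
COUNTRY_HINTS = {
    "EG": ["مصر", "القاهرة", "egypt", "cairo"],
    "LB": ["لبنان", "بيروت", "lebanon", "beirut"],
}
# Hypothetical function words treated here as a sign of MSA.
MSA_MARKERS = {"الذي", "التي", "سوف", "لذلك"}
# Hypothetical placeholder blocklist for inappropriate language.
BLOCKLIST = {"badword"}


def guess_country(description):
    """Return a country code only if exactly one country is hinted at."""
    text = description.lower()
    matches = [
        code
        for code, hints in COUNTRY_HINTS.items()
        if any(hint in text for hint in hints)
    ]
    return matches[0] if len(matches) == 1 else None


def keep_tweet(tweet):
    """Drop tweets that look like MSA or contain blocklisted words."""
    tokens = set(tweet.lower().split())
    return not (tokens & MSA_MARKERS) and not (tokens & BLOCKLIST)


def build_dataset(accounts):
    """accounts: iterable of (description, tweets) pairs."""
    dataset = []
    for description, tweets in accounts:
        country = guess_country(description)
        if country is None:
            continue  # skip users whose country is unknown or ambiguous
        dataset.extend((t, country) for t in tweets if keep_tweet(t))
    return dataset


accounts = [
    ("مهندس من القاهرة", ["انا عايز اروح دلوقتي", "سوف نذهب الى المدرسة"]),
    ("Living between Cairo and Beirut", ["كيفك"]),
]
dataset = build_dataset(accounts)
print(dataset)
```

In this toy run the second account is skipped because its description hints at two countries, and the tweet containing an MSA marker is dropped.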
Model | Accuracy |
---|---|
TF-IDF + Linear SVM | 56.1% |
Keras Embedding Layer + LSTM | 52.1% |
Arabert Transformer Embedding + LSTM | 52.5% |
Fine-tuning Arabert Transformer | 60.5% |
The SVM model and the fine-tuned Transformer were used to build an API with FastAPI, a lightweight Python web framework. To try out the application, follow these steps:
- Install all Python libraries that the notebooks depend on:
  pip install -r requirements.txt
- Download the required files from these links: TF-IDF / SVM-Model / Finetuned-Transformer-Weights
- Clone the Arabert repo:
  git clone https://github.com/aub-mind/arabert.git
- Run the server:
  python App_FastAPI.py -t [Tfidf Path] -ml [SVM Model Path] -dl [Transformer Weights Path]
- Navigate to your local host:
  http://localhost:8000/docs
- Enter the text that needs to be classified.
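The server command in the steps above takes three paths through the `-t`, `-ml`, and `-dl` flags. How `App_FastAPI.py` reads them is not shown here; one plausible way to parse such flags with the standard `argparse` module is sketched below. The `dest` names, help strings, and example file names are hypothetical placeholders.

```python
import argparse


def build_parser():
    """Parser for the three path flags used by the server command."""
    parser = argparse.ArgumentParser(
        description="Serve Arabic dialect classifiers (illustrative sketch)."
    )
    # The flag names come from the command above; dest names are assumptions.
    parser.add_argument("-t", dest="tfidf_path", required=True,
                        help="path to the fitted TF-IDF vectorizer")
    parser.add_argument("-ml", dest="svm_path", required=True,
                        help="path to the trained SVM model")
    parser.add_argument("-dl", dest="transformer_path", required=True,
                        help="path to the fine-tuned Transformer weights")
    return parser


# Placeholder file names, used only to demonstrate parsing.
args = build_parser().parse_args(
    ["-t", "tfidf.pkl", "-ml", "svm.pkl", "-dl", "weights.pt"]
)
print(args.tfidf_path, args.svm_path, args.transformer_path)
```

Marking all three flags as required makes the script fail early with a usage message if a path is missing, rather than failing later when a model is loaded.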