Arabic Dialect Classification

Abstract

Though Arabic is the lingua franca of most of the MENA region, different dialects of Arabic are used in different countries. While some dialects differ significantly from each other, others, particularly those in close geographic proximity, are more difficult to tease apart. Automatically distinguishing between dialectal variations is valuable for many downstream applications such as machine translation, POS tagging, geo-locating users, and author profiling.

Tried Methods

TF-IDF + Linear SVM
Keras Embedding Layer + LSTM
Arabert Transformer Embedding + LSTM
Fine-tuning Arabert Transformer
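
As a rough illustration of the first method, the sketch below wires a TF-IDF vectorizer into a linear SVM with scikit-learn. The README does not state which features or hyperparameters were used, so the character n-gram settings here are assumptions, and the four training sentences are toy examples invented for illustration rather than QADI data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples invented for illustration; they are not drawn from QADI.
train_texts = [
    "ازيك عامل ايه النهارده",
    "انا مش عارف اعمل ايه",
    "كيفك شو عم تعمل اليوم",
    "ما بعرف شو بدي اعمل",
]
train_labels = ["EG", "EG", "LB", "LB"]

# Assumption: character n-grams inside word boundaries; the original
# notebooks may use word-level features or different n-gram ranges.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

# Predict a country-level label for an unseen sentence.
predictions = model.predict(["شو عم تعمل"])
print(predictions)
```

Bundling the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training split only, which avoids leaking test data into the features.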

Training Dataset

The QADI dataset was used for this task. Its authors automatically collected tweets covering a wide range of country-level Arabic dialects from 18 countries in the Middle East and North Africa region. To build the dataset, they applied multiple filters to identify each user's country from the account description and to eliminate tweets that are written in Modern Standard Arabic or contain inappropriate language. In an intrinsic evaluation, the labels of a randomly selected set of tweets were 91.5% accurate.

Results

Model	Accuracy
TF-IDF + Linear SVM	56.1%
Keras Embedding Layer + LSTM	52.1%
Arabert Transformer Embedding + LSTM	52.5%
Fine-tuning Arabert Transformer	60.5%
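
For reference, the sketch below shows how an accuracy figure like those above could be computed, assuming it is the plain fraction of test tweets whose predicted label matches the true label. The `accuracy` helper and the sample labels are hypothetical and only illustrate the arithmetic.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: three of the four predictions are correct.
score = accuracy(["EG", "LB", "SA", "EG"], ["EG", "LB", "EG", "EG"])
print(f"{score:.1%}")  # prints 75.0%
```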

Application

The SVM model and the fine-tuned transformer are served through an API built with FastAPI, a lightweight Python web framework. To try out the application, follow these steps:

Install all Python libraries that the notebooks depend on:

pip install -r requirements.txt

Download the required files from these links: TF-IDF / SVM-Model / Finetuned-Transformer-Weights
Clone the Arabert repo:

git clone https://github.com/aub-mind/arabert.git

Run the server:

python App_FastAPI.py -t [Tfidf Path] -ml [SVM Model Path] -dl [Transformer Weights Path]

Open http://localhost:8000/docs in your browser.
Enter the text that needs to be classified and submit the request.
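
The three flags in the run command above could be parsed along the following lines. This is a minimal sketch using `argparse`, not the actual contents of App_FastAPI.py: only the flag spellings `-t`, `-ml`, and `-dl` come from the command, while the attribute names and the example file names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the run command."""
    parser = argparse.ArgumentParser(
        description="Arabic dialect classification API"
    )
    # Flag spellings follow the run command; the dest names are assumptions.
    parser.add_argument("-t", dest="tfidf_path", required=True,
                        help="path to the fitted TF-IDF vectorizer")
    parser.add_argument("-ml", dest="svm_path", required=True,
                        help="path to the trained SVM model")
    parser.add_argument("-dl", dest="weights_path", required=True,
                        help="path to the fine-tuned transformer weights")
    return parser


# Example invocation with hypothetical file names.
args = build_parser().parse_args(
    ["-t", "tfidf.pkl", "-ml", "svm.pkl", "-dl", "weights.pt"]
)
print(args.tfidf_path, args.svm_path, args.weights_path)
```

Marking all three paths as required makes the server fail fast with a usage message when a model file is not supplied, rather than failing on the first request.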

MohamedHamayed/Arabic-Dialect-Classification
