MohamedHamayed/Arabic-Dialect-Classification

Arabic Dialect Classification

Abstract

Though Arabic is the lingua franca of most of the MENA region, different dialects of Arabic are used in different countries. While some dialects differ significantly from each other, others, particularly those in close geographic proximity, are more difficult to tease apart. Automatically distinguishing between dialectal variations is valuable for many downstream applications such as machine translation, POS tagging, geo-locating users, and author profiling.

Tried Methods

  1. TF-IDF + Linear SVM
  2. Keras Embedding Layer + LSTM
  3. AraBERT Transformer Embedding + LSTM
  4. Fine-tuning AraBERT Transformer
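
To make the first method more concrete, here is a dependency-free sketch of the TF-IDF weighting step that turns a tweet into features for a linear classifier. It is not the repository's code: the use of character trigrams, the smoothed IDF formula, the function names, and the three sample strings are all assumptions made for illustration. In practice a library vectorizer and a linear SVM would replace this.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (assumed feature unit)."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(documents, n=3):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    doc_count = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each n-gram once per document.
        doc_freq.update(set(char_ngrams(doc, n)))
    return {
        gram: math.log((1 + doc_count) / (1 + df)) + 1
        for gram, df in doc_freq.items()
    }


def tfidf_vector(text, idf, n=3):
    """L2-normalised TF-IDF weights for one text; unseen n-grams are ignored."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    weights = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


# Illustrative sample texts (hypothetical, not drawn from the dataset).
sample_tweets = ["ازيك عامل ايه", "كيفك شو الاخبار", "شلونك شخبارك"]
idf_table = fit_idf(sample_tweets)
features = tfidf_vector("كيفك اليوم", idf_table)
print(len(features), "non-zero features")
```

N-grams that appear in few texts receive a higher weight than n-grams shared by all of them, which is what lets a linear model pick up on dialect-specific spellings.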

Training Dataset

The QADI dataset was used for this task. Its authors automatically collected tweets covering a wide range of country-level Arabic dialects from 18 countries in the Middle East and North Africa region. They built the dataset by applying multiple filters: identifying each user's country from the account description, and eliminating tweets that are written in Modern Standard Arabic or contain inappropriate language. In an intrinsic evaluation, the labels of a randomly selected set of tweets were 91.5% accurate.

Results

| Model | Accuracy |
|---|---|
| TF-IDF + Linear SVM | 56.1% |
| Keras Embedding Layer + LSTM | 52.1% |
| Arabert Transformer Embedding + LSTM | 52.5% |
| Fine-tuning Arabert Transformer | 60.5% |
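
Accuracy here is the share of evaluated tweets whose predicted country matches the label. A minimal sketch of that calculation follows. The function name and the two-letter country labels are illustrative assumptions, not taken from the repository.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for four tweets: three predictions match.
gold = ["EG", "SA", "IQ", "EG"]
pred = ["EG", "SA", "EG", "EG"]
print(f"{accuracy(gold, pred):.1%}")
```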

Application

The SVM model and the fine-tuned Transformer were used to build an API with FastAPI, a modern, easy-to-use web framework. To try out the application, follow these steps:

  1. Install all Python libraries that the notebooks depend on:

pip install -r requirements.txt

  2. Download the required files from these links: TF-IDF / SVM-Model / Finetuned-Transformer-Weights

  3. Clone the AraBERT repo:

git clone https://github.com/aub-mind/arabert.git

  4. Run the server:

python App_FastAPI.py -t [Tfidf Path] -ml [SVM Model Path] -dl [Transformer Weights Path]

  5. Navigate to http://localhost:8000/docs on your local machine

  6. Enter the text that needs to be classified
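
Besides the interactive page, the running server can also be inspected from a script. FastAPI serves an OpenAPI schema at `/openapi.json` by default, so the sketch below lists the endpoints the app exposes without guessing their names. It assumes the app keeps that default schema URL and listens on port 8000 as in the steps above. The helper names are made up for this example.

```python
import json
from urllib.request import urlopen


def fetch_openapi_schema(base_url="http://localhost:8000"):
    """Download the schema FastAPI serves by default at /openapi.json."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs declared in an OpenAPI schema dict."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method in http_methods:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)
```

With the server running, `list_endpoints(fetch_openapi_schema())` returns the routes to send classification requests to. The request body each route expects is described in the same schema and on the `/docs` page.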

Demo

[Four demo images not reproduced here.]
