Online Hate Speech and Misogyny Detection in the Albanian Language

Project Overview

This repository contains the code and data for a research project on detecting hate speech and misogyny in the Albanian language. The project explores several text classification approaches, including traditional machine learning models and deep learning models, to identify and classify hate speech. It compares three text vectorization techniques (TF-IDF, Word2Vec, and BERT embeddings) and evaluates the results obtained with each.

Data Description

Dataset: The dataset (merged_labeled_dataset.csv) contains user comments in the Albanian language, manually annotated with labels indicating whether the comment contains hate speech or not.
Classes: The dataset includes two classes:
- 0: Non-hateful comments
- 1: Hateful comments, including misogyny
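
Before training, it can help to check how the two classes are distributed. The following is a minimal sketch using only the standard library. The column names "comment" and "label" and the sample rows are assumptions made for illustration; they are not confirmed by the dataset itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for merged_labeled_dataset.csv.
# The column names "comment" and "label" are assumptions.
SAMPLE_CSV = """comment,label
koment shembull nje,0
koment shembull dy,1
koment shembull tre,0
"""

def class_distribution(csv_file, label_column="label"):
    """Count how many rows carry each class label (0 or 1)."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)

counts = class_distribution(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())
for label in sorted(counts):
    print(f"class {label}: {counts[label]} comments ({counts[label] / total:.0%})")
```

For the real file, the same function could be called on open("merged_labeled_dataset.csv", newline="", encoding="utf-8"), provided the label column name is adjusted to match.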

Preprocessing

The preprocessing steps include:

- Removal of stop words, punctuation, links, mentions, etc.
- Text vectorization using one of three techniques (BERT embeddings, TF-IDF, Word2Vec).
- PCA is applied to reduce the dimensionality of the BERT embeddings.
- SMOTE is applied to balance the dataset by oversampling the minority class.
- Features are scaled using StandardScaler.
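
The text-cleaning step listed above could look roughly like the sketch below. It is not the repository's actual code: the stop-word list is a tiny hypothetical sample, the regular expressions are one possible choice, and lowercasing is an added assumption.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the project's real list is not shown here.
STOP_WORDS = {"dhe", "në", "të", "e", "i", "është", "për"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+")                # user mentions
PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only

def clean_comment(text):
    """Lowercase, then strip links, mentions, punctuation, and stop words."""
    text = text.lower()                 # assumption: case is not informative
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_comment("@dikush Ky është një koment! Shiko https://example.com"))
# → ky një koment shiko
```

Links and mentions are removed before punctuation so that the "@" and "://" markers are still present when the patterns run.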

The general preprocessing code is available under the data directory, while the vectorization is done separately and can be found in the bert, td-idf and word2vec directories.
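
To illustrate the idea behind SMOTE oversampling, here is a simplified pure-Python sketch of its core interpolation step. It is an illustration only, not the imbalanced-learn implementation the project may rely on; the function name, the toy data, and the parameter defaults are all hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority-class feature vectors (hypothetical values)
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))
```

Each synthetic point lies on a line segment between two existing minority samples. Oversampling is normally applied to the training split only, so that synthetic points do not leak into evaluation.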

Models

The following models were trained and evaluated:

Traditional Machine Learning Models:
- Logistic Regression
- Random Forest
- Support Vector Classifier (SVC)
Deep Learning Models:
- Convolutional Neural Network (CNN)
- Long Short-Term Memory (LSTM)

These models were trained using the following vectorization techniques:

- TF-IDF
- Word2Vec
- BERT Embeddings

Model training scripts are located in the directory of the corresponding vectorization technique (bert, td-idf, word2vec).
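
As a rough illustration of what TF-IDF vectorization computes, the sketch below weights each term by its frequency in a comment and its rarity across all comments. It uses the plain textbook formula (tf = count / document length, idf = log(N / df)). Libraries such as scikit-learn use smoothed variants, so the absolute values will differ from the project's. The example comments are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed textbook TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["koment i mirë", "koment i keq", "lajm i mirë"]  # hypothetical cleaned comments
weights = tfidf(docs)
print(weights[1])
```

A term that occurs in every comment ("i" here) receives weight 0, while a term unique to one comment ("keq") receives the highest weight in that comment.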

Performance Metrics

The models were evaluated using various performance metrics:

- Accuracy
- Precision
- Recall
- F1-Score
- ROC Curve and AUC
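
For reference, the first four metrics can be derived from the confusion-matrix counts of a binary classifier, as in the sketch below. The labels and predictions are toy values, not results from this project, and ROC/AUC is left out because it requires predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 1 = hateful, 0 = non-hateful
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.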

The best-performing model was Random Forest with Word2Vec embeddings, which achieved an accuracy of 92.5% and an F1-score of 93%.

Future Work

Fine-tuning of Advanced Models: Further research is needed to fine-tune advanced models like BERT and explore deep learning architectures like BiLSTM for better performance in the Albanian context.
Dataset Expansion: There is a need for larger and more diverse datasets to improve model generalization and accuracy.

Contributors

Sindi Buklaji - Student at the Technical University of Munich

Feel free to contribute to the repository for further improvements!

Repository: SindiBuklaji/hatespeech_al