An NLP research project utilizing the "cardiffnlp/twitter-roberta-base-sentiment-latest" pre-trained transformer for tweet tokenization. The project includes an attention-based biLSTM model that predicts sentiment labels for tweets as negative (-1), neutral (0), or positive (1).

nlp-roBERTa-biLSTM-attention

The code repository for the Nature Scientific Reports (2023) paper Interpretable Sentiment Analysis of COVID-19 Tweets using Attention-based BiLSTM and Twitter-RoBERTa

DOI

Created and maintained by Md Abrar Jahin <abrar.jahin.2652@gmail.com, mdabrar.jahin@oist.jp>.

Datasets

Extended Datasets

To address the research gaps identified by Qi and Shabrina (2023), we expanded the existing COVID-19 Twitter dataset. Our datasets address two limitations highlighted in their paper: the short timeline and the geographical constraints of the tweets. Each dataset includes a column of cleaned text, produced by preprocessing the raw tweets and comments, and a sentiment label for each entry: negative (-1), neutral (0), or positive (1).
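
The -1/0/1 label scheme is easy to read, but many multi-class losses (for example, Keras' sparse categorical cross-entropy) expect class indices that start at 0. A minimal sketch of a round-trip mapping follows. The helper names are hypothetical, and whether the notebooks shift labels in exactly this way is an assumption.

```python
# Sentiment labels as stored in the datasets.
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}


def to_class_index(label: int) -> int:
    """Map a dataset label (-1, 0, 1) to a class index (0, 1, 2)."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected sentiment label: {label!r}")
    return label + 1


def from_class_index(index: int) -> int:
    """Map a class index (0, 1, 2) back to a dataset label (-1, 0, 1)."""
    if index not in (0, 1, 2):
        raise ValueError(f"unexpected class index: {index!r}")
    return index - 1


# Example: decode a predicted class index into a readable name.
predicted_index = 2
predicted_name = LABEL_NAMES[from_class_index(predicted_index)]
```

Keeping both directions in one place avoids off-by-one mistakes when comparing model outputs against the labels stored in the CSV files.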

External Datasets

To assess the robustness and generalizability of our proposed model, we benchmarked it on external datasets that were not part of the original training and testing data. These benchmarks show how the model handles diverse, unseen data and help validate its performance under varying conditions.

| Dataset | Description |
|---|---|
| UK Twitter COVID-19 Dataset | COVID-19 tweets collected only from the major cities in the UK (Qi and Shabrina, 2023) |
| Global Twitter COVID-19 Dataset | We extended the existing UK COVID-19 dataset by scraping an additional 411,885 tweets from 32 English-speaking countries |
| USA Twitter COVID-19 Dataset | We extended the existing UK COVID-19 dataset by scraping an additional 7,500 tweets from the USA only |
| External Reddit Dataset | 36,801 comments |
| External Twitter Dataset | 162,980 tweets |
| External Apple Twitter Dataset | 1,630 tweets |
| External US Airline Twitter Dataset | 14,640 tweets |

Classical Models

Qi and Shabrina (2023) benchmarked 3000 observations from their UK COVID-19 Twitter dataset using Random Forest, Multinomial NB, and SVM. We additionally benchmarked the same portion of the dataset using tree-based gradient boosting models (LGBM, CatBoost, XGBoost, GBM), RandomForest+KNN+MLP stacking, RandomForest bagging, and RandomForest+GBM voting. Each of these traditional models was evaluated separately with CountVectorizer, TF-IDF, and word2vec as the tokenization method.
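
As a rough illustration of one configuration described above (TF-IDF features fed to a RandomForest+GBM voting ensemble), the sketch below uses scikit-learn. The toy tweets, labels, and hyperparameters are made up for the example and are not taken from the notebooks.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; labels follow the -1/0/1 scheme used by the datasets.
tweets = [
    "lockdown is awful and exhausting",
    "so tired of this terrible pandemic",
    "the briefing is scheduled for today",
    "cases were reported in the city",
    "great news about the vaccine",
    "so happy that restrictions are easing",
]
labels = [-1, -1, 0, 0, 1, 1]

# Soft voting averages the class probabilities of both ensemble members.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gbm", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",
)

# The pipeline turns raw text into TF-IDF features before classification.
model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(tweets, labels)

predictions = model.predict(["vaccine news is great", "lockdown is awful"])
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the pipeline gives the bag-of-words variant of the same setup.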

We also show how classical and ensemble models perform with pretrained transformer-based tokenizers: BERT (classical and ensemble), RoBERTa (classical and ensemble), and Sentence Transformer (classical and ensemble).

Pretrained Models

[1] twitter-roberta-base-sentiment-latest

[2] distilbert-base-uncased

[3] all-MiniLM-L6-v2
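
The Twitter-RoBERTa model card suggests masking user mentions and links before tokenization. The sketch below follows that convention and then loads the tokenizer through Hugging Face `transformers`. The function names and the `max_length` value are assumptions chosen for illustration, not the notebooks' settings.

```python
def preprocess(text: str) -> str:
    """Replace user mentions with '@user' and links with 'http'."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)


def tokenize(texts, max_length=128):
    """Tokenize tweets with the Twitter-RoBERTa tokenizer.

    The tokenizer files are downloaded on first use. max_length=128 is an
    assumed value, not one taken from the notebooks.
    """
    # Imported lazily so that preprocess() works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "cardiffnlp/twitter-roberta-base-sentiment-latest"
    )
    return tokenizer(
        [preprocess(t) for t in texts],
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
```

Calling `tokenize(["@bob stay safe https://t.co/abc"])` would return a dictionary-like object whose `input_ids` and `attention_mask` arrays can be fed to a downstream model.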

Deep-Learning Models

All implemented DL model architectures, with their associated code and outputs, can be found in Twitter-RoBERTa+LSTM. Our proposed model, the attention-based BiLSTM, was trained on Twitter-RoBERTa tokenized inputs.
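
To make the role of the attention layer concrete, the sketch below shows attention pooling over a sequence of hidden states in plain Python: each time step receives a score, the scores are normalized with softmax, and the output is the weighted sum of the states. It uses simple dot-product scoring against a query vector, which is an assumption; the attention layer in the notebooks may be parameterized differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool a sequence of hidden states into one context vector.

    hidden_states: list of equal-length float lists, one per time step
                   (for example, BiLSTM outputs).
    query:         float list of the same length (hypothetical learned vector).
    Returns (context_vector, attention_weights).
    """
    # Dot-product score between each hidden state and the query.
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, one output value per dimension.
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Example: two time steps with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_pool(states, query=[2.0, 0.0])
```

The returned weights sum to 1 and indicate how much each time step contributes to the pooled representation, which is one reason attention-based models are easier to inspect.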

XAI

The relevant files are in the XAI folder (Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\XAI).

LIME

LIME visualization

SHAP

SHAP visualization

Requirements

The notebooks already include the installation commands for the required Python packages, so the packages are not listed here.

CPU Environment

The notebooks can be run on CPU using Google Colab or Kaggle, but producing the outputs may take a significant amount of time.

GPU Environment

Some of the notebooks were executed using Kaggle's GPU T4x2 and GPU P100. Kaggle provides a GPU quota of 30 hours per week, while Colab has a restricted usage limit.

TPU Environment

Some of the notebooks were executed on Kaggle's TPU VM v3-8, which proved to be much faster than GPU. Kaggle provides a quota of 20 hours per week for TPU usage. However, the following additional code needs to be added before constructing the neural network model:

import tensorflow as tf

# detect and initialize the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()

# instantiate a distribution strategy
# (TF 2.3+; older releases expose it as tf.distribute.experimental.TPUStrategy)
tpu_strategy = tf.distribute.TPUStrategy(tpu)

# building and compiling the model inside the strategy scope places its variables on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )

# train the model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)

Directory Tour

Below is an illustration of the directory structure of nlp-roBERTa-biLSTM-attention.


📁 nlp-roBERTa-biLSTM-attention
└── 📁 BERT
    📁 nlp-roBERTa-biLSTM-attention\BERT
    ├── 📄 all_models1.png
    ├── 📄 all_models2.png
    ├── 📄 all_models3.png
    ├── 📄 all_models4.png
    ├── 📄 lgb_knn_mlp.png
    ├── 📁 model1_keras_1_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\BERT\model1_keras_1_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📁 model2_keras_3_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\BERT\model2_keras_3_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📄 rf_knn_mlp.png
    ├── 📄 rf_stacking_voting.png
└── 📁 BoW
    📁 nlp-roBERTa-biLSTM-attention\BoW
    ├── 📄 all_models_1.png
    ├── 📄 all_models_2.png
    ├── 📄 all_models_3.png
    ├── 📄 all_models_4.png
    ├── 📄 lgb_knn_mlp.png
    ├── 📁 model1_keras_1_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\BoW\model1_keras_1_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📁 model2_keras_3_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\BoW\model2_keras_3_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📄 rf_knn_mlp.png
    ├── 📄 rf_stacking_voting.png
└── 📁 Data_scraping
    📁 nlp-roBERTa-biLSTM-attention\Data_scraping
    ├── 📄 Twint-data collection.ipynb
    ├── 📄 Twitter academic api.ipynb
└── 📁 Extended_datasets
    📁 nlp-roBERTa-biLSTM-attention\Extended_datasets
    ├── 📁 Global_covid_twitter_data
    │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data
    │   ├── 📁 BiLSTM+CNN
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\BiLSTM+CNN
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📄 Global.csv
    │   ├── 📄 Global_twitter_data_preprocessing.ipynb
    │   ├── 📄 global-tweets_4_baseline_models.ipynb
    │   ├── 📁 model1_keras_1_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\model1_keras_1_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model2_keras_3_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\model2_keras_3_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model3_BiLSTM
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\model3_BiLSTM
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model4_BiLSTM+attention
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\model4_BiLSTM+attention
    │   │   ├── 📄 best-model-global.ipynb
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 preprocessed_dataset
    │   │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Global_covid_twitter_data\preprocessed_dataset
    │   │   ├── 📄 sample_data_global_0.csv
    │   │   ├── 📄 sample_data_global_1.csv
    │   │   ├── 📄 sample_data_global_10.csv
    │   │   ├── 📄 sample_data_global_11.csv
    │   │   ├── 📄 sample_data_global_12.csv
    │   │   ├── 📄 sample_data_global_13.csv
    │   │   ├── 📄 sample_data_global_14.csv
    │   │   ├── 📄 sample_data_global_15.csv
    │   │   ├── 📄 sample_data_global_16.csv
    │   │   ├── 📄 sample_data_global_17.csv
    │   │   ├── 📄 sample_data_global_18.csv
    │   │   ├── 📄 sample_data_global_19.csv
    │   │   ├── 📄 sample_data_global_2.csv
    │   │   ├── 📄 sample_data_global_20.csv
    │   │   ├── 📄 sample_data_global_21.csv
    │   │   ├── 📄 sample_data_global_22.csv
    │   │   ├── 📄 sample_data_global_23.csv
    │   │   ├── 📄 sample_data_global_24.csv
    │   │   ├── 📄 sample_data_global_25.csv
    │   │   ├── 📄 sample_data_global_26.csv
    │   │   ├── 📄 sample_data_global_27.csv
    │   │   ├── 📄 sample_data_global_28.csv
    │   │   ├── 📄 sample_data_global_29.csv
    │   │   ├── 📄 sample_data_global_3.csv
    │   │   ├── 📄 sample_data_global_30.csv
    │   │   ├── 📄 sample_data_global_31.csv
    │   │   ├── 📄 sample_data_global_32.csv
    │   │   ├── 📄 sample_data_global_33.csv
    │   │   ├── 📄 sample_data_global_34.csv
    │   │   ├── 📄 sample_data_global_35.csv
    │   │   ├── 📄 sample_data_global_36.csv
    │   │   ├── 📄 sample_data_global_37.csv
    │   │   ├── 📄 sample_data_global_38.csv
    │   │   ├── 📄 sample_data_global_39.csv
    │   │   ├── 📄 sample_data_global_4.csv
    │   │   ├── 📄 sample_data_global_40.csv
    │   │   ├── 📄 sample_data_global_5.csv
    │   │   ├── 📄 sample_data_global_6.csv
    │   │   ├── 📄 sample_data_global_7.csv
    │   │   ├── 📄 sample_data_global_8.csv
    │   │   ├── 📄 sample_data_global_9.csv
    │   ├── 📄 tweets_distribution_global.png
    │   ├── 📄 word_cloud_global.png
    │   ├── 📄 word_freq.png
    ├── 📁 Only_USA_covid_twitter_data
    │   📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data
    │   └── 📁 BiLSTM+CNN
    │       📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data\BiLSTM+CNN
    │       ├── 📄 classification_report1.png
    │       ├── 📄 classification_reports.png
    │       ├── 📄 confusion_matrix.png
    │   └── 📄 Only_USA.csv
    │   └── 📄 frequency.png
    │   └── 📁 model1_keras_1_dense_layers
    │       📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data\model1_keras_1_dense_layers
    │       ├── 📄 classification_report1.png
    │       ├── 📄 classification_reports.png
    │       ├── 📄 confusion_matrix.png
    │   └── 📁 model2_keras_3_dense_layers
    │       📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data\model2_keras_3_dense_layers
    │       ├── 📄 classification_report1.png
    │       ├── 📄 classification_reports.png
    │       ├── 📄 confusion_matrix.png
    │   └── 📁 model3_BiLSTM
    │       📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data\model3_BiLSTM
    │       ├── 📄 classification_report1.png
    │       ├── 📄 classification_reports.png
    │       ├── 📄 confusion_matrix.png
    │   └── 📁 model4_BiLSTM+attention
    │       📁 nlp-roBERTa-biLSTM-attention\Extended_datasets\Only_USA_covid_twitter_data\model4_BiLSTM+attention
    │       ├── 📄 accuracy.png
    │       ├── 📄 best-model-only-usa.ipynb
    │       ├── 📄 classification_report1.png
    │       ├── 📄 classification_reports.png
    │       ├── 📄 confusion_matrix.png
    │       ├── 📄 loss.png
    │       ├── 📄 model_architecture.png
    │   └── 📄 only_USA-tweets-4_baseline_models.ipynb
    │   └── 📄 only_USA_twitter_data_preprocessing.ipynb
    │   └── 📄 sample_data_only_USA.csv
    │   └── 📄 uk_covid_twitter_sentiment.ipynb
    │   └── 📄 word_cloud.png
└── 📁 External_datasets
    📁 nlp-roBERTa-biLSTM-attention\External_datasets
    ├── 📁 Apple_twitter_sentiments
    │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments
    │   ├── 📁 BiLSTM+CNN
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments\BiLSTM+CNN
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📄 apple-tweets.ipynb
    │   ├── 📁 model1_keras_1_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments\model1_keras_1_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model2_keras_3_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments\model2_keras_3_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model3_BiLSTM
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments\model3_BiLSTM
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model4_BiLSTM+attention
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Apple_twitter_sentiments\model4_BiLSTM+attention
    │   │   └── 📄 accuracy.png
    │   │   └── 📄 best-model-apple-twitter.ipynb
    │   │   └── 📄 classification_reports1.png
    │   │   └── 📄 classification_reports2.png
    │   │   └── 📄 confusion_matrix.png
    │   │   └── 📄 loss.png
    ├── 📁 Reddit
    │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit
    │   ├── 📁 BiLSTM+CNN
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit\BiLSTM+CNN
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📄 Reddit_Data.csv
    │   ├── 📁 model1_keras_1_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit\model1_keras_1_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model2_keras_3_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit\model2_keras_3_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model3_BiLSTM
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit\model3_BiLSTM
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model4_BiLSTM+attention
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Reddit\model4_BiLSTM+attention
    │   │   ├── 📄 LIME.png
    │   │   ├── 📄 SHAP_bar.png
    │   │   ├── 📄 SHAP_bar_ascending.png
    │   │   ├── 📄 SHAP_bar_descending.png
    │   │   ├── 📄 SHAP_explain.png
    │   │   ├── 📄 best-model-reddit.ipynb
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   │   ├── 📄 target_predictions.png
    │   ├── 📄 reddit-tweets-1.ipynb
    │   ├── 📄 reddit-tweets-2.ipynb
    ├── 📁 Twitter
    │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter
    │   ├── 📁 BiLSTM+CNN
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter\BiLSTM+CNN
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📄 Twitter_Data.csv
    │   ├── 📁 model1_keras_1_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter\model1_keras_1_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model2_keras_3_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter\model2_keras_3_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model3_BiLSTM
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter\model3_BiLSTM
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model4_BiLSTM+attention
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\Twitter\model4_BiLSTM+attention
    │   │   ├── 📄 best-model-twitter-external.ipynb
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📄 twitter-tweets-2.ipynb
    │   ├── 📄 twitter_tweets_1.ipynb
    ├── 📁 US_airlines_twitter_sentiments
    │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments
    │   ├── 📁 BiLSTM+CNN
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments\BiLSTM+CNN
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model1_keras_1_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments\model1_keras_1_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model2_keras_3_dense_layers
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments\model2_keras_3_dense_layers
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model3_BiLSTM
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments\model3_BiLSTM
    │   │   ├── 📄 classification_report1.png
    │   │   ├── 📄 classification_report2.png
    │   │   ├── 📄 confusion_matrix.png
    │   ├── 📁 model4_BiLSTM+attention
    │   │   📁 nlp-roBERTa-biLSTM-attention\External_datasets\US_airlines_twitter_sentiments\model4_BiLSTM+attention
    │   │   └── 📄 best-model-us-airlines.ipynb
    │   │   └── 📄 classification_report1.png
    │   │   └── 📄 classification_report2.png
    │   │   └── 📄 confusion_matrix.png
    ├── 📄 token.txt
└── 📄 LICENSE
└── 📁 Previous_research
    📁 nlp-roBERTa-biLSTM-attention\Previous_research
    ├── 📄 1.png
    ├── 📄 2.png
    ├── 📄 Vaibhav 2022.pdf
    ├── 📄 Yuxing 2023.pdf
└── 📄 README.md
└── 📁 RoBERTa
    📁 nlp-roBERTa-biLSTM-attention\RoBERTa
    ├── 📄 cardiff_all_models_1.png
    ├── 📄 cardiff_all_models_2.png
    ├── 📄 cardiff_all_models_3.png
    ├── 📄 cardiff_all_models_4.png
    ├── 📄 lgb+knn+mlp.png
    ├── 📄 rf_stacking_voting.png
    ├── 📄 roberta_base_rf+knn+mlp.png
└── 📁 SBERT
    📁 nlp-roBERTa-biLSTM-attention\SBERT
    ├── 📄 all_models_1.png
    ├── 📄 all_models_2.png
    ├── 📄 all_models_3.png
    ├── 📄 all_models_4.png
    ├── 📄 all_models_5.png
    ├── 📄 lgb_knn_mlp.png
    ├── 📁 model1_keras_1_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\SBERT\model1_keras_1_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📁 model2_keras_3_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\SBERT\model2_keras_3_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📄 rf_knn_mlp.png
    ├── 📄 rf_stacking_voting.png
└── 📁 TF-IDF
    📁 nlp-roBERTa-biLSTM-attention\TF-IDF
    ├── 📄 all_models_1.png
    ├── 📄 all_models_2.png
    ├── 📄 all_models_3.png
    ├── 📄 all_models_4.png
    ├── 📄 lgbm_knn_mlp.png
    ├── 📁 model1_keras_1_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\TF-IDF\model1_keras_1_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📁 model2_keras_3_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\TF-IDF\model2_keras_3_dense_layers
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    ├── 📄 rf_bagging.png
    ├── 📄 rf_knn_mlp.png
    ├── 📄 rf_stacking_voting.png
└── 📁 Target_lexicon_selection
    📁 nlp-roBERTa-biLSTM-attention\Target_lexicon_selection
    ├── 📄 target_lexicon_selection.ipynb
    ├── 📄 textblob1.png
    ├── 📄 textblob2.png
    ├── 📄 textblob3.png
    ├── 📄 textblob4.png
    ├── 📄 vader1.png
    ├── 📄 vader2.png
    ├── 📄 vader3.png
    ├── 📄 vader4.png
    ├── 📄 wordnet1.png
    ├── 📄 wordnet2.png
    ├── 📄 wordnet3.png
    ├── 📄 wordnet4.png
└── 📁 Twitter-RoBERTa+LSTM
    📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM
    ├── 📁 BiLSTM+CNN
    │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\BiLSTM+CNN
    │   ├── 📄 accuracy.png
    │   ├── 📄 biLSTM+CNN.ipynb
    │   ├── 📄 classification_report.png
    │   ├── 📄 confusion_matrix.png
    │   ├── 📄 loss.png
    │   ├── 📄 model_architecture.png
    ├── 📁 model1_keras_1_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model1_keras_1_dense_layers
    │   ├── 📄 accuracy.png
    │   ├── 📄 classification_report1.png
    │   ├── 📄 classification_report2.png
    │   ├── 📄 confusion_matrix.png
    │   ├── 📄 loss.png
    │   ├── 📄 model1.ipynb
    │   ├── 📄 model_architecture.png
    │   ├── 📄 summary.png
    ├── 📁 model2_keras_3_dense_layers
    │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model2_keras_3_dense_layers
    │   ├── 📄 accuracy.png
    │   ├── 📄 classification_report.png
    │   ├── 📄 classification_report1.png
    │   ├── 📄 confusion_matrix.png
    │   ├── 📄 loss.png
    │   ├── 📄 model2.ipynb
    │   ├── 📄 model_architecture.png
    │   ├── 📄 model_summary.png
    ├── 📁 model3_BiLSTM
    │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model3_BiLSTM
    │   ├── 📄 accuracy.png
    │   ├── 📄 classification_report1.png
    │   ├── 📄 classification_report2.png
    │   ├── 📄 confusion_matrix.png
    │   ├── 📄 loss.png
    │   ├── 📄 model_architecture.png
    │   ├── 📄 summary.png
    │   ├── 📄 target_val_counts.png
    │   ├── 📄 train_val.png
    ├── 📁 model4_BiLSTM+attention
    │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention
    │   └── 📁 XAI
    │       📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\XAI
    │       ├── 📁 Lime
    │       │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\XAI\Lime
    │       │   ├── 📄 lime1.png
    │       │   ├── 📄 lime2.png
    │       │   ├── 📄 lime3.png
    │       │   ├── 📄 lime4.png
    │       │   ├── 📄 lime5.png
    │       │   ├── 📄 lime6.png
    │       │   ├── 📄 lime7.png
    │       ├── 📁 SHAP
    │       │   📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\XAI\SHAP
    │       │   └── 📄 shap_neg1.png
    │       │   └── 📄 shap_neg2.png
    │       │   └── 📄 shap_neg_bar_ascending.png
    │       │   └── 📄 shap_neg_bar_descending.png
    │       │   └── 📄 shap_neu1.png
    │       │   └── 📄 shap_neu2.png
    │       │   └── 📄 shap_neu_bar.png
    │       │   └── 📄 shap_neu_bar_ascending.png
    │       │   └── 📄 shap_neu_bar_descending.png
    │       │   └── 📄 shap_pos1.png
    │       │   └── 📄 shap_pos2.png
    │       │   └── 📄 shap_pos_bar_ascending.png
    │       │   └── 📄 shap_pos_bar_descending.png
    │   └── 📄 learning_rates.png
    │   └── 📄 model_architecture.png
    │   └── 📄 summary.png
    │   └── 📁 uk_twitter_data_3k
    │       📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\uk_twitter_data_3k
    │       ├── 📄 accuracy.png
    │       ├── 📄 best-model_uk-tweet_3k.ipynb
    │       ├── 📄 classification_report.png
    │       ├── 📄 classification_report2.png
    │       ├── 📄 confusion_matrix.png
    │       ├── 📄 loss.png
    │       ├── 📄 train_val_loss.png
    │   └── 📁 uk_twitter_data_all
    │       📁 nlp-roBERTa-biLSTM-attention\Twitter-RoBERTa+LSTM\model4_BiLSTM+attention\uk_twitter_data_all
    │       └── 📄 accuracy.png
    │       └── 📄 best-model-uk-twitter-all.ipynb
    │       └── 📄 classification_report1.png
    │       └── 📄 classification_report2.png
    │       └── 📄 collage.png
    │       └── 📄 confusion_matrix.png
    │       └── 📄 loss.png
└── 📁 UK_covid_twitter_data
    📁 nlp-roBERTa-biLSTM-attention\UK_covid_twitter_data
    ├── 📄 all_cities.csv
    ├── 📄 sample_data_3000.csv
    ├── 📄 sample_data_all.csv
    ├── 📄 stacked bar graph.png
    ├── 📄 tweets distribution.png
    ├── 📄 uk_twitter_data_preprocessing.ipynb
└── 📄 list.md
└── 📄 sentiment_distribution_barchart.png
└── 📄 sentiment_distribution_pie_chart.png
└── 📄 uk-twitter-3k-classical-modelling.ipynb
└── 📁 word2vec
    📁 nlp-roBERTa-biLSTM-attention\word2vec
    └── 📄 all_models_1.png
    └── 📄 all_models_2.png
    └── 📄 all_models_3.png
    └── 📄 all_models_4.png
    └── 📄 lgb_knn_mlp.png
    └── 📁 model1_keras_1_dense_layers
        📁 nlp-roBERTa-biLSTM-attention\word2vec\model1_keras_1_dense_layers
        ├── 📄 classification_report.png
        ├── 📄 confusion_matrix.png
    └── 📁 model2_keras_3_dense_layers
        📁 nlp-roBERTa-biLSTM-attention\word2vec\model2_keras_3_dense_layers
        ├── 📄 classification_report.png
        ├── 📄 confusion_matrix.png
    └── 📄 rf_knn_mlp.png
    └── 📄 rf_stacking_voting.png

Citations for datasets

Kaggle

@misc{md_abrar_jahin_2023,
	title={Extended Covid Twitter Datasets},
	url={https://www.kaggle.com/ds/3205649},
	DOI={10.34740/KAGGLE/DS/3205649},
	publisher={Kaggle},
	author={Md Abrar Jahin},
	year={2023}
}

Mendeley

Jahin, Md Abrar (2023), “Extended Covid Twitter Datasets”, Mendeley Data, V1, doi: 10.17632/2ynwykrfgf.1

Code

@software{md_abrar_jahin_2024_13840678,
  author       = {Md Abrar Jahin},
  title        = {Abrar2652/nlp-roBERTa-biLSTM-attention: v1.0.0},
  month        = sep,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.13840678},
  url          = {https://doi.org/10.5281/zenodo.13840678}
}

License

MIT licensed, except where otherwise stated. See the LICENSE file.
