Skip to content

Developing a Sarcasm Detection Solution using Machine Learning and Deep Learning Approaches

License

Notifications You must be signed in to change notification settings

mollha/Sarcasm-Detection

Repository files navigation

Machine Learning and Deep Learning Approaches to Sarcasm Detection

This project addresses the problem of sarcasm detection - often quoted as a subtask of sentiment analysis. There are two main scripts used to begin using this code - train.py requires a fair amount of setup, however console.py can be run very quickly, so long as the correct dependencies are installed (listed below)

  • train.py : trains and evaluates new models on chosen dataset, saving these models to Code/pkg/trained_models/
    • To run train.py, follow the Data configuration and Setup instructions before proceeding
  • console.py : makes predictions using existing trained models, where user input can be provided via a console. A visualisation of attention weights is produced in /colorise.html
    • Our best-performing model, the Bidirectional Long Short-Term Memory model trained on ElMo vectors with Attention is provided to get started
    • It is possible to interact with other models, however they will need to be trained first using train.py

To use this code, clone this repository then navigate to the root directory.

  • Move to Code/, then:
    • On Windows, execute the command "python console.py" or "python train.py"
    • On Linux, execute the command "python3 console.py" or "python3 train.py"

List of dependencies:

Data configuration and Setup

Datasets can be collected from the following sources:

  • Twitter data - Ptáček et al. (2014):

    • Collected from: http://liks.fav.zcu.cz/sarcasm/
    • Uses the EN balanced corpus containing 100,000 tweet IDs that must be scraped from the Twitter API - Twitter scraper can be found in Code/pkg/datasets/ptacek/processing_scrips/TwitterCrawler
    • Once downloaded, move normal.txt and sarcastic.txt (files from download) into Code/pkg/datasets/ptacek/raw_data\
  • News headlines - Misra et al. (2019):

  • Amazon reviews - Filatova et al. (2012):

    • Data is downloaded in .rar format
    • https://github.com/ef2020/SarcasmAmazonReviewsCorpus/
    • Only Ironic.rar and Regular.rar is used in this project
    • Convert Ironic.rar and Regular.rar (files from download) into regular folders, then move them to Code/pkg/datasets/amazon_reviews/raw_data

After downloading the data - proceed to reformat it into a csv and apply our data cleaning processes:

  • Run p1_create_raw_csv.py followed by p2_clean_original_data.py to achieve the correct configuration => NOTE: p1_create_raw_csv.py will take some time on the Twitter dataset, as it is slow to scrape the Tweets given their ids
    e.g. Code/pkg/datasets/news_headlines/
                                                      ├── /processed_data
                                                                          ├── ...
                                                                          ├── /CleanData.csv
                                                                          ├── /OriginalData.csv
                                                      ├── /processing_scripts/...
                                                      ├── /raw_Data/...

Language models can be downloaded from the following sources:

  • ELMO:

  • GloVe

    • Download the GloVe database
    • Select the glove.twitter.27B.50d.txt file and place it in a subdirectory called glove, e.g. Code/pkg/language_models/glove/

Example Visualisation 1 Example Visualisation 2