devil-cyber/Reddit-Flair-Detection

Directory Structure

The repository is a Flask web application set up for hosting on Pivotal servers. A description of the files and folders can be found below:

  1. app.py -- The file used to start the Flask server.
  2. requirements.txt -- Contains all Python dependencies of the project.
  3. Procfile -- Needed to set up Pivotal.
  4. templates -- Folder containing the HTML/CSS files.
  5. Models -- Folder containing the saved model.

Project Execution

  1. Open the terminal.
  2. Clone the repository: git clone https://github.com/devil-cyber/Reddit-Flair-Detection
  3. Ensure that Python 3 and pip are installed on the system.
  4. Create a virtualenv by executing the following command: virtualenv -p python3 env.
  5. Activate the env virtual environment by executing the following command: source env/bin/activate.
  6. Enter the cloned repository directory and execute pip install -r requirements.txt.
  7. Enter the Python shell, import nltk, execute nltk.download('stopwords') and exit the shell.
  8. Execute python app.py; the server will start on localhost and print the port.
  9. Open that address in a web browser and use the application.
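The steps above can be condensed into the following shell session (a sketch assuming a Linux/macOS environment; the clone URL is the one given in step 2):

```shell
# Clone the repository and enter it
git clone https://github.com/devil-cyber/Reddit-Flair-Detection
cd Reddit-Flair-Detection

# Create and activate a Python 3 virtual environment
virtualenv -p python3 env
source env/bin/activate

# Install dependencies and the nltk stopwords corpus
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords')"

# Start the server (app.py starts the Flask server, per the directory structure)
python app.py
```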

Dependencies

The following dependencies can be found in requirements.txt:

  1. praw
  2. scikit-learn
  3. nltk
  4. Flask
  5. bs4
  6. pandas
  7. numpy

Approach

After going through the literature available on text processing and machine learning algorithms for text classification, I based my approach on [2], which describes machine learning models such as Naive Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. Along with these, I also tried other models such as Random Forest. The test accuracies obtained in various scenarios can be found in the Results section.

The approach taken for the task is as follows:

  1. Collect 1800 India subreddit posts for each of the 15 flairs using the praw module [1].
  2. The data includes title, comments, body, URL, author, score, id, time created and number of comments.
  3. For comments, only top-level comments are included in the dataset; no sub-comments are present.
  4. The title, comments and body are cleaned by removing bad symbols and stopwords using nltk.
  5. Five types of features are considered for the given task:
     a) Title
     b) Comments
     c) URLs
     d) Body
     e) Title, Comments, Body and URLs combined as one feature.
  6. The dataset is split into 70% train and 30% test data using train_test_split from scikit-learn.
  7. The dataset is then converted into count-vector and TF-IDF form.
  8. The following ML algorithms (using scikit-learn) are applied to the dataset:
     a) Naive Bayes
     b) Linear Support Vector Machine
     c) Logistic Regression
     d) Random Forest
  9. Training and testing on the dataset showed that the Linear Support Vector Machine gave the best test accuracy of 77.97% when trained on the combined Title + Comments + Body + URL feature.
  10. The best model is saved and used to predict the flair from the URL of a post.
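The train/test/vectorize steps above can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the repository's actual code: variable names and hyperparameters here are assumptions.

```python
# A minimal sketch of the split + vectorize + train steps: 70/30 split,
# count-vector and TF-IDF features, and the Linear SVM variant.
# All names here are illustrative, not the repository's own.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train_flair_classifier(texts, flairs):
    """Train a Linear SVM flair classifier; return it with its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, flairs, test_size=0.3, random_state=42)  # 70% train / 30% test
    model = Pipeline([
        ("vect", CountVectorizer()),    # raw token counts
        ("tfidf", TfidfTransformer()),  # TF-IDF weighting
        ("clf", LinearSVC()),           # linear support vector machine
    ])
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Swapping `LinearSVC()` for `MultinomialNB()`, `LogisticRegression()` or `RandomForestClassifier()` in the same pipeline reproduces the other model variants compared in the Results section.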

Results

Title as Feature

| Machine Learning Algorithm | Test Accuracy |
| -------------------------- | ------------- |
| Naive Bayes                | 0.6792452830  |
| Linear SVM                 | 0.8113207547  |
| Logistic Regression        | 0.8231132075  |
| Random Forest              | 0.8042452830  |
| MLP                        | 0.8042452830  |

Body as Feature

| Machine Learning Algorithm | Test Accuracy |
| -------------------------- | ------------- |
| Naive Bayes                | 0.5636792452  |
| Linear SVM                 | 0.8278301886  |
| Logistic Regression        | 0.8066037735  |
| Random Forest              | 0.8207547169  |
| MLP                        | 0.7971698113  |

URL as Feature

| Machine Learning Algorithm | Test Accuracy |
| -------------------------- | ------------- |
| Naive Bayes                | 0.5754716981  |
| Linear SVM                 | 0.7523584905  |
| Logistic Regression        | 0.7523584905  |
| Random Forest              | 0.6886792452  |
| MLP                        | 0.7523584905  |

Comments as Feature

| Machine Learning Algorithm | Test Accuracy |
| -------------------------- | ------------- |
| Naive Bayes                | 0.4622641509  |
| Linear SVM                 | 0.4056603773  |
| Logistic Regression        | 0.4716981132  |
| Random Forest              | 0.4646226415  |
| MLP                        | 0.4599056603  |

Title + Comments + URL + Body as Feature

| Machine Learning Algorithm | Test Accuracy |
| -------------------------- | ------------- |
| Naive Bayes                | 0.5589622641  |
| Linear SVM                 | 0.8325471698  |
| Logistic Regression        | 0.8254716981  |
| Random Forest              | 0.8089622641  |
| MLP                        | 0.8372641509  |

Intuition behind Combined Feature

Individually, the Title and Body features each reached a test accuracy near 82%, while the URL feature lagged behind and the Comments feature gave the worst accuracies during training.
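One plausible way to build the combined Title + Comments + Body + URL feature is simple string concatenation before vectorizing; this helper is an assumption for illustration, as the repository's actual joining logic may differ.

```python
# Hypothetical helper: joins the four cleaned text fields into one
# document string for the combined feature.
def combine_features(title, comments, body, url):
    """Concatenate non-empty text fields, separated by single spaces."""
    return " ".join(part for part in (title, comments, body, url) if part)
```

Skipping empty fields means a post with, say, no body still yields a well-formed combined document.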