The directory is a Flask web application set up for hosting on Pivotal servers. The files and folders are described below:
- `app.py` -- The file used to start the Flask server.
- `requirements.txt` -- Contains all Python dependencies of the project.
- `Procfile` -- Needed to deploy on Pivotal.
- `templates` -- Folder containing the HTML/CSS files.
- `Models` -- Folder containing the saved model.
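As a rough sketch of how such an `app.py` is typically wired together (the route, template name, model filename, and the `build_features` helper below are assumptions for illustration, not this repository's actual code):

```python
import joblib
from flask import Flask, render_template, request

app = Flask(__name__)
model = joblib.load("Models/best_model.pkl")  # assumed filename inside Models/

def build_features(url):
    # Placeholder: the real app would fetch the Reddit post behind this URL
    # via praw and clean its title/comments/body into one string (see the
    # sketches later in this document).
    return url

@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        # The form in templates/ submits the URL of a Reddit post.
        prediction = model.predict([build_features(request.form["url"])])[0]
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run()
```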
- Open the terminal.
- Clone the repository by entering `git clone https://github.com/devil-cyber/Reddit-Flair-Detection`.
- Ensure that `python3` and `pip` are installed on the system.
- Create a virtual environment by executing the following command: `virtualenv -p python3 env`.
- Activate the `env` virtual environment by executing the following command: `source env/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Enter the `python` shell, `import nltk`, execute `nltk.download('stopwords')`, and exit the shell.
- Now execute the following command: `python app.py`; it will serve the application on `localhost` with a port.
- Open that address in a web browser and use the application.
The Python dependencies of the project are listed in `requirements.txt`.
Going through the literature available on text processing and suitable machine learning algorithms for text classification, I based my approach on [2], which describes machine learning models such as Naive Bayes, Linear SVM, and Logistic Regression for text classification, with code snippets. Along with these, I tried other models such as the Random Forest algorithm. The test accuracies obtained in the various scenarios can be found in the next section.
The approach taken for the task is as follows:
- Collect 1800 India subreddit posts for each of the 15 flairs using the `praw` module [1] (a minimal collection sketch follows this list).
- The data includes title, comments, body, URL, author, score, ID, creation time, and number of comments.
- For comments, only top-level comments are considered in the dataset; no sub-comments are present.
- The title, comments, and body are cleaned by removing bad symbols and stopwords using `nltk`.
- Five types of features are considered for the given task:
a) Title
b) Comments
c) URLs
d) Body
e) Combining Title, Comments, Body, and URLs as one feature.
- The dataset is split into 70% train and 30% test data using `train_test_split` from `scikit-learn`.
- The dataset is then converted into count-vector and TF-IDF form.
- Then, the following ML algorithms (using `scikit-learn`) are applied to the dataset (a training sketch also follows this list):
a) Naive Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest
e) Multi-Layer Perceptron (MLP)
- Training and testing on the dataset showed that the Linear Support Vector Machine gave the best test accuracy of 77.97% when trained on the combined Title + Comments + Body + URL feature.
- The best model is saved and is used to predict the flair from the URL of a post (see the prediction sketch at the end of this section).
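A minimal sketch of the collection step, assuming `praw` credentials of your own; the flair list shown, the per-flair limit, and the output filename are placeholders, not values from this repository:

```python
import praw
import pandas as pd

# Placeholder credentials -- replace with your own Reddit API keys.
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="flair-detection-script")

flairs = ["AskIndia", "Politics", "Sports"]  # hypothetical subset of the 15 flairs
rows = []
subreddit = reddit.subreddit("india")

for flair in flairs:
    # Search the subreddit for posts carrying the given link flair.
    for post in subreddit.search(f'flair_name:"{flair}"', limit=120):
        post.comments.replace_more(limit=0)  # drop "load more" stubs
        # Only top-level comments are kept, matching the dataset description.
        top_level = " ".join(c.body for c in post.comments)
        rows.append({
            "flair": flair,
            "title": post.title,
            "comments": top_level,
            "body": post.selftext,
            "url": post.url,
            "author": str(post.author),
            "score": post.score,
            "id": post.id,
            "created": post.created_utc,
            "num_comments": post.num_comments,
        })

pd.DataFrame(rows).to_csv("reddit_india_flairs.csv", index=False)
```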
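And a minimal sketch of the cleaning and training pipeline on the combined feature, assuming the CSV produced above; the column names and the symbol-stripping regex are illustrative:

```python
import re
import nltk
import joblib
import pandas as pd
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))
BAD_SYMBOLS = re.compile(r"[^a-z0-9 #+_]")

def clean(text):
    # Lowercase, strip bad symbols, and drop stopwords.
    text = BAD_SYMBOLS.sub(" ", str(text).lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)

df = pd.read_csv("reddit_india_flairs.csv")
# The combined feature: Title + Comments + Body + URL as one string.
df["combined"] = (df["title"].fillna("") + " " + df["comments"].fillna("")
                  + " " + df["body"].fillna("") + " " + df["url"].fillna("")).map(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df["combined"], df["flair"], test_size=0.3, random_state=42)

# Count-vectorize, re-weight with TF-IDF, then fit a linear SVM.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

joblib.dump(model, "Models/best_model.pkl")  # saved for the Flask app
```

Swapping `LinearSVC` in the last pipeline step for `MultinomialNB`, `LogisticRegression`, `RandomForestClassifier`, or `MLPClassifier` gives the other models compared in the tables below.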
| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.6792452830 |
| Linear SVM | 0.8113207547 |
| Logistic Regression | 0.8231132075 |
| Random Forest | 0.8042452830 |
| MLP | 0.8042452830 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5636792452 |
| Linear SVM | 0.8278301886 |
| Logistic Regression | 0.8066037735 |
| Random Forest | 0.8207547169 |
| MLP | 0.7971698113 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5754716981 |
| Linear SVM | 0.7523584905 |
| Logistic Regression | 0.7523584905 |
| Random Forest | 0.6886792452 |
| MLP | 0.7523584905 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.4622641509 |
| Linear SVM | 0.4056603773 |
| Logistic Regression | 0.4716981132 |
| Random Forest | 0.4646226415 |
| MLP | 0.4599056603 |

| Machine Learning Algorithm | Test Accuracy |
|---|---|
| Naive Bayes | 0.5589622641 |
| Linear SVM | 0.8325471698 |
| Logistic Regression | 0.8254716981 |
| Random Forest | 0.8089622641 |
| MLP | 0.8372641509 |
Individually, the features showed test accuracies close to 82%, with the URL feature giving the worst accuracy.
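For completeness, predicting the flair from a post URL (as the web application does) might look like the following sketch, reusing the hypothetical model path and cleaning helper from above:

```python
import re
import praw
import joblib
from nltk.corpus import stopwords

model = joblib.load("Models/best_model.pkl")  # illustrative path
STOPWORDS = set(stopwords.words("english"))
BAD_SYMBOLS = re.compile(r"[^a-z0-9 #+_]")

def clean(text):
    # Same cleaning helper as in the training sketch.
    text = BAD_SYMBOLS.sub(" ", str(text).lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def predict_flair(url, reddit):
    # Fetch the submission behind the URL and rebuild the combined feature.
    post = reddit.submission(url=url)
    post.comments.replace_more(limit=0)
    comments = " ".join(c.body for c in post.comments)
    combined = clean(f"{post.title} {comments} {post.selftext} {post.url}")
    return model.predict([combined])[0]

reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="flair-detection-script")
print(predict_flair("https://www.reddit.com/r/india/comments/abc123/example/", reddit))
```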