Reddit Flair Detector is a web application that predicts the flair of India subreddit posts using machine learning algorithms. The application can be found live at Reddit Flair Detector.
Contains the code used to scrape data from Reddit posts using the PRAW API and save it to a CSV file. See the notebook for details.
Contains the code used for data analysis, for training various machine learning models, and for checking their accuracy on different features. The models that gave the best results were downloaded.
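The scrape-and-save step could look roughly like the sketch below. It is not the notebook's code: the column names, the `max_comments` cap, the search query, and the helper names (`post_to_row`, `save_posts`, `fetch_posts`) are assumptions made for illustration, and the credentials are placeholders.

```python
import csv

# Hypothetical column layout; the notebook's actual columns may differ.
FIELDS = ["flair", "title", "url", "body", "comments"]


def post_to_row(post, max_comments=10):
    """Flatten one post-like object into a dict of CSV columns."""
    comments = " ".join(c.body for c in list(post.comments)[:max_comments])
    return {
        "flair": post.link_flair_text,
        "title": post.title,
        "url": post.url,
        "body": post.selftext,
        "comments": comments,
    }


def save_posts(posts, path):
    """Write post-like objects to a CSV file and return the row count."""
    rows = [post_to_row(p) for p in posts]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def fetch_posts(flair, limit=100):
    """Yield r/india posts for one flair via PRAW (needs real API credentials)."""
    import praw  # imported here so the helpers above work without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="flair-detector-sketch",  # placeholder
    )
    query = f'flair:"{flair}"'  # assumed search syntax for filtering by flair
    for post in reddit.subreddit("india").search(query, limit=limit):
        post.comments.replace_more(limit=0)  # drop "load more comments" stubs
        yield post
```

With valid credentials, `save_posts(fetch_posts("Politics"), "politics.csv")` would write one row per post; the flair name and file name here are only examples.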
- Title - Title of the post
- Url - URL of the post
- Body - Body of the post
- Comments - Comments of the post
- All features combined - Combination of all of the features listed above
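One way to build the combined feature is to join the individual text fields into a single string. The sketch below assumes each post is a dict with `title`, `url`, `body`, and `comments` keys; the function name and the `include_body` switch are hypothetical and not taken from the notebook.

```python
def combine_features(post, include_body=True):
    """Join the text fields of one post into a single string.

    Missing or empty fields are skipped so they do not add stray spaces.
    """
    parts = [post.get("title"), post.get("url")]
    if include_body:
        parts.append(post.get("body"))
    parts.append(post.get("comments"))
    return " ".join(p.strip() for p in parts if p and p.strip())
```

Passing `include_body=False` gives a combined feature that leaves the body out, which is useful when the body turns out to carry little signal.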
- Naive Bayes Classifier
- Linear SVM - Usually converged after 5-7 iterations
- Logistic Regression - Usually converged after 100-150 iterations
- Random Forest - Showed the best results most of the time
- Basic Neural Network - Tried various approaches, adjusting the number of layers, the number of neurons per layer, and the number of iterations, but none gave good results
See the notebook for details.
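A comparison of these classifiers on a text feature might be sketched with scikit-learn as follows. The TF-IDF vectorizer, the 75/25 split, and every hyperparameter here are assumptions for illustration, not the notebook's settings, and `compare_models` is a hypothetical helper name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_models(texts, flairs, seed=42):
    """Train each classifier on TF-IDF features and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, flairs, test_size=0.25, random_state=seed, stratify=flairs
    )
    # Hyperparameters below are illustrative guesses, not tuned values.
    models = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=200, random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # Each model gets its own vectorizer, fitted on the training split only.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores
```

Calling `compare_models` once per feature column (title, URL, body, comments, combined) would produce one accuracy per model and feature, which is the kind of comparison reported in the result tables.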
The directory contains 4 CSV files with data scraped using the Data Scrape notebook:
- data1.csv
  - Initial scrape of 100 posts per category.
- data2.csv
  - Categories were scraped individually so that the dataset has no empty values.
  - Done because flair prediction using the body as a feature performed very poorly, which I suspected was caused by some posts in the dataset having no body.
- data3.csv
  - 100 posts per category, with combined_features redefined.
  - After data2 there was minimal improvement with the body feature, so I concluded that the body is not a useful feature and excluded it from the combined features.
- data4.csv
  - Same as data3, but with 150 posts per category to train on a larger dataset.
Contains the best-performing models trained on each of the data CSV files. The Random Forest models could not be uploaded due to their large size.
Contains the Flask implementation of the app. It takes the URL of a Reddit India post and returns the flair predicted by the trained models.
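Before a prediction is made, the submitted URL presumably has to be checked and turned into a post reference. The sketch below shows one way to do that with the standard library only. The function name, the regular expression, and the choice to return `None` for invalid input are assumptions; the app's actual handling may differ.

```python
import re
from urllib.parse import urlparse

# Matches paths such as /r/india/comments/<id>/<slug>/
_POST_PATH = re.compile(r"^/r/india/comments/([a-z0-9]+)(?:/|$)", re.IGNORECASE)


def extract_india_post_id(url):
    """Return the post ID if `url` is an r/india post link, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    # Accept reddit.com and its subdomains (www, old, ...), nothing else.
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        return None
    match = _POST_PATH.match(parsed.path)
    return match.group(1).lower() if match else None
```

A Flask route could call this helper on the form input and show an error when it returns `None`. With PRAW, a valid URL can then be loaded through `reddit.submission(url=...)` to fetch the title, body, and comments for prediction.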
Create a virtual environment, install the dependencies, and start the server:
$ virtualenv -p python3 env
$ source env/bin/activate
$ pip install -r requirements.txt
$ python3 app.py
Machine Learning Algorithm | Accuracy_data1.csv | Accuracy_data2.csv | Accuracy_data3.csv | Accuracy_data4.csv |
---|---|---|---|---|
Naive Bayes | 0.65 | 0.63 | 0.67 | 0.65 |
Linear SVM | 0.71 | 0.65 | 0.72 | 0.69 |
Logistic Regression | 0.72 | 0.68 | 0.73 | 0.69 |
Random Forest | 0.70 | 0.66 | 0.74 | 0.69 |
MLP | 0.47 | 0.48 | 0.54 | 0.52 |
Machine Learning Algorithm | Accuracy_data1.csv | Accuracy_data2.csv | Accuracy_data3.csv | Accuracy_data4.csv |
---|---|---|---|---|
Naive Bayes | 0.25 | 0.26 | 0.24 | 0.27 |
Linear SVM | 0.38 | 0.35 | 0.35 | 0.39 |
Logistic Regression | 0.37 | 0.33 | 0.29 | 0.36 |
Random Forest | 0.35 | 0.35 | 0.40 | 0.37 |
MLP | 0.28 | 0.26 | 0.24 | 0.27 |
Machine Learning Algorithm | Accuracy_data1.csv | Accuracy_data2.csv | Accuracy_data3.csv | Accuracy_data4.csv |
---|---|---|---|---|
Naive Bayes | 0.36 | 0.34 | 0.32 | 0.36 |
Linear SVM | 0.35 | 0.40 | 0.40 | 0.43 |
Logistic Regression | 0.38 | 0.41 | 0.37 | 0.44 |
Random Forest | 0.38 | 0.37 | 0.40 | 0.44 |
MLP | 0.25 | 0.32 | 0.35 | 0.39 |
Machine Learning Algorithm | Accuracy_data1.csv | Accuracy_data2.csv | Accuracy_data3.csv | Accuracy_data4.csv |
---|---|---|---|---|
Naive Bayes | 0.53 | 0.53 | 0.55 | 0.53 |
Linear SVM | 0.72 | 0.71 | 0.68 | 0.68 |
Logistic Regression | 0.75 | 0.75 | 0.68 | 0.71 |
Random Forest | 0.78 | 0.78 | 0.72 | 0.70 |
MLP | 0.47 | 0.52 | 0.40 | 0.41 |
- http://www.storybench.org/how-to-scrape-reddit-with-python/
- https://towardsdatascience.com/scraping-reddit-data-1c0af3040768
- https://praw.readthedocs.io/en/latest/