A web scraper that fetches Reddit posts and saves them to CSV files. It searches specific subreddits for given keywords, then quantifies the posts with pre-trained sentiment analysis models (Flair, TextBlob, and VADER). The sentiment results are also saved as CSV.
download_data_from_reddit.py
- A scraper script that searches Reddit posts by keyword within a subreddit of interest.
- It uses the Pushshift API at https://api.pushshift.io/. There is no need to get API secret keys from reddit.com to use the Pushshift API (as of this writing).
- Sample data generated by the script looks like this.
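As a rough illustration of the keyword search described above, the sketch below builds a Pushshift submission-search URL and flattens a response into CSV-ready rows. It is not the code from download_data_from_reddit.py: the function names `build_search_url` and `extract_rows`, the chosen query parameters, and the mapping of Pushshift's `selftext` field to a `subtext` column are assumptions made for this example. It also does not perform the HTTP request itself.

```python
from urllib.parse import urlencode

# Pushshift submission search endpoint (assumed; check the current API docs).
PUSHSHIFT_SUBMISSION_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_search_url(keyword, subreddit, after, before, size=100):
    """Build a search URL for one keyword in one subreddit.

    `after` and `before` are epoch timestamps in seconds bounding the search.
    """
    params = {
        "q": keyword,
        "subreddit": subreddit,
        "after": after,
        "before": before,
        "size": size,
    }
    return PUSHSHIFT_SUBMISSION_URL + "?" + urlencode(params)


def extract_rows(payload):
    """Flatten a decoded JSON response into a list of row dictionaries.

    Assumes posts are listed under the "data" key. The "subtext" column
    name is taken from this README; Pushshift calls the field "selftext".
    """
    rows = []
    for post in payload.get("data", []):
        rows.append({
            "id": post.get("id"),
            "created_utc": post.get("created_utc"),
            "title": post.get("title", ""),
            "subtext": post.get("selftext", ""),
        })
    return rows
```

The URL can be fetched with any HTTP client, and the resulting rows can be written out with `csv.DictWriter`.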
reddit_post_sentiment_analysis.py
- Takes the CSV file generated by download_data_from_reddit.py.
- Combines the title and subtext columns and performs sentiment analysis on the combined text.
- Runs Flair (https://pypi.org/project/flair/), TextBlob (https://pypi.org/project/textblob/), and VADER (https://www.nltk.org/_modules/nltk/sentiment/vader.html) to get sentiment scores.
- Sample data generated at this stage looks like this.
- Buckets the rows by hour, combining all values within each hour. Sentiment scores are averaged, and missing values are set to 0.
- The final sample data looks like this.
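The text-combining and hourly bucketing steps above can be sketched as follows. This is a minimal standard-library illustration, not the code in reddit_post_sentiment_analysis.py, which may well use pandas instead. The function names `combine_text` and `bucket_hourly`, the `created_utc` timestamp column, and the reading of "missing values are set to 0" as "hours with no posts get a score of 0" are all assumptions. The sentiment scores are taken as already computed (for example by VADER or TextBlob) and stored in each row.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def combine_text(title, subtext):
    """Join title and body into one string, skipping empty parts."""
    parts = (title, subtext)
    return " ".join(p.strip() for p in parts if p and p.strip())


def bucket_hourly(rows, score_keys):
    """Average sentiment scores per hour.

    rows: dicts with 'created_utc' (epoch seconds) and one float per score key.
    Hours between the first and last post that contain no posts are emitted
    with every score set to 0.0 (an assumed reading of the README).
    """
    if not rows:
        return []
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        ts = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
        for key in score_keys:
            sums[hour][key] += row[key]

    first, last = min(counts), max(counts)
    buckets = []
    hour = first
    while hour <= last:
        n = counts.get(hour, 0)
        bucket = {"hour": hour.isoformat()}
        for key in score_keys:
            # Average when the hour has posts, otherwise fill with 0.0.
            bucket[key] = sums[hour][key] / n if n else 0.0
        buckets.append(bucket)
        hour += timedelta(hours=1)
    return buckets
```

For example, two posts in one hour scored 0.5 and 0.25 would yield a single bucket with a score of 0.375, and an hour with no posts between two populated hours would yield a bucket with a score of 0.0.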
This framework is used in https://github.com/pratikpv/predicting_bitcoin_market
Credits: Code from https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563 served as the base for the scraper code.