reddit-scraper

By Jarret Jeter

Reddit web-scraper project using the Python Reddit API Wrapper (praw) and the Python Pushshift.io API Wrapper (psaw) to collect submission and comment data on a particular subject.

Technologies Used

  • Azure Storage
  • Python
  • Pandas
  • PRAW
  • PSAW
  • Typer

Description

With Python's praw package I instantiate a Reddit object from a Reddit application's credentials, which is then used by the Pushshift API Wrapper. Through Reddit's API, the Reddit object can access many things such as subreddits, users, topics, and comments. I defined functions in scraper2.py that structure submission and comment data from Reddit into CSVs the way I like, and that merge multiple batches of data into a single CSV when required. I also defined functions in blobs.py to upload files to and download them from an Azure Storage container.
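The flow above can be sketched roughly as follows. This is a hedged illustration, not the repository's actual code: the function names, the column layout of the batches, and the user agent string are all assumptions, and the environment variable names come from the setup steps below.

```python
import os

import pandas as pd


def make_pushshift_api():
    """Build a PushshiftAPI backed by an authenticated praw.Reddit object."""
    import praw  # installed via requirements.txt
    from psaw import PushshiftAPI

    reddit = praw.Reddit(
        client_id=os.environ["REDDSCRP_PU_SCRIPT"],
        client_secret=os.environ["REDDSCRP_SECRET"],
        user_agent="reddit-scraper",  # any descriptive agent string works
    )
    return PushshiftAPI(reddit)


def merge_batches(frames: list) -> pd.DataFrame:
    """Merge several batches of scraped rows, dropping duplicates by id."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates(subset="id").reset_index(drop=True)


# Example: merging two overlapping comment batches into one frame
batch1 = pd.DataFrame({"id": ["c1", "c2"], "body": ["first", "second"]})
batch2 = pd.DataFrame({"id": ["c2", "c3"], "body": ["second", "third"]})
merged = merge_batches([batch1, batch2])
print(len(merged))  # 3 unique comments
```

The merged frame could then be written out with `merged.to_csv(...)` into the "data" directory described in the setup steps.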

Setup/Installation Requirements

A Reddit account and application, plus an Azure Storage account and container, are required to run this code. Also make sure you have a text editor such as Visual Studio Code installed, an active Python 3.8 virtual environment, and a Linux bash terminal to use.

  • Clone this repository (https://github.com/jarretjeter/reddit-scraper.git) onto your local computer from GitHub
  • Open the project in VS Code or another text editor
  • In your terminal, run "pip install -r requirements.txt" to install the necessary dependencies
  • Create environment variables holding your Reddit application's personal use script (REDDSCRP_PU_SCRIPT), its secret token (REDDSCRP_SECRET), and your Azure Storage account connection string (REDDIT_STUFF_CONN_STR)
  • Create a directory named "data" to save any files to
  • Once everything is set up, run the scraper2.py functions from the command line (for example: "python scraper2.py fetch_threads [subreddit] [subject]" or "python scraper2.py fetch_comments [subreddit] [subject]") to retrieve data from Reddit
  • For blobs.py, run "python blobs.py containers" to list all the containers in your storage account, or run "python blobs.py upload [filename] [container_name]" / "python blobs.py download [filename] [container_name]" to transfer files to or from your storage container
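Internally, the upload and download commands above might look something like this. This is a sketch under assumptions: it presumes the azure-storage-blob package and the REDDIT_STUFF_CONN_STR connection string from the earlier step, and the helper names are illustrative rather than the repository's actual functions.

```python
import os
from pathlib import Path

DATA_DIR = Path("data")  # files are read from / saved to the data directory


def local_path(filename: str) -> Path:
    """Resolve a bare filename to its location under the data directory."""
    return DATA_DIR / filename


def upload(filename: str, container_name: str) -> None:
    """Upload data/<filename> to an Azure Storage container."""
    from azure.storage.blob import BlobServiceClient  # assumed dependency

    service = BlobServiceClient.from_connection_string(
        os.environ["REDDIT_STUFF_CONN_STR"]
    )
    container = service.get_container_client(container_name)
    with open(local_path(filename), "rb") as f:
        container.upload_blob(name=filename, data=f, overwrite=True)


def download(filename: str, container_name: str) -> None:
    """Download a blob from an Azure Storage container into data/."""
    from azure.storage.blob import BlobServiceClient  # assumed dependency

    service = BlobServiceClient.from_connection_string(
        os.environ["REDDIT_STUFF_CONN_STR"]
    )
    container = service.get_container_client(container_name)
    data = container.download_blob(filename).readall()
    local_path(filename).write_bytes(data)
```

A Typer CLI (listed under Technologies Used) can expose functions like these directly as the `upload` and `download` subcommands shown above.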

Known Bugs

  • No known bugs.

License

MIT

Let me know if you have any questions at jarretjeter@gmail.com

Copyright (c) 11/11/2022 Jarret Jeter
