Reddit Data

Download submissions from selected subreddits. The data is exported as a .csv file; all timestamps are in UTC:

| Column | Description | Type |
| --- | --- | --- |
| submission | The id of the submission | string |
| subreddit | The subreddit name | string |
| author | The redditor's username | string |
| created | Time the submission was created | number |
| retrieved | Time the submission was retrieved | number |
| edited | Time the submission was modified | number |
| pinned | Whether or not the submission is pinned | number |
| archived | Whether or not the submission is archived | number |
| locked | Whether or not the submission is locked | number |
| removed | Whether or not the submission was removed by moderators | number |
| deleted | Whether or not the submission was deleted by its author | number |
| is_self | Whether or not the submission is a text post | number |
| is_video | Whether or not the submission is a video | number |
| is_original_content | Whether or not the submission has been marked as original content | number |
| title | The title of the submission | string |
| link_flair_text | The text content of the submission's link flair | string |
| upvote_ratio | The percentage of upvotes from all votes on the submission | number |
| score | The number of upvotes for the submission | number |
| gilded | The number of gilded awards on the submission | number |
| total_awards_received | The number of awards on the submission | number |
| num_comments | The number of comments on the submission | number |
| num_crossposts | The number of crossposts of the submission | number |
| selftext | The submission selftext on text posts | string |
| thumbnail | The submission thumbnail on image posts | string |
| shortlink | The short url of the submission | string |
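
For example, an exported file can be loaded with pandas and the epoch columns converted to datetimes; the file name wallstreetbets.csv is only a placeholder:

import pandas as pd

# load the exported submissions (file name is a placeholder)
df = pd.read_csv("wallstreetbets.csv")

# created/retrieved/edited hold unix timestamps in UTC
for column in ["created", "retrieved", "edited"]:
    df[column] = pd.to_datetime(df[column], unit="s", utc=True)

# flags like pinned/archived/locked are stored as numbers (0/1)
print(df[["title", "created", "score", "num_comments"]].head())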

Install

Install python3 and pip3; you will also need git.

Packages

sudo apt install libsnappy-dev
pip3 install -r requirements.txt

Source

git clone https://github.com/leukipp/reddit-data
cd reddit-data

Environment

Create a file .streamlit/secrets.toml and set the environment variables:

# application
USER_AGENT="python:https://github.com/[USER]/[REPOSITORY]"

# reddit api
REDDIT_CLIENT_ID="[...]"
REDDIT_CLIENT_SECRET="[...]"

# kaggle api (optional)
KAGGLE_USERNAME="[...]"
KAGGLE_KEY="[...]"
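
Since the credentials live in .streamlit/secrets.toml, they can be read via Streamlit's st.secrets. Whether this repository uses PRAW internally is an assumption, but a minimal sketch of a typical Reddit API client setup with these secrets looks like:

import praw
import streamlit as st

# credentials are read from .streamlit/secrets.toml
reddit = praw.Reddit(
    client_id=st.secrets["REDDIT_CLIENT_ID"],
    client_secret=st.secrets["REDDIT_CLIENT_SECRET"],
    user_agent=st.secrets["USER_AGENT"],
)

# without a username/password the client is read-only,
# which is sufficient for downloading submissions
print(reddit.read_only)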

Kaggle is only required if you want to upload the dataset to Kaggle on a regular basis. In that case, create a config/kaggle.json file, similar in structure to the dataset-metadata.json file.
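
The dataset-metadata.json file follows Kaggle's standard dataset metadata format; a minimal sketch with placeholder values (the exact keys expected in config/kaggle.json are an assumption based on that format) could look like:

{
  "title": "Reddit Data",
  "id": "[USER]/[DATASET]",
  "licenses": [
    {
      "name": "CC0-1.0"
    }
  ]
}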

Run

Adapt the start time (unix timestamp) in config/loader.json and run:

python3 data.py <subreddit1> <subreddit2> <subreddit3> ...
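
A minimal sketch of what config/loader.json might contain (the key name start is a guess, not confirmed by the repository; 1609459200 is the unix timestamp for 2021-01-01 00:00:00 UTC), followed by an invocation with arbitrary example subreddits:

{
  "start": 1609459200
}

python3 data.py wallstreetbets stocks cryptocurrency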

Download

Feel free to download some of the existing datasets available on Kaggle as well.

License

MIT
