kujaomega/crawl_subreddit

Assumptions:

In this project I assume that the Reddit endpoint ("/r/Python.json") returns the latest posts, ordered by most recent update.
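As a minimal sketch of what consuming that endpoint involves, the function below pulls the posts and the pagination cursor out of one page of a Reddit listing response. The function name parse_listing and the sample payload are hypothetical and are not taken from reddit_crawler.py; the "data" / "children" / "after" layout and the "t3" kind for posts follow Reddit's public listing format.

```python
from typing import Any, Dict, List, Optional, Tuple


def parse_listing(payload: Dict[str, Any]) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Return (posts, after_cursor) for one page of a Reddit listing.

    Hypothetical helper: only the listing layout comes from Reddit,
    the rest is an illustration.
    """
    data = payload.get("data", {})
    # Posts are wrapped as {"kind": "t3", "data": {...}}; other kinds are skipped.
    posts = [
        child["data"]
        for child in data.get("children", [])
        if child.get("kind") == "t3"
    ]
    # "after" is the cursor for the next page, or None on the last page.
    return posts, data.get("after")


# Made-up sample page, shaped like a response from "/r/Python.json".
sample_page = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc",
        "children": [
            {"kind": "t3", "data": {"id": "abc", "title": "Hello", "author": "alice"}},
            {"kind": "t1", "data": {"id": "c1", "author": "bob"}},
        ],
    },
}

posts, after = parse_listing(sample_page)
```

A crawler would request the next page by passing the returned cursor as the "after" query parameter until it comes back as None.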

Because the API calls are throttled, I have not implemented parallelism: the bottleneck is the limit of 600 API calls per 600 seconds.
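The limit of 600 calls in 600 seconds works out to one call per second on average, so a sequential crawler only needs to space its requests. The sketch below shows one way to do that; the Throttle class and its parameters are hypothetical and are not the code used in this repository. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out sequential calls so that at most max_calls happen per period.

    Hypothetical helper, assuming an evenly spaced budget (one call per
    period_seconds / max_calls seconds).
    """

    def __init__(
        self,
        max_calls: int = 600,
        period_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = period_seconds / max_calls
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            # Time still owed since the previous call, never negative.
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay
```

Calling wait() immediately before each request keeps a single-threaded crawler inside the budget, which is why parallel workers would not make the crawl any faster.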

I chose AWS Lambda for this project because it is a statistics API and will not need to be requested every second, so Lambda functions keep the cost of the API down. As AWS has a free tier and MongoDB offers a free database, I have implemented the services at no cost.

To use the service, I first created a MongoDB database and stored the database information in a mongo_db.json file. To execute the files, I assume you are using Ubuntu and have Docker and Python 3.6 installed. You also need AWS credentials stored in your "~/.aws/" folder. First, to crawl all the data into the database, execute reddit_crawler/reddit_crawler.py. Then execute reddit_crawler/deploy_lambda.sh to deploy the lambda that keeps the database updated. Finally, execute api_endpoint/deploy_lambda.sh to create the API endpoints.

BONUS:

I chose the architecture explained above because it makes this step easy to implement.

I have implemented top submitters and top commenters as a personal challenge, but I have not implemented the most active users due to lack of time.
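As an illustration of the top submitters ranking, the sketch below counts posts per author in memory. The function name top_submitters and the record shape (a dict with an "author" key) are assumptions made for this example; the deployed lambda may well compute the ranking in the database instead.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_submitters(posts: Iterable[Dict[str, Any]], limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to `limit` (author, post_count) pairs, most prolific first.

    Hypothetical helper: assumes each post record carries an "author" key.
    """
    # Records without an author (missing or None) are ignored.
    counts = Counter(post["author"] for post in posts if post.get("author"))
    return counts.most_common(limit)


# Made-up sample records for illustration only.
sample_posts = [
    {"author": "alice"},
    {"author": "bob"},
    {"author": "alice"},
    {"author": None},
    {"author": "alice"},
    {"author": "bob"},
    {"author": "carol"},
]

ranking = top_submitters(sample_posts)
```

The same counting approach would apply to top commenters, with comment records in place of posts.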

I have implemented all posts by a user and all posts a user commented on as a personal challenge.

I have implemented the average comment karma for a user as a challenge, but I have not implemented the top 10 most valued users due to lack of time.

I have implemented a script for deploying the API endpoint, but I have not implemented the tests in continuous integration due to lack of time.

The endpoints are not working at the moment.

ENDPOINTS: The API calls will be available for a week or until I receive a response.

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10punctuation

queryparam: rank["all", "discussion" or "external"]

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10comments

queryparam: rank["all", "discussion" or "external"]

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10submitters

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10commenters

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/allPostsByUser

queryparam: author

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/allPostsByUserComments

queryparam: author

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/averageCommentKarma

queryparam: author
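The endpoint list above can be summarised as a small URL builder. This is a sketch only: the build_url function and the ENDPOINT_PARAMS table are hypothetical, and the code assumes that each listed query parameter is required, which the list does not state. It only builds URLs and does not send requests.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev"
RANKS = ("all", "discussion", "external")

# Endpoint name -> name of its query parameter (None when it takes none).
ENDPOINT_PARAMS = {
    "top10punctuation": "rank",
    "top10comments": "rank",
    "top10submitters": None,
    "top10commenters": None,
    "allPostsByUser": "author",
    "allPostsByUserComments": "author",
    "averageCommentKarma": "author",
}


def build_url(endpoint: str, value: Optional[str] = None) -> str:
    """Build the GET URL for one endpoint (hypothetical helper)."""
    if endpoint not in ENDPOINT_PARAMS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    param = ENDPOINT_PARAMS[endpoint]
    url = f"{BASE_URL}/{endpoint}"
    if param is None:
        return url
    # Assumption: the query parameter is mandatory for this endpoint.
    if value is None:
        raise ValueError(f"{endpoint} requires the {param!r} query parameter")
    if param == "rank" and value not in RANKS:
        raise ValueError(f"rank must be one of {RANKS}")
    # urlencode escapes the value so author names are safe in the URL.
    return f"{url}?{urlencode({param: value})}"


ranked = build_url("top10comments", "discussion")
by_author = build_url("allPostsByUser", "some user")
```

Keeping the parameter rules in one table makes it easy to see which endpoints take "rank", which take "author", and which take nothing.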

About

This is a subreddit crawler
