Archive of historically trending GitHub repositories on Hacker News. See it live at https://before.kevinhu.io.
Hacker News is a great place to find high-quality GitHub repositories. However, posted repositories are quickly pushed out by new submissions, making older ones hard to discover. This project scrapes and presents every linked GitHub repository since GitHub's founding in 2008.
- Hacker News posts mentioning 'github.com' are scraped from the Algolia API in `scraper/1a_fetch_hackernews.py`. It takes about five minutes to download every post since 2008, when GitHub was founded.
- The raw JSON files from the Algolia API are consolidated and stored in `.feather` format for fast loading in `scraper/2a_convert_hackernews.py`.
- The consolidated posts are grouped by day, sorted by descending popularity, and output to a single JSON file for the web client by `scraper/3a_aggregate_hackernews.py`.
- The web client loads the `.json` file and uses it to render the posts. This is a standard React app deployed to GitHub Pages. After compiling with webpack and compressing, the total size of the site is about 7 MB.
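The first step above can be sketched as follows. This is a minimal illustration of querying the Algolia HN Search API (`hn.algolia.com/api/v1`) for stories mentioning 'github.com', paginating backwards by timestamp; the function names and pagination scheme are assumptions for illustration, not the repository's actual code.

```python
# Sketch of fetching HN stories mentioning "github.com" from the Algolia
# HN Search API. Names here are illustrative, not the project's actual code.
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"

def build_params(before=None):
    """Query parameters for one page of results.

    Paginating on the created_at_i timestamp (rather than page numbers)
    sidesteps Algolia's page-depth limit when walking the full history.
    """
    params = {"query": "github.com", "tags": "story", "hitsPerPage": "1000"}
    if before is not None:
        params["numericFilters"] = f"created_at_i<{before}"
    return params

def fetch_page(before=None):
    """Fetch one page of matching stories, newest first."""
    url = API + "?" + urllib.parse.urlencode(build_params(before))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["hits"]
```

A full scrape would loop `fetch_page`, feeding the last hit's `created_at_i` back in as `before` until no hits remain, and dump each page to a raw JSON file.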
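The aggregation step can be sketched as grouping posts by calendar day and sorting each day's posts by points, descending. Field names follow the Algolia schema (`created_at_i`, `points`); the code itself is an assumption-laden illustration, not the project's actual implementation.

```python
# Sketch of the daily aggregation: bucket posts by UTC day, then sort each
# bucket by points so the most popular submissions come first.
import json
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_by_day(posts):
    days = defaultdict(list)
    for post in posts:
        ts = post["created_at_i"]  # Unix timestamp from the Algolia schema
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        days[day].append(post)
    # Most popular first within each day.
    return {
        day: sorted(group, key=lambda p: p["points"], reverse=True)
        for day, group in days.items()
    }
```

Serializing the result with `json.dumps` would yield the kind of single JSON file the web client loads.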
- Install Python dependencies with `poetry install`
- Activate the virtual environment with `poetry shell`
- Install JavaScript dependencies with `yarn install`
- Start the client with `yarn start`
Note that the scraper and frontend are largely independent, apart from the final `.json` output.
Initially, I also intended to use Reddit posts as an orthogonal source of recommendations. However, I found that Reddit's linked repositories are usually of much lower quality and included many bots, so I no longer consider them.