This project crawls the HackerNews website and scrapes data about the current top stories. The scraped stories are then written to STDOUT in JSON format.
HackerNews provides an API that enables clients to consume information about the top posts. For our use case, though, consuming the API would have been inefficient: in the worst case, we would need to make 100+ network requests to fetch the top 100 stories.
This solution makes a maximum of 4 network requests, as opposed to the 100+ API calls it would have taken to fetch the top 100 posts with the HackerNews API. Each HackerNews listing page shows 30 stories, so the top 100 span 4 pages.
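The request count above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper names `pagesNeeded` and `pageUrls` are hypothetical, and the page size of 30 and the `news?p=` URL pattern are assumptions about how HackerNews paginates its front page.

```javascript
// Assumption: HackerNews lists 30 stories per listing page.
const STORIES_PER_PAGE = 30;

// Hypothetical helper: number of listing pages to fetch for `posts` stories.
function pagesNeeded(posts) {
  return Math.ceil(posts / STORIES_PER_PAGE);
}

// Hypothetical helper: the listing URLs to request, one per page.
// Assumption: pages are addressed as news?p=1, news?p=2, and so on.
function pageUrls(posts) {
  const urls = [];
  for (let page = 1; page <= pagesNeeded(posts); page += 1) {
    urls.push(`https://news.ycombinator.com/news?p=${page}`);
  }
  return urls;
}

console.log(pagesNeeded(100)); // 4
console.log(pageUrls(31).length); // 2
```

Under these assumptions the number of requests grows with the number of pages rather than with the number of stories, which is where the saving over per-story API calls comes from.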
- Download and install Node.js here. Skip this step if you already have Node.js installed.
- Download and install Git here. Skip this step as well if you already have Git installed on your computer.
- Open a command line window in a newly created folder and run the following command:
  git clone https://github.com/iifeoluwa/hn-scraper.git .
- From the same command line window, run:
  npm install -g
After completing the steps above, you can run the tool from any command line window using hackernews. It also accepts a --posts argument that specifies the number of stories it should return.
To run tests, run npm test from the project directory.
hackernews --posts 1
// Writes to STDOUT
[ { title: 'Lambda School Announces $14M Series A Led by GV',
uri: 'https://lambdaschool.com/blog/lambda-school-announces-14-million-series-a-led-by-gv/',
author: 'tosh',
points: '31',
comments: '17',
rank: '1' } ]
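As a rough illustration of how a --posts value like the one above might be read and checked, here is a dependency-free sketch. It does not reproduce Minimist or the tool's real logic: the function name `parsePosts`, the default of 30, and the 1 to 100 bounds are all assumptions made for the example.

```javascript
// Hypothetical bounds and default; the real tool may use different values.
const MIN_POSTS = 1;
const MAX_POSTS = 100;
const DEFAULT_POSTS = 30;

// Reads `--posts N` or `--posts=N` from an argv-style array of strings.
function parsePosts(argv) {
  let raw;
  for (let i = 0; i < argv.length; i += 1) {
    if (argv[i] === '--posts') {
      // A missing value becomes '' so that it fails validation below.
      raw = i + 1 < argv.length ? argv[i + 1] : '';
    } else if (argv[i].startsWith('--posts=')) {
      raw = argv[i].slice('--posts='.length);
    }
  }
  if (raw === undefined) return DEFAULT_POSTS; // flag not supplied

  const posts = Number(raw);
  if (!Number.isInteger(posts) || posts < MIN_POSTS || posts > MAX_POSTS) {
    throw new RangeError(
      `--posts must be an integer between ${MIN_POSTS} and ${MAX_POSTS}`
    );
  }
  return posts;
}

console.log(parsePosts(['--posts', '1'])); // 1
```

In a real command line entry point, the array would typically come from process.argv.slice(2).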
The following libraries were used to create this tool:
- Got: A lightweight HTTP request library. It was chosen because the project only needs to make simple GET requests, and it is one of the lightest actively maintained libraries for making HTTP requests.
- Cheerio: Parses the HTML document and extracts the needed data from it. It provides an expressive API that makes it easy to find specific information in documents.
- Minimist: Parses the arguments passed to the hackernews tool. Makes it easier to handle and validate inputs.
- joi: Tool used to enforce validation rules and ensure only validated stories are retrieved.
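To give a feel for what validating a scraped story could involve, here is a plain JavaScript sketch that does not use joi. The rules are assumptions chosen for illustration (non-empty title and author, an http or https uri, and non-negative integer points, comments, and rank); the tool's actual schema is not shown here, and the helper names are hypothetical.

```javascript
// Hypothetical rule: a uri must parse as an http or https URL.
function isValidUri(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch (err) {
    return false; // new URL() throws on unparseable input
  }
}

// Hypothetical rule: counts are digit-only strings or non-negative integers.
function isNonNegativeInteger(value) {
  return /^\d+$/.test(String(value));
}

// Hypothetical check combining the assumed rules for one story object.
function isValidStory(story) {
  return (
    typeof story.title === 'string' && story.title.trim().length > 0 &&
    typeof story.author === 'string' && story.author.trim().length > 0 &&
    isValidUri(story.uri) &&
    isNonNegativeInteger(story.points) &&
    isNonNegativeInteger(story.comments) &&
    isNonNegativeInteger(story.rank)
  );
}

// Keeps only the stories that pass every assumed rule.
function filterValidStories(stories) {
  return stories.filter(isValidStory);
}

const sample = {
  title: 'Lambda School Announces $14M Series A Led by GV',
  uri: 'https://lambdaschool.com/blog/lambda-school-announces-14-million-series-a-led-by-gv/',
  author: 'tosh',
  points: '31',
  comments: '17',
  rank: '1',
};

console.log(isValidStory(sample)); // true
```

A schema library such as joi expresses rules like these declaratively instead of as hand-written checks, which tends to be easier to maintain as the rules grow.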