This repo is no longer needed. It was written to extract data from the original SuministrosPR site. The rewrite of SuministrosPR by the Code 4 Puerto Rico team breaks the scraper.
This is a web scraper for the SuministrosPR.com website. It was developed to extract the site's data and ingest it into a new webapp developed at https://github.com/Code4PuertoRico/suministrospr.
You must have a working Ruby environment (version 2.6.5) with Bundler installed.
git clone https://github.com/Code4PuertoRico/suministrospr-web-scraper
cd suministrospr-web-scraper
bundle install
bundle exec rake boom # chicken nuggets
For help, type:
bundle exec rake help
$ docker build --rm -t suministrospr-web-scraper .
$ docker run -it --rm -v "$PWD":/usr/src/app suministrospr-web-scraper bundle exec rake boom
You can also run bundle exec rake docker.
Each post found at SuministrosPR.com will be saved under the ./data directory as an individual JSON file.
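As a rough illustration of how the saved output could be consumed, the sketch below loads every JSON file in a directory into a Hash keyed by filename. The helper name load_posts and the .json file extension are assumptions made for this example; they are not taken from the scraper's code.

```ruby
require "json"

# Hypothetical helper (not part of the scraper): reads every *.json file
# under `dir` and returns a Hash of { filename => parsed JSON }.
# Assumes the scraper writes its files with a .json extension.
def load_posts(dir = "./data")
  Dir.glob(File.join(dir, "*.json")).sort.each_with_object({}) do |path, posts|
    posts[File.basename(path)] = JSON.parse(File.read(path))
  end
end
```

Calling load_posts("./data") after a scraper run should return one parsed entry per saved post; a missing or empty directory simply yields an empty Hash.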
Additional details:
- Entry will contain EMPTY_MUNICIPIO if municipio can't be parsed.
- Entry will contain EMPTY_TITLE if title is empty.
- Only the first 100 characters of the title will be used in the filename.
- Parser timestamp and DUPLICATE are appended to the filename if the file already exists.
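The rules above can be sketched roughly as follows. This is a minimal illustration and not the scraper's actual implementation: the method names normalize_entry and post_filename, the .json extension, the underscore separators, and the use of an integer Unix timestamp are all assumptions made for the example.

```ruby
# Hypothetical sketch of the naming rules; not the scraper's actual code.
EMPTY_MUNICIPIO = "EMPTY_MUNICIPIO"
EMPTY_TITLE = "EMPTY_TITLE"
MAX_TITLE_LENGTH = 100

# True when a value is nil or contains only whitespace.
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

# Substitutes the placeholder strings for a missing municipio or title.
def normalize_entry(municipio, title)
  {
    "municipio" => blank_value?(municipio) ? EMPTY_MUNICIPIO : municipio,
    "title" => blank_value?(title) ? EMPTY_TITLE : title
  }
end

# Builds a filename from the first 100 characters of the title. If that
# name is already taken, a timestamp and DUPLICATE are appended instead.
# The exact suffix format here is an assumption.
def post_filename(title, existing: [], now: Time.now)
  base = blank_value?(title) ? EMPTY_TITLE : title.to_s.strip
  base = base[0, MAX_TITLE_LENGTH]
  name = "#{base}.json"
  return name unless existing.include?(name)

  "#{base}_#{now.to_i}_DUPLICATE.json"
end
```

Passing the list of existing names and the clock in as arguments keeps the sketch free of filesystem access, so the rules can be exercised without a ./data directory.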