This repository has been archived by the owner on Feb 18, 2020. It is now read-only.

Code4PuertoRico/suministrospr-web-scraper


SuministrosPR Web Scraper

This repo is no longer needed. It was written to extract data from the original SuministrosPR site. The rewrite of SuministrosPR by the Code 4 Puerto Rico team breaks the scraper.


This is a web scraper for the SuministrosPR.com website. It was developed in order to extract the data and ingest it into a new webapp developed at https://github.com/Code4PuertoRico/suministrospr.

Installation

You need a working Ruby 2.6.5 environment with Bundler installed.

Clone the repo

git clone https://github.com/Code4PuertoRico/suministrospr-web-scraper
cd suministrospr-web-scraper

Install dependencies

bundle install

Executing scraper

bundle exec rake boom # chicken nuggets

For help, type:

bundle exec rake help

Docker

$ docker build --rm -t suministrospr-web-scraper .
$ docker run -it --rm -v "$PWD":/usr/src/app suministrospr-web-scraper bundle exec rake boom

You can also run bundle exec rake docker.

Data

Each post found at SuministrosPR.com will be saved under the ./data directory as an individual JSON file.

Additional details:

  • An entry will contain EMPTY_MUNICIPIO if the municipio can't be parsed.
  • An entry will contain EMPTY_TITLE if the title is empty.
  • Only the first 100 characters of the title are used in the filename.
  • The parser timestamp and DUPLICATE are appended to the filename if the file already exists.
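
The rules above can be illustrated with a minimal Ruby sketch. It is not the scraper's actual code: the helper names (`presence_or`, `entry_path`, `save_entry`), the JSON keys, the character substitution in the filename, and the exact order and format of the timestamp and DUPLICATE suffix are all assumptions made for illustration.

```ruby
require "json"

# Placeholder values named after the markers described in this README.
EMPTY_MUNICIPIO = "EMPTY_MUNICIPIO"
EMPTY_TITLE = "EMPTY_TITLE"
MAX_TITLE_LENGTH = 100 # only the first 100 characters of the title are used

# Hypothetical helper: returns the stripped value, or the fallback when blank.
def presence_or(value, fallback)
  text = value.to_s.strip
  text.empty? ? fallback : text
end

# Hypothetical helper: builds the target path for an entry.
# Assumption: unsafe characters are replaced with "-", and the suffix is
# "-<unix timestamp>-DUPLICATE" when the plain filename is already taken.
def entry_path(dir, title, now: Time.now)
  base = presence_or(title, EMPTY_TITLE)[0, MAX_TITLE_LENGTH]
  base = base.gsub(/[^\p{Alnum}_-]+/, "-")
  path = File.join(dir, "#{base}.json")
  return path unless File.exist?(path)

  File.join(dir, "#{base}-#{now.to_i}-DUPLICATE.json")
end

# Hypothetical helper: writes one post as an individual JSON file.
# The keys "title", "municipio" and "content" are assumed, not confirmed.
def save_entry(dir, post, now: Time.now)
  entry = {
    "title" => presence_or(post["title"], EMPTY_TITLE),
    "municipio" => presence_or(post["municipio"], EMPTY_MUNICIPIO),
    "content" => post["content"].to_s
  }
  path = entry_path(dir, entry["title"], now: now)
  File.write(path, JSON.pretty_generate(entry))
  path
end
```

Under these assumptions, calling `save_entry("./data", post)` twice with the same title would leave the first file untouched and write the second one with the timestamp and DUPLICATE suffix. Two duplicates saved within the same second would collide in this sketch.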

