This is a version of versionista-outputter that has been rewritten in Node.js and JSDom.
Why? Speed is important here. Scraping Versionista can take a long time. We don’t need the overhead of a browser (like loading images and CSS and executing JavaScript) because all the necessary content is in the initial HTML payload. Parallelizing operations is also a little easier (for me, at least) in Node than in Ruby, and we absolutely ought to be doing more in parallel.
You’ll need Node.js. Then you should be able to globally install this with:
$ npm install -g https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper.git
Then run it like so:
$ scrape-versionista --email EMAIL --password PASSWORD --after '2017-03-22' --format csv --output './scrape/versions.csv'
You can also split output into multiple files (by site) with the --group-by-site option:
$ scrape-versionista --email EMAIL --password PASSWORD --after '2017-03-22' --format csv --output './scrape/versions.csv' --group-by-site
Alternatively, you can clone this repo, then:
$ yarn install
# Or if you don't have yarn:
$ npm install
# And run it:
$ ./bin/scrape-versionista --email EMAIL --password PASSWORD --after '2017-03-22' --format csv --output './scrape/versions.csv'
This has the same basic capabilities as versionista-outputter, but can also save the versioned HTML (and diffs).
For basic info:
$ scrape-versionista --help
- --email STRING
  Required! The e-mail address of your Versionista account. You can also use an environment variable instead: VERSIONISTA_EMAIL
- --password STRING
  Required! The password of your Versionista account. You can also use an environment variable instead: VERSIONISTA_PASSWORD
- --after DATE|HOURS
  Only check versions captured after this date. It can be an ISO 8601 date string like 2017-03-01T00:00:00Z or a number, representing hours before the current time.
- --before DATE|HOURS
  Only check versions captured before this date. It can be an ISO 8601 date string like 2017-03-01T00:00:00Z or a number, representing hours before the current time.
- --format FORMAT
  The output format. One of: csv, json, json-stream. [default: json]
- --output FILEPATH
  Write output to this file instead of directly to your console on stdout.
- --save-content
  If set, the raw HTML of each captured version will also be saved. Files are written to the working directory or, if --output is specified, the same directory as the output file.
- --save-diffs
  If set, the HTML of diffs between a version and its previous version will also be saved. Files are written to the working directory or, if --output is specified, the same directory as the output file.
- --latest-version-only
  If set, only the latest version (of the versions matching --after/--before times) for each page is captured.
- --group-by-site
  If set, a separate output file will be generated for each site. Files are placed in the same directory as --output, so the actual filename specified in --output will never be created.
ALL the options!
$ scrape-versionista --email 'somebody@somewhere.com' --password somepassword --after '2017-02-01' --before '2017-03-01' --format csv --output './scrape/versions.csv' --save-content --save-diffs
Use environment variables for credentials:
$ export VERSIONISTA_EMAIL='somebody@somewhere.com'
$ export VERSIONISTA_PASSWORD=somepassword
$ scrape-versionista --after '2017-02-01' --before '2017-03-01' --format csv --output './scrape/versions.csv' --save-content --save-diffs
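The fallback from command-line options to environment variables can be pictured with a short sketch. This is not the scraper's actual code: the function name resolveCredentials and the choice to let an explicit option win over the environment variable are assumptions made for illustration. Only the variable names VERSIONISTA_EMAIL and VERSIONISTA_PASSWORD come from the option descriptions above.

```javascript
// Hypothetical helper, NOT taken from the scraper's source.
// Assumption: an explicit option takes precedence over the environment variable.
function resolveCredentials(options, env = process.env) {
  const email = options.email || env.VERSIONISTA_EMAIL;
  const password = options.password || env.VERSIONISTA_PASSWORD;

  // Both values are described as required, so fail early if either is missing.
  if (!email || !password) {
    throw new Error(
      'Missing credentials: pass --email/--password or set ' +
      'VERSIONISTA_EMAIL/VERSIONISTA_PASSWORD'
    );
  }
  return { email, password };
}

// Example: the e-mail comes from the options, the password from the environment.
const demoCredentials = resolveCredentials(
  { email: 'somebody@somewhere.com' },
  { VERSIONISTA_PASSWORD: 'somepassword' }
);
console.log(demoCredentials.email);
```

Passing the environment in as a parameter keeps the sketch easy to try without touching your real shell environment.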
Specifying time as hours ago instead of a date:
# Starting 5 hours ago
$ scrape-versionista --after 5
# Decimals are accepted, so you can start 30 minutes ago, too
$ scrape-versionista --after 0.5
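To make the two accepted forms of --after and --before concrete, here is a minimal sketch of how such a value could be turned into a timestamp. It is an illustration under assumptions, not the scraper's implementation: parseTimeOption is a hypothetical name, and treating any purely numeric value as hours is this sketch's own rule.

```javascript
// Hypothetical helper, NOT taken from the scraper's source.
// A plain number means "hours before now"; anything else is parsed as a date.
function parseTimeOption(value, now = new Date()) {
  const text = String(value).trim();

  // Accept whole numbers and decimals such as "5" or "0.5".
  if (/^\d+(\.\d+)?$/.test(text)) {
    const hours = Number(text);
    return new Date(now.getTime() - hours * 60 * 60 * 1000);
  }

  // Otherwise expect a date string such as 2017-03-01T00:00:00Z.
  const date = new Date(text);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Not a valid date or number of hours: "${value}"`);
  }
  return date;
}

const referenceTime = new Date('2017-03-22T12:00:00Z');
console.log(parseTimeOption('0.5', referenceTime).toISOString());
// 2017-03-22T11:30:00.000Z
console.log(parseTimeOption('2017-03-01T00:00:00Z', referenceTime).toISOString());
// 2017-03-01T00:00:00.000Z
```

One consequence of this rule is that a bare year like 2017 would be read as a number of hours, so full dates are the safer choice when you mean a date.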
The bin directory contains several other scripts besides scrape-versionista. They’re all closely related and perform helper tasks that are important in EDGI’s workflow around Versionista. You can use the --help option with all of them to see details about arguments, options, and usage.
- scrape-versionista-and-email runs scrape-versionista, then compresses the results into a single .tar.gz archive and e-mails them to a specified address.
- scrape-versionista-and-upload runs scrape-versionista, uploads the resulting files to Amazon S3 and Google Cloud Storage, and finally imports them into an instance of web-monitoring-db.
- upload-to-google uploads a directory’s contents to Google Cloud Storage. (Used as part of scrape-versionista-and-upload.)
- upload-to-s3 uploads a directory’s contents to Amazon S3. (Used as part of scrape-versionista-and-upload.)
- import-to-db sends the contents of a JSON-stream file listing versions that was generated by scrape-versionista to an instance of web-monitoring-db. (Used as part of scrape-versionista-and-upload.)
- query-db-and-email queries a web-monitoring-db instance for pages that were updated with new versions during a given time frame and e-mails a compressed .tar.gz archive of the results to a specified address. Results are CSV files, one per combination of tags specified with the --group-by option.
  NOTE: this will soon be deprecated in favor of web-monitoring-task-sheets.
- get-versionista-metadata and get-versionista-page-chunk are for advanced usage loading extremely large amounts of data from Versionista. See backfilling-data.md for usage instructions.
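If you want to inspect a json-stream file yourself before importing it, a sketch like the following may help. It rests on an assumption that is not confirmed by this README: that json-stream output holds one JSON object per line. The function name parseJsonStream and the field names in the sample data (page_url, capture_time) are likewise hypothetical and only there to make the example runnable.

```javascript
// Hypothetical helper, NOT taken from the scraper's source.
// Assumption: the input holds one JSON object per line; blank lines are skipped.
function parseJsonStream(text) {
  const records = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') return;
    try {
      records.push(JSON.parse(line));
    } catch (error) {
      // Report the 1-based line number so a bad record is easy to find.
      throw new Error(`Line ${index + 1} is not valid JSON: ${error.message}`);
    }
  });
  return records;
}

// Made-up sample data; real records will have different fields.
const sampleStream = [
  '{"page_url": "https://example.com/a", "capture_time": "2017-03-22T10:00:00Z"}',
  '{"page_url": "https://example.com/b", "capture_time": "2017-03-22T11:00:00Z"}',
  ''
].join('\n');

console.log(parseJsonStream(sampleStream).length); // 2
```

For a real file you would read its contents first (for example with fs.readFileSync from Node's standard library) and pass the resulting string in.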
For details about how this tool is deployed to automatically scrape Versionista in production, see deployment.md.
This repository falls under EDGI's Code of Conduct.
We love improvements to our tools! EDGI has general guidelines for contributing to all of our organizational repos.
Copyright (C) 2017 Environmental Data and Governance Initiative (EDGI)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the LICENSE file for details.