unitedstates/congress

Public domain code that collects data about the bills, amendments, roll call votes, and other core data about the U.S. Congress.

Includes:

  • A scraper for THOMAS.gov, the official source of information on the life and times of legislation and presidential nominations in Congress.

  • Scrapers for House and Senate roll call votes.

  • A scraper for GPO FDSys, the official repository for most legislative documents.

Read about the contents and schema in the documentation in the github project wiki.

For background on how this repository came to be, see Eric's blog post.

Setting Up

The scripts are tested with Python 2.7. On Ubuntu, you'll need these packages (the last three are required by the lxml Python package):

sudo apt-get install git python-virtualenv python-dev libxml2-dev libxslt1-dev libz-dev

It's recommended you first create and activate a virtualenv with:

virtualenv virt
source virt/bin/activate

You don't have to call it "virt", but the project's gitignore is set up to ignore it already if you do.

Whether or not you use virtualenv:

pip install -r requirements.txt

Collecting the data

The general form to start the scraping process is:

./run <data-type> [--force] [--fast] [other options]

where data-type names the kind of data to collect, such as bills or votes.

To scrape bills and resolutions from THOMAS, run:

./run bills

The bills script writes bulk data into a top-level data directory, organized by Congress number, bill type, and bill number. It generates two output files for each bill: a JSON version (data.json) and an XML version (data.xml).
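To consume this output, one option is to walk the data directory and load every data.json it contains. The following is a minimal sketch and not part of the project: the function name iter_bill_records is hypothetical, and it assumes only that each bill's directory holds a file named data.json, as described above. It walks the whole tree instead of relying on an exact directory layout, and it makes no assumption about the fields inside each file (see the wiki for the schema).

```python
import json
import os


def iter_bill_records(data_dir="data"):
    """Yield (path, parsed JSON) for every data.json under data_dir.

    Hypothetical helper: it walks the whole tree rather than assuming
    an exact directory layout.
    """
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()  # visit subdirectories in a stable order
        if "data.json" in files:
            path = os.path.join(root, "data.json")
            with open(path) as f:
                yield path, json.load(f)


if __name__ == "__main__":
    # Print the location of every bill record found under ./data
    for path, record in iter_bill_records():
        print(path)
```

Because the helper is a generator, it reads one file at a time, which matters when the data directory holds many bills.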

Common options

The scripts cache all downloaded pages and will not re-fetch them from the network unless the --force flag is passed:

./run bills --force

The --force flag applies to all data types. Because --force downloads and parses every object again, the --fast flag (available for bills and votes) limits processing to objects that are believed to have changed. Always use --fast with --force.
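The caching rule described above can be sketched as follows. This illustrates the general technique only and is not the project's actual code: fetch_cached, its parameters, and the download callable are all hypothetical names chosen for the example.

```python
import os


def fetch_cached(url, cache_path, download, force=False):
    """Return the body for url, reading cache_path unless force is set.

    Hypothetical sketch: `download` is any callable that takes a URL
    and returns text. The project's real fetching code may differ.
    """
    # Serve from the cache when allowed and a cached copy exists.
    if not force and os.path.exists(cache_path):
        with open(cache_path) as f:
            return f.read()

    # Otherwise go to the network and refresh the cached copy.
    body = download(url)
    cache_dir = os.path.dirname(cache_path)
    if cache_dir and not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    with open(cache_path, "w") as f:
        f.write(body)
    return body
```

Passing the downloader in as a callable keeps the sketch free of any particular HTTP library and makes the cache logic easy to exercise without network access.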

Debugging messages are hidden by default. To include them, run with --log=info or --debug. To hide even warnings, run with --log=error.
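One way such options could map onto Python's standard logging levels is sketched below. This is an assumption-laden illustration, not the project's implementation: resolve_log_level is a hypothetical name, and the sketch assumes that --debug selects the most verbose level and that the default shows warnings and errors only.

```python
import logging

# Hypothetical mapping from --log values to standard logging levels.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "error": logging.ERROR,
}


def resolve_log_level(log=None, debug=False):
    """Pick a logging level from the --log value and the --debug flag.

    Assumption: with neither option given, warnings and errors are
    shown and debugging messages are hidden.
    """
    if debug:
        return logging.DEBUG
    if log is None:
        return logging.WARNING
    try:
        return LEVELS[log.lower()]
    except KeyError:
        raise ValueError("unknown log level: %r" % log)


if __name__ == "__main__":
    # Equivalent in spirit to running with --log=info
    logging.basicConfig(level=resolve_log_level(log="info"))
    logging.info("informational messages are now visible")
```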

To receive errors by email, copy config.yml.example to config.yml and fill in the SMTP options. The script will automatically use these details when a parsing or execution error occurs.

Data Output

The script will cache downloaded pages in a top-level cache directory, and output bulk data in a top-level data directory.

Two bulk data output files will be generated for each object: a JSON version (data.json) and an XML version (data.xml). The XML version attempts to maintain backwards compatibility with the XML bulk data that GovTrack.us has provided for years. Add the --govtrack flag to get fully backward-compatible output using GovTrack IDs (otherwise, the source IDs for legislators are used).

See the project wiki for documentation on the output format.

Contributing

Pull requests with patches are awesome. Unit tests are strongly encouraged (example tests).

The best way to file a bug is to open a ticket.

Running tests

To run this project's unit tests:

./test/run

Who's Using This Data

The Sunlight Foundation and GovTrack.us are the two principal maintainers of this project.

Both Sunlight and GovTrack operate APIs where you can get much of this data delivered over HTTP.

License

All data and code in this project are licensed under CC0 (summary).