GitHub - clytwynec/truth-in-testimony-scraper: Using Selenium, crawls docs.house.gov for TTF files

Truth in Testimony Form Crawler

Crawls docs.house.gov with Selenium and downloads Truth in Testimony form (TTF) files. The files are uploaded to S3 and their metadata is stored in a CSV.

Background

This code was used in an investigative journalism project examining the implementation and effectiveness of the "Truth in Testimony" rule in the U.S. House of Representatives, which requires those who testify before the House to disclose any foreign funding related to the hearing at which they testify. The final product includes a 4,000+ word story, a database of the testimonies and a description of our process, and a "by the numbers" overview of our findings. This code is not generalized; it is kept available for transparency and as a resource for others interested in similar projects.

Installation

pip install -r requirements.txt

Set up AWS credentials to save the files to a bucket. See https://github.com/boto/boto3#quick-start.
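Once credentials are configured, saving a downloaded form to a bucket could look roughly like the sketch below. This is not the code in boto_util.py; the function name upload_form, the "ttf" key prefix, and the bucket name are all hypothetical. The only boto3 calls used are boto3.client("s3") and the client's upload_file(Filename, Bucket, Key) method.

```python
from pathlib import Path


def upload_form(local_path, bucket, prefix="ttf", client=None):
    """Upload one downloaded form to S3 and return the object key used.

    `prefix` and the key layout are assumptions made for this sketch.
    """
    if client is None:
        # Imported lazily so the helper can be exercised without AWS access.
        import boto3

        client = boto3.client("s3")  # picks up the credentials configured above

    # Store the file under "<prefix>/<original file name>".
    key = f"{prefix}/{Path(local_path).name}"
    client.upload_file(str(local_path), bucket, key)
    return key
```

For example, upload_form("downloads/form.pdf", "my-ttf-bucket") would store the file under the key ttf/form.pdf, assuming a bucket with that (hypothetical) name exists.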

Usage

To see the available options, run python crawl_ttf.py -h

Post-crawl processing

Concatenating the years

merge_csv.py was used to combine the crawls from separate years into one file.
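As an illustration of this step, the sketch below concatenates several per-year CSV files into one using only the standard library. It is not the contents of merge_csv.py; the function name merge_csvs and the assumption that every input file shares the same header are mine.

```python
import csv


def merge_csvs(input_paths, output_path):
    """Concatenate CSV files that share a header; return the number of rows written."""
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first non-empty file defines the output columns.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    # Raises ValueError if a later file has unexpected columns.
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as merge_csvs(["2015.csv", "2016.csv"], "all_years.csv") would write the header once, followed by the rows of each file in order (the file names here are placeholders).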

Removing duplicates

drop_duplicates.py was used to remove duplicate rows from the CSV file, identifying duplicates by UID.
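A minimal sketch of de-duplicating by UID follows. It is not the contents of drop_duplicates.py: the column name "UID", the choice to keep the first occurrence of each UID, and both function names are assumptions made for illustration.

```python
import csv


def drop_duplicate_rows(rows, key="UID"):
    """Return rows with repeated `key` values removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        uid = row[key]
        if uid in seen:
            continue  # a row with this UID was already kept
        seen.add(uid)
        unique.append(row)
    return unique


def dedupe_csv(input_path, output_path, key="UID"):
    """Read a CSV with a header row, drop duplicate rows, write the result."""
    with open(input_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        unique = drop_duplicate_rows(reader, key=key)
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(unique)
    return len(unique)
```

Keeping the first occurrence preserves the original row order, which makes the output easy to compare against the merged file.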

Truth in Testimony forms

After crawling, we went through the Truth in Testimony forms manually and recorded the disclosed foreign funding. I used merge_data.py to merge that data with the originally crawled data.
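The merge described above can be pictured as a left join on UID: every crawled row is kept, and the manually recorded columns are attached where a match exists. The sketch below shows that idea; it is not the contents of merge_data.py, and the join key "UID", the function name, and the example column foreign_funding are hypothetical.

```python
def merge_on_uid(crawled_rows, manual_rows, key="UID"):
    """Left-join manually recorded columns onto crawled rows by `key`."""
    manual_by_uid = {row[key]: row for row in manual_rows}

    # Collect the manual columns (other than the key) in first-seen order.
    extra_cols = []
    for row in manual_rows:
        for col in row:
            if col != key and col not in extra_cols:
                extra_cols.append(col)

    merged = []
    for row in crawled_rows:
        match = manual_by_uid.get(row[key], {})
        combined = dict(row)  # copy so the input rows are not modified
        for col in extra_cols:
            # Rows with no manual entry get an empty value.
            combined[col] = match.get(col, "")
        merged.append(combined)
    return merged
```

Crawled rows that were never reviewed by hand keep their original fields and receive empty strings in the added columns, so gaps in the manual data stay visible.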

Identifying think tanks

id_think_tanks.py creates a copy of the data with a column think_tank that contains the name of the think tank named in the witness_desc field. This works off of a list of known think tanks and isn't necessarily complete.
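One way to picture this step is a case-insensitive substring match of each known name against witness_desc. The sketch below is not the contents of id_think_tanks.py: the matching rule, the function names, and the sample names in the usage note are assumptions; only the witness_desc and think_tank column names come from the description above.

```python
def find_think_tank(witness_desc, known_think_tanks):
    """Return the first known think tank named in the description, or ""."""
    desc = (witness_desc or "").lower()
    for name in known_think_tanks:
        if name.lower() in desc:
            return name
    return ""


def add_think_tank_column(rows, known_think_tanks):
    """Return copies of the rows with a `think_tank` column added."""
    return [
        dict(row, think_tank=find_think_tank(row.get("witness_desc"), known_think_tanks))
        for row in rows
    ]
```

With a list such as ["Brookings Institution", "Heritage Foundation"], a witness described as "Senior Fellow, Brookings Institution" would be tagged "Brookings Institution", while anyone whose organization is not on the list gets an empty think_tank value. That is why the result is only as complete as the list.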

Folders and files

.gitignore
README.md
boto_util.py
crawl_ttf.py
id_think_tanks.py
merge_csv.py
merge_data.py
requirements.txt
ttf_crawler_gif.gif
