Link directly to 990 PDFs #160
The Internet Archive has 900+ ISOs of filings from the IRS, organized by date and type. I opened up a sample: each ISO contains the PDFs plus a tab-delimited manifest file listing the file path, EIN, org name, filing type, date, and other metadata.
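A minimal sketch of reading one of those manifests, in case it helps. The column order in `MANIFEST_FIELDS` and the sample row are assumptions for illustration only; the real manifest layout should be checked against an actual ISO.

```python
import csv
import io

# Assumed column order (hypothetical); verify against a real manifest.
MANIFEST_FIELDS = ["path", "ein", "org_name", "filing_type", "date"]


def parse_manifest(text):
    """Parse tab-delimited manifest text into a list of dicts.

    Columns beyond MANIFEST_FIELDS (the "other metadata") are ignored here.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(MANIFEST_FIELDS, record)))
    return rows


# Made-up sample row, not taken from a real manifest.
sample = "pdfs/example.pdf\t123456789\tExample Org\t990\t201412\textra\n"
entries = parse_manifest(sample)
```

Reading through `csv.reader` with `QUOTE_NONE` keeps org names that contain quote characters from being misparsed.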
I'm hopeful that someone already has these on S3. Otherwise, scripting the process shouldn't be that hard, and it'll cost us about $1/month to host. We'll just have to find a good way to index them. Preserving the existing structure (year+type/ein+year+type.pdf) seems the most straightforward, with the lookup stored in a single flat table.
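One way the year+type/ein+year+type.pdf layout and the flat lookup table could look, as a rough sketch. The underscore separator, the `s3_key` and `build_lookup` names, and the sample filing are all assumptions, not something decided in this thread.

```python
import csv
import io


def s3_key(ein, year, filing_type):
    """Build a key shaped like year+type/ein+year+type.pdf.

    The "_" separator is an assumption; the thread does not specify one.
    """
    folder = f"{year}_{filing_type}"
    filename = f"{ein}_{year}_{filing_type}.pdf"
    return f"{folder}/{filename}"


def build_lookup(filings):
    """Return CSV text for a flat lookup table: one row per filing."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ein", "year", "filing_type", "s3_key"])
    for ein, year, filing_type in filings:
        writer.writerow([ein, year, filing_type, s3_key(ein, year, filing_type)])
    return out.getvalue()


# Made-up filing for illustration.
lookup_csv = build_lookup([("123456789", "2014", "990")])
```

Keeping the key derivable from (EIN, year, type) means an organization page could compute the link directly and only fall back to the table when a filing is missing.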
Following the instructions here: do an advanced search and ask for a CSV: https://archive.org/advancedsearch.php?q=collection:IRS990
AWS machine for processing
ebs mounted at |
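For the advanced-search CSV request mentioned above, here is a small sketch that builds the URL. The `fl[]`, `rows`, and `output=csv` query parameters are what the archive.org advanced search form appears to use, but treat them as assumptions and confirm against the form before relying on them; `search_csv_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def search_csv_url(collection="IRS990", fields=("identifier",), rows=1000):
    """Build an advanced-search URL that asks for CSV output.

    Parameter names (fl[], rows, output) are assumptions to verify.
    """
    params = [("q", f"collection:{collection}")]
    params += [("fl[]", field) for field in fields]
    params += [("rows", str(rows)), ("output", "csv")]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = search_csv_url()
# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlsplit(url).query)
```

The resulting URL can be passed to wget or curl; `rows` needs to be at least the number of items in the collection to get everything in one response.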
Here's the list of 990 uploads on the archive: https://gist.github.com/hampelm/c5e22d1ac19bea8fd57b44aee4f09962
Work-in-progress wget command to capture a single one:
We probably want to add a column called "s3path" to the file to define where each one will be uploaded on S3, since the directory paths vary.
Downloads are running pretty slowly (2-3 MB/s on EC2), so this first part will take a while. The next step will be to mount the ISOs with something like
As much as we dislike 'em, the 990 PDFs aren't going away. Since they provide so much info and are behind a login wall in many places, it'd be helpful to link directly to them on organization pages.
To do: