
Seasonal Jobs.gov data scraper

Source code for a tool based on Django, PostgreSQL, and AWS Lambda for scraping data from seasonaljobs.dol.gov.

Code (c) Research Action Design, LLC. Originally produced for Centro de los Derechos del Migrante, Inc.

Released under the GPL v3 license; see the LICENSE file for the full license text.

Local set-up

  1. Run PIPENV_VENV_IN_PROJECT=1 pipenv install
  2. Configure environment variables in a .env file (see the example below the list). You'll need, at a minimum:
    • JOBS_API_KEY - API key for requests to seasonaljobs.dol.gov's Microsoft Search back-end. It can be found by inspecting network requests on an individual job listing in a web browser.
  3. Run pre-commit install to install pre-commit hooks for the Black Python formatter.
  4. Run python manage.py migrate --run-syncdb to create a local database.
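
A minimal local .env might look like the following (only JOBS_API_KEY comes from this README; the placeholder value is illustrative):

JOBS_API_KEY=<key copied from the browser's network inspector>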

Scraper commands

All of the scraper functionality can be run via Django's python manage.py <command> interface; example invocations follow the list below.

  • scrape_rss - Download the most recent RSS feed of job listings and create/update listing records for each item in the feed.
  • scrape_listings - Query the API for the data of a single listing, save it to the database, and download a PDF of the full job listing application, saving it to wherever file uploads are stored.
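
For example, once the environment is configured, the commands can be run locally (a sketch; scrape_listings may accept additional arguments, so check python manage.py scrape_listings --help):

python manage.py scrape_rss
python manage.py scrape_listings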

Production deployment

The scraper is designed to run as an AWS Lambda function, saving listings to an RDS database and saving PDFs to an S3 bucket. The file lambda_function.py contains a lambda function handler which essentially passes through commands to the Django management command parser. If no command or an invalid command is set, the lambda handler just returns some basic stats.
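
As a rough illustration only (not the project's actual lambda_function.py; the settings module path, return payloads, and use of call_command are assumptions), the pass-through pattern looks something like this:

import os
import django

# Assumed settings module path; the real project may use a different one.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
django.setup()

from django.core.management import call_command

KNOWN_COMMANDS = {"scrape_rss", "scrape_listings"}

def lambda_handler(event, context):
    command = (event or {}).get("command")
    if command in KNOWN_COMMANDS:
        # Pass the event's command through to Django's management machinery.
        call_command(command)
        return {"status": "ok", "command": command}
    # No command or an invalid command: fall back to returning basic stats.
    return {"status": "no-op", "known_commands": sorted(KNOWN_COMMANDS)}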

In order to run on AWS, the following environment variables need to be set (a sketch of how a settings module might read them follows the list):

  • AWS_PGPASS - password for the postgres user on the RDS instance
  • AWS_PGHOST - domain name of the RDS instance
  • USE_AWS - flag to use AWS; should be set to anything other than False
  • AWS_STORAGE_BUCKET_NAME - S3 bucket to store job order PDFs in
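
For illustration, a Django settings module could consume these variables roughly as follows (the variable names come from this README; the database name, user, port, and the django-storages back-end are assumptions):

import os

USE_AWS = os.environ.get("USE_AWS", "False") != "False"

if USE_AWS:
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "postgres",      # assumed database name
            "USER": "postgres",      # assumed database user
            "PASSWORD": os.environ["AWS_PGPASS"],
            "HOST": os.environ["AWS_PGHOST"],
            "PORT": "5432",          # assumed default PostgreSQL port
        }
    }
    # Assumes django-storages is installed for S3-backed file storage.
    AWS_STORAGE_BUCKET_NAME = os.environ["AWS_STORAGE_BUCKET_NAME"]
    DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"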

Additionally, the Lambda function must be in the same VPC as the RDS instance and have a role which has write access to the relevant S3 bucket. Lastly, the VPC needs to have a NAT Gateway in order for the scraper to successfully make outgoing requests. See this article for a full how-to.

To deploy to AWS

  • Run ./deploy-lambda.sh from the console. This command creates a zip file containing all project dependencies (from .venv/../site-packages), an AWS Lambda-friendly build of psycopg2 (from the aws_psycopg2 directory, sourced from https://github.com/jkehler/awslambda-psycopg2), and the project code, and then uploads it to AWS as a Lambda function.

Scheduling on AWS

Schedule the scraper using Amazon EventBridge. The event input should be fixed JSON, e.g.:

{"command": "scrape_rss"}

Running migrations on AWS

To run migrations on AWS, set the .env variables so that the AWS PostgreSQL database can be accessed locally, then run migrations as normal. Note that you will need to ensure your IP address is allowed as an inbound/outbound address in the database's security group.
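
For example, something along these lines should work from a local shell (a sketch; the placeholders are illustrative, and the variables can equally be set in .env):

USE_AWS=True AWS_PGHOST=<rds-instance-hostname> AWS_PGPASS=<rds-password> python manage.py migrate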
