GitHub - internetarchive/sandcrawler: Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

                                  _                         _           
__________    ___  __ _ _ __   __| | ___ _ __ __ ___      _| | ___ _ __ 
\         |  / __|/ _` | '_ \ / _` |/ __| '__/ _` \ \ /\ / / |/ _ \ '__|
 \        |  \__ \ (_| | | | | (_| | (__| | | (_| |\ V  V /| |  __/ |   
  \ooooooo|  |___/\__,_|_| |_|\__,_|\___|_|  \__,_| \_/\_/ |_|\___|_|

This repo contains back-end Python workers, scripts, config files, and other material related to the Internet Archive web group's scholarly web preservation and processing pipeline. It is a complement to fatcat, which is an open catalog of research outputs, including preservation metadata.

The sandcrawler part of the project deals with content crawled from the web into either web.archive.org or archive.org collections, and with post-processing that content: for example, extracting text from PDF files, verifying mimetypes, and checking archival status. The resulting metadata is filtered, transformed, and pushed into fatcat itself for public use.
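As a rough illustration of the "verifying mimetypes" step, here is a minimal sketch in plain Python. It is not the code used in this repository: the function names, the record fields, and the small magic-byte table are all hypothetical, and a real pipeline would more likely rely on a library such as libmagic.

```python
import hashlib

# Hypothetical magic-byte table, for illustration only. This is an assumption,
# not the table or library that sandcrawler actually uses.
MAGIC_PREFIXES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]


def sniff_mimetype(blob: bytes) -> str:
    """Guess a mimetype from the leading bytes of a file body."""
    for prefix, mimetype in MAGIC_PREFIXES:
        if blob.startswith(prefix):
            return mimetype
    return "application/octet-stream"


def file_metadata(blob: bytes) -> dict:
    """Build a small metadata record (hypothetical field names)."""
    return {
        "size_bytes": len(blob),
        "sha1hex": hashlib.sha1(blob).hexdigest(),
        "mimetype": sniff_mimetype(blob),
    }


def mimetype_matches(blob: bytes, claimed: str) -> bool:
    """Compare a claimed Content-Type against the sniffed mimetype."""
    # Drop parameters such as "; charset=binary" before comparing.
    claimed_base = claimed.split(";")[0].strip().lower()
    return sniff_mimetype(blob) == claimed_base
```

A crawled body whose server-reported type is `application/pdf` but which starts with `<html>` would fail `mimetype_matches`, which is the kind of mismatch this step is meant to catch.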

While code in this repository is public, it is mostly IA-specific and may not even run outside the IA data centers due to library dependencies and authentication needs. Code quality and documentation are generally poor compared to fatcat.

As of December 2022, the best document to read for "getting started" in understanding the ingest system is proposals/2019_ingest.md, and then subsequent proposals expanding on that foundation.

Archive-specific deployment/production guides and Ansible scripts are at: journal-infra

Repository Layout

./python/ contains scripts and utilities for ingesting content from wayback and/or the web (via save-page-now API), and other processing pipelines. Most of the active code is in here. See the included README (./python/README.md).

./sql/ contains schema, queries, and backfill scripts for a PostgreSQL database index (e.g., file metadata, CDX, and GROBID status tables).

./python_hadoop/ contains Hadoop streaming jobs written in Python using the mrjob library. The HBase backfill code path is still used occasionally.

./proposals/ design documentation and change proposals

./notes/ingest/ log of bulk crawls and metadata loads

./extra/docker/ docker-compose setup that may be useful for documentation (includes Kafka, PostgreSQL, etc)

./.gitlab-ci.yml current CI setup script, which documents dependencies

./pig/ contains a handful of Pig scripts, as well as some unit tests implemented in Python. Only rarely used.

./scalding/ contains Hadoop jobs written in Scala using the Scalding framework. The intent was to write new non-trivial Hadoop jobs in Scala, which brings type safety and compiled performance. Mostly DEPRECATED: this code has not been run in years.

Running Python Hadoop Jobs

The ./please python3 wrapper script is a helper for running jobs (Python or Scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency tasks first; see README files in subdirectories.
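For readers unfamiliar with the map/reduce shape of the Hadoop streaming jobs mentioned above, the following is a minimal stand-alone sketch in plain Python. It is not taken from ./python_hadoop/ and does not use mrjob or the ./please wrapper. The function names and sample lines are hypothetical, and it assumes the common space-separated 11-field CDX layout with the mimetype in the fourth column.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_cdx_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (mimetype, 1) for one 11-field CDX line."""
    fields = line.strip().split(" ")
    if len(fields) != 11:
        return  # skip the header line and malformed rows
    yield fields[3].lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted values per key."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def count_mimetypes(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map and reduce steps locally over an iterable of lines."""
    return reduce_counts(pair for line in lines for pair in map_cdx_line(line))


# Made-up sample rows; digests and filenames are placeholders.
SAMPLE_CDX = [
    " CDX N b a m s k r M S V g",
    "org,example)/a.pdf 20200101000000 http://example.org/a.pdf application/pdf 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 100 EXAMPLE-1.warc.gz",
    "org,example)/b.pdf 20200102000000 http://example.org/b.pdf application/pdf 200 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB - - 2345 200 EXAMPLE-1.warc.gz",
    "org,example)/ 20200103000000 http://example.org/ text/html 200 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC - - 345 300 EXAMPLE-2.warc.gz",
]

if __name__ == "__main__":
    print(count_mimetypes(SAMPLE_CDX))
```

On a cluster, the map and reduce steps would run as separate distributed tasks over sharded input; running them in one process as above is only useful for understanding or testing the logic.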
