🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Collect and revisit web pages.
A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
Serverless replay of web archives directly in the browser
Run a high-fidelity browser-based web archiving crawler in a single Docker container
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Streaming WARC/ARC library for fast web archive IO
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
News crawling with StormCrawler - stores content as WARC
Bitextor generates translation memories from multilingual websites
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Chrome extension to "Create WARC files from any webpage"
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
CoCrawler is a versatile web crawler built using modern tools and concurrency.
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.