A concise cheat-sheet of commands and tools for scraping, saving, hoarding, archiving, collecting, organising and browsing data.
Inspired by Reddit's /r/DataHoarder
Which archiving tool should you choose for each web service?
- Amazon Video: Unknown. Check torrents instead.
- BBC iPlayer: youtube-dl / yt-dlp
- Discord: DiscordChatExporter (see below for notes)
- Mediawiki website: Native dump using /wiki/Special:AllPages and /wiki/Special:Export.
- Netflix: Unknown. Check torrents instead.
- Reddit: Various tools
  - Tools to save whole threads
  - "Print" method for threads
    - Change www.reddit.com to old.reddit.com; all comments will now be expanded
    - Sort by: New
    - Use the cleanly print Chrome extension
    - Click to remove areas, also click to 'tag' areas for printing.
  - Historical data dumps: the-eye / torrents
- SoundCloud: youtube-dl / yt-dlp
- Tumblr: TumblThreeApp (Windows). Viewers: 1, 2.
- Twitter: ThreadReaderApp
- Torrents: Use unblockit for a list of torrent sites. Official Twitter / Reddit.
- Private torrent trackers: Might contain any TV show or movie ever broadcast. It can be difficult to get an invite, and you may need to maintain an upload ratio.
- Individual web pages:
  - Save as | Web Page, HTML Only
  - Save as | Web Page, Single File
  - Save as | Web Page, Complete
  - Print | Save as PDF
  - Chrome extension SingleFile <-- Recommended!
- Websites generally: wget, httrack or ArchiveBot.
- YouTube video/music: youtube-dl (see below for notes) / yt-dlp
- Radio scrobbling / Music identification: Shazam or AHA Music finder
  - Radio scrobbling
    - Play radio station with low quality playlist: La Mega, Malaga.
    - Install the Chrome browser extension Shazam or AHA Music finder
    - On Linux use xdotool to automate clicking on the Chrome browser extension icons to activate music identification (adjust coords as needed):
        watch "xdotool mousemove 3442 90 click 1; sleep 20; xdotool mousemove 3476 90 click 1; sleep 20"
    - Does not require speakers to be on
Details of precise sets of commands.
- wget for websites
    wget \
      -e 'robots=off' \
      --accept '*.*' \
      --mirror \
      --wait 2 \
      --random-wait \
      --convert-links \
      --user-agent 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36' \
      'http://www.example.com/'
- StreamRipper for music
  - Example:
      streamripper ###URL### -u "FreeAmp/2.x" -q -l 86400
- Chrome DevTools for anything via a web browser
- Mediawiki for wiki sites
  - For an XML dump containing wikitext...
    - Copy names of pages from /wiki/Special:AllPages ...
    - Paste into /wiki/Special:Export
    - (optional) Parse resulting wikitext with mwparserfromhell.
- youtube-dl / yt-dlp for YouTube and other video/audio
  - Video
      yt-dlp \
        --ignore-errors \
        --format 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' \
        --output "%(playlist_title)s/%(title)s.%(ext)s" \
        --throttled-rate 10K \
        ###URL###
  - Audio
      yt-dlp \
        --ignore-errors \
        --extract-audio \
        --audio-quality 0 \
        --audio-format mp3 \
        --prefer-ffmpeg \
        --output "%(playlist_title)s/%(artist)s - %(title)s.%(ext)s" \
        --throttled-rate 10K \
        ###URL###
  - Audio album playlist
      yt-dlp \
        ...etc... \
        --output "%(artist)s - %(album)s/%(artist)s - %(album)s - %(playlist_index)02d - %(track)s.%(ext)s" \
        ###URL###
  - Video playlist
      yt-dlp \
        ...etc... \
        --output "%(playlist_title)s/%(playlist_index)03d - %(artist)s - %(title)s.%(ext)s" \
        ###URL###
  - Multiple playlists (one URL per line in the file "list")
      # Read line by line so URLs are never word-split or glob-expanded.
      while IFS= read -r URL
      do
        # </dev/null stops yt-dlp/ffmpeg from consuming the rest of the list
        yt-dlp ...etc... "$URL" </dev/null
      done < list
    Alternatively, pass the file directly: yt-dlp ...etc... --batch-file list
- DiscordChatExporter + excellent wiki
  - Example:
      docker run --rm -v /var/www/zaphod/adhd:/app/out tyrrrz/discordchatexporter:stable export --channel ###ID### --token ###SECRET### --format Json
  - List guilds:
      docker run --rm tyrrrz/discordchatexporter:stable guilds --token ###SECRET###
  - List channels:
      docker run --rm tyrrrz/discordchatexporter:stable channels --guild ###ID### --token ###SECRET###
- jq
  - Example:
      jq -j -M --stream -f discord1.jq
    (filter file: discord1.jq)
- XPath Helper
  - Example: Ctrl-Shift-X (or Command-Shift-X on Mac)
- HAR recorders (if for some reason Chrome's "Save as HAR" feature isn't sufficient)
- HAR extractors (to retrieve the original content from inside a HAR file)
  - After extracting the .m3u8 and .ts files from the HAR, run something like:
      ffmpeg -i playlist.m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
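A HAR file is plain JSON, so the extraction step can be sketched in a few lines of Node.js. This is a minimal sketch, not the code of any particular HAR extractor: the function name `extractHarBodies` and the file name `capture.har` are hypothetical. It assumes the HAR 1.2 layout, where each entry keeps the response body in `response.content.text`, base64-encoded when `response.content.encoding` is `"base64"`.

```javascript
// Minimal sketch: pull response bodies out of a parsed HAR object.
// extractHarBodies is a hypothetical helper name, not part of any tool.
function extractHarBodies(har) {
  const results = [];
  for (const entry of har.log.entries) {
    const content = entry.response && entry.response.content;
    // Entries saved without content have no "text" field; skip them.
    if (!content || typeof content.text !== "string") continue;
    // Binary bodies are stored base64-encoded; text bodies are stored as-is.
    const body = content.encoding === "base64"
      ? Buffer.from(content.text, "base64")
      : Buffer.from(content.text, "utf8");
    results.push({ url: entry.request.url, mimeType: content.mimeType, body });
  }
  return results;
}

// Usage (Node.js), assuming a capture saved as capture.har:
//   const fs = require("fs");
//   const har = JSON.parse(fs.readFileSync("capture.har", "utf8"));
//   for (const { url, body } of extractHarBodies(har)) {
//     console.log(url, body.length);
//   }
```

Writing each `body` to a file named after the last path segment of `url` would give the .m3u8 and .ts files that the ffmpeg command expects. If the HAR was saved without content, there are no bodies to extract.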
Input: https://music.youtube.com/library/playlists
Goal: Extract a list of playlists suitable for feeding to youtube-dl / yt-dlp
These are alternative ways to attempt the same thing; results are noted where known:
- Chrome: Save As | Web Page, HTML Only --> doesn't work, empty page
- Chrome: Save As | Web Page, Single File --> works, full HTML, embeds images, uses "quoted-printable encoding", i.e. = becomes =3D
- Chrome: Save As | Web Page, Complete --> works, full HTML, not encoded, saves album/playlist covers as image files.
- Chrome: DevTools | Elements | right-click | Copy | Copy element | Paste into text editor --> works, full HTML
- Chrome: Extensions | XPath Helper | Ctrl-Shift-X | Hover over element | Shift | Edit XPath to remove e.g. [409] | Append /@href --> works, list of URLs
- Chrome: DevTools | Console | $(document).xpathEvaluate('//body/div/foo') (the console's built-in $x('//body/div/foo') also evaluates XPath)
- Chrome: DevTools | Elements | right-click | Copy | Copy JS path | (paste into console and edit - see snippet below)
- Chrome: Extensions | AutoHAR | chrome --auto-open-devtools-for-tabs | ...etc
- Chrome: DevTools | Network | Filter | Fetch/XHR | https://music.youtube.com/youtubei/v1/browse/...etc... | (a) Save all as HAR with content, (b) (down-arrow near top-right) Export HAR...
- (Idea) Headless Chrome + puppeteer or playwright
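The "Single File" save is quoted-printable encoded, so attributes arrive as `href=3D"..."` and long lines are folded with a trailing `=`. Before URLs can be pulled out of such a file, the text needs decoding. A minimal Node.js sketch follows, assuming the underlying bytes are UTF-8; `decodeQuotedPrintable` is a hypothetical helper name, and the sketch does not parse the MIME part boundaries that surround the HTML in the saved file.

```javascript
// Minimal sketch: decode quoted-printable text (assumes UTF-8 content).
// decodeQuotedPrintable is a hypothetical helper, not a built-in.
function decodeQuotedPrintable(input) {
  const byteString = input
    .replace(/=\r?\n/g, "")                       // drop soft line breaks
    .replace(/=([0-9A-Fa-f]{2})/g, (_, hex) =>    // "=3D" -> "=" and so on
      String.fromCharCode(parseInt(hex, 16)));
  // Each character now stands for one byte; reinterpret the bytes as UTF-8.
  return Buffer.from(byteString, "latin1").toString("utf8");
}

// Usage:
//   decodeQuotedPrintable('href=3D"/playlist?list=3DPL123"')
//   // returns 'href="/playlist?list=PL123"'
```

The "Web Page, Complete" route avoids this step entirely, which is one reason to prefer it when only the links are needed.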
Javascript snippet:
    // For each playlist tile, log its link, the third subtitle span, and the link text.
    const items = document.querySelectorAll("#items > ytmusic-two-row-item-renderer");
    items.forEach((item) => {
      const drill = item.querySelector("div.details.style-scope.ytmusic-two-row-item-renderer");
      if (!drill) { return; }  // tile without a details block
      const span = drill.querySelector('span > yt-formatted-string > span:nth-child(3)');
      if (!span) { return; }   // skip tiles that lack the third subtitle span
      const link = drill.querySelector('a');
      console.log(link.toString() + " " + span.innerHTML + " " + link.text);
    });
Shorter snippet:
    var output = '';
    document.querySelectorAll("h3 > div > div > a").forEach((item) => { output += item.text + "\n"; });
    console.log(output);
    console.save(output);  // not built in; needs the console.save() helper described below
Save data out of the console via the clipboard or by writing a file (provides the console.save() command).
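For orientation, a helper of this kind can be sketched as follows. This is an assumption about how such a command might work, not the code of the tool referenced above: `toSaveText` and the default file name `console.txt` are hypothetical. The download trick (a temporary link with a `download` attribute pointing at a Blob URL) only works where a DOM exists, so the definition is guarded.

```javascript
// Hypothetical helper: turn any value into text suitable for saving.
function toSaveText(data) {
  return typeof data === "string" ? data : JSON.stringify(data, null, 2);
}

// Define console.save only where a DOM exists (i.e. in a browser console).
if (typeof document !== "undefined") {
  console.save = function (data, filename = "console.txt") {
    const blob = new Blob([toSaveText(data)], { type: "text/plain" });
    const link = document.createElement("a");
    link.href = URL.createObjectURL(blob);
    link.download = filename;  // ask the browser to download, not navigate
    document.body.appendChild(link);
    link.click();
    link.remove();
    // Release the Blob URL once the download has had time to start.
    setTimeout(() => URL.revokeObjectURL(link.href), 1000);
  };
}
```

For the clipboard route, Chrome's console also has a built-in `copy(output)` utility, which needs no helper at all.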
- naive-slack-scraper. Hypothetical code that cannot exist, as it potentially wouldn't follow terms of service. So don't look for it.
- pokemon-data. jq examples.
- moar jq examples
- If an archive of data is made, and that data cannot be viewed reasonably easily in a way similar to its original presentation by a person on the street, then it can be considered not to be viewable at all. It may as well not exist for public purposes. A possible retort is to assert "A viewer program could be built". But if that viewer program doesn't yet exist, then the data still can't be viewed. It's a Schroedinger's archive.