awesome-data-hoarding

A concise cheat-sheet of commands and tools for scraping, saving, hoarding, archiving, collecting, organising and browsing data.

Inspired by Reddit's /r/DataHoarder

Quick reference

Which archiving tool should you choose for each web service?

Amazon Video: Unknown. Check torrents instead.
BBC iPlayer: youtube-dl / yt-dlp
Discord: DiscordChatExporter (see below for notes)
Mediawiki website: Native dump using /wiki/Special:AllPages and /wiki/Special:Export.
Netflix: Unknown. Check torrents instead.
Reddit: Various tools
- Tools to save whole threads
- "Print" method for threads
  - Change www.reddit.com to old.reddit.com -- all comments will now be expanded
  - Sort by: New
  - Use cleanly print chrome extension
  - Click to remove areas, also click to 'tag' areas for printing.
- Historical data dumps: the-eye / torrents
SoundCloud: youtube-dl / yt-dlp
Tumblr: TumblThreeApp (Windows). Viewers: 1, 2.
Twitter: ThreadReaderApp
Torrents: Use unblockit for a list of torrent sites. Official Twitter / Reddit.
Private torrent trackers: Might contain any TV show or movie ever broadcast. It can be difficult to get an invite, and you may need to maintain an upload ratio.
Individual web pages:
- Save as | Web Page, HTML Only
- Save as | Web Page, Single File
- Save as | Web Page, Complete
- Print | Save as PDF
- Chrome extension SingleFile <-- Recommended!
Websites generally: wget, httrack or ArchiveBot.
Youtube video/music: youtube-dl (see below for notes) / yt-dlp
Radio scrobbling / Music identification: Shazam or AHA Music finder

Scraping tools

Radio scrobbling
- Play radio station with low quality playlist: La Mega, Malaga.
- Install the Chrome browser extension Shazam or AHA Music finder
- On Linux use xdotool to automate clicking on chrome browser extension icons to activate music identification: watch "xdotool mousemove 3442 90 click 1; sleep 20; xdotool mousemove 3476 90 click 1; sleep 20" (adjust coords as needed)
- Does not require speakers to be on

Precise command lines for each tool:

wget for websites

wget \
    -e 'robots=off' \
    --accept '*.*' \
    --mirror \
    --wait 2 \
    --random-wait \
    --convert-links \
    --user-agent 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.7113.93 Safari/537.36' \
    'http://www.example.com/'

StreamRipper for music
- Example: streamripper ###URL### -u "FreeAmp/2.x" -q -l 86400
Chrome DevTools for anything via a web browser
- network tab
- resources tab
Mediawiki for wiki sites
- For an XML dump containing wikitext...
- Copy names of pages from /wiki/Special:AllPages...
- Paste into /wiki/Special:Export
- (optional) Parse resulting wikitext with mwparserfromhell.
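The copy-and-paste steps above can also be scripted. The following is a minimal sketch, assuming the wiki serves Special:Export under the usual /wiki/ path and accepts the standard `pages` and `curonly` form fields. The helper name `build_export_request` and the wiki.example.org address are hypothetical. It only builds the request; sending it is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_export_request(wiki_base, titles, current_only=True):
    """Build a POST request for Special:Export (hypothetical helper).

    wiki_base is assumed to look like 'https://wiki.example.org/wiki'.
    titles are the page names copied from Special:AllPages.
    """
    # Special:Export expects one page title per line in the 'pages' field.
    fields = {"pages": "\n".join(titles)}
    if current_only:
        # Ask for the latest revision only, not the full history.
        fields["curonly"] = "1"
    body = urlencode(fields).encode("utf-8")
    return Request(f"{wiki_base}/Special:Export", data=body, method="POST")
```

Passing the result to `urllib.request.urlopen()` should return the XML dump, although some wikis restrict or rate-limit exports.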
youtube-dl / yt-dlp for Youtube and other video/audio
- Video

yt-dlp \
    --ignore-errors \
    --format 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' \
    --output "%(playlist_title)s/%(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###

Audio

yt-dlp \
    --ignore-errors \
    --extract-audio \
    --audio-quality 0 \
    --audio-format mp3 \
    --prefer-ffmpeg \
    --output "%(playlist_title)s/%(artist)s - %(title)s.%(ext)s" \
    --throttled-rate 10K \
    ###URL###

Audio album playlist

yt-dlp \
    ...etc... \
    --output "%(artist)s - %(album)s/%(artist)s - %(album)s - %(playlist_index)02d - %(track)s.%(ext)s" \
    ###URL###

Video playlist

yt-dlp \
    ...etc... \
    --output "%(playlist_title)s/%(playlist_index)03d - %(artist)s - %(title)s.%(ext)s" \
    ###URL###

Multiple playlists

# Read one URL per line; safer than $(cat list), which word-splits and globs.
while IFS= read -r URL
do
    yt-dlp ...etc... "$URL"
done < list
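For longer lists it can help to skip blank lines and keep going when one playlist fails. Below is a rough Python sketch of the same loop. The function names are hypothetical, treating lines that start with `#` as comments is an assumption of this sketch rather than a yt-dlp feature, and only the `--ignore-errors` flag from the examples above is passed by default.

```python
import subprocess
from pathlib import Path


def read_urls(text):
    """Return the non-blank, non-comment lines of the list file's text."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_command(url, extra_args=("--ignore-errors",)):
    """Build the argument list for a single yt-dlp invocation."""
    return ["yt-dlp", *extra_args, url]


def download_all(list_path="list"):
    """Run yt-dlp once per URL in the file; requires yt-dlp on PATH."""
    for url in read_urls(Path(list_path).read_text()):
        # check=False: a failed playlist does not stop the remaining ones.
        subprocess.run(build_command(url), check=False)
```

Calling `download_all("list")` then behaves like the shell loop, with the URLs passed as separate arguments so no shell quoting is involved.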

DiscordChatExporter + excellent wiki
- Example: docker run --rm -v /var/www/zaphod/adhd:/app/out tyrrrz/discordchatexporter:stable export --channel ###ID### --token ###SECRET### --format Json
- List guilds: docker run tyrrrz/discordchatexporter:stable guilds
- List channels: docker run tyrrrz/discordchatexporter:stable channels --guild ###ID###

Processing tools

jq
- Example: jq -j -M --stream -f discord1.jq discord1.json
XPath Helper
- Example: Ctrl-Shift-X (or Command-Shift-X on Mac)
HAR recorders (if for some reason Chrome's "Save as HAR" feature isn't sufficient)
- AutoHAR
- HAR Recorder
HAR extractors (to retrieve the original content from inside a HAR file)
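If no extractor is at hand, a HAR file is plain JSON and can be unpacked with a short script. This is a minimal sketch, assuming the standard HAR layout (`log.entries[].response.content` with optional `text` and `encoding` fields). The function names are hypothetical.

```python
import base64
import json


def load_har(path):
    """Load a HAR file from disk; HAR is plain JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extract_responses(har):
    """Yield (url, mime_type, body_bytes) for every entry with a saved body."""
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            # The body was not recorded, e.g. the HAR was saved without content.
            continue
        if content.get("encoding") == "base64":
            body = base64.b64decode(text)  # binary bodies are base64-encoded
        else:
            body = text.encode("utf-8")  # textual bodies are stored as-is
        yield entry["request"]["url"], content.get("mimeType", ""), body
```

Each yielded body can then be written to a file named after the URL, which is enough to recover .ts segments or JSON API responses.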

Techniques

Combine streamed .ts files and m3u8 playlist/chunklist into an mpeg/mp4 video

After extracting the .m3u8 and .ts files from the HAR, run something like:
- ffmpeg -i playlist.m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
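Before running ffmpeg it can be worth checking that every segment the playlist refers to was actually extracted. The sketch below assumes a media playlist (chunklist) rather than a master playlist, and relies on the m3u8 convention that lines starting with `#` are tags while every other non-blank line is a URI. The function names are hypothetical.

```python
def list_segments(playlist_text):
    """Return the URIs referenced by an m3u8 media playlist, in order."""
    return [
        line.strip()
        for line in playlist_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def missing_segments(playlist_text, available_names):
    """Return the referenced URIs whose file name is not in available_names."""
    available = set(available_names)
    return [
        uri
        for uri in list_segments(playlist_text)
        # Compare on the last path component, ignoring any query string.
        if uri.split("?", 1)[0].rsplit("/", 1)[-1] not in available
    ]
```

An empty result from `missing_segments(open("playlist.m3u8").read(), os.listdir("."))` suggests that all segments are present. The playlist may still need its URIs rewritten to local file names before ffmpeg can read it.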

Extract playlist data from YouTube and YT Music

Input: https://music.youtube.com/library/playlists
Goal: Extract a list of playlists suitable for feeding to youtube-dl / yt-dlp

These are alternative ways to attempt the same thing, with the result of each noted:

Chrome: Save As | Web Page, HTML Only --> doesn't work, empty page
Chrome: Save As | Web page, Single File --> works, full HTML, embeds images, uses "quoted printable encoding", i.e. = becomes =3D
Chrome: Save As | Web page, Complete --> works, full HTML, not encoded, saves album/playlist covers as image files.
Chrome: DevTools | Elements | right-click | Copy | Copy element | Paste into text editor --> works, full HTML
Chrome: Extensions | XPath Helper | Ctrl-Shift-X | Hover over element | Shift | Edit XPath to remove e.g. [409] | Append /@href --> works, list of URLs
Chrome: DevTools | Console | $x('//body/div/foo')
Chrome: DevTools | Elements | right-click | Copy | Copy JS | (paste into console and edit - see snippet below)
Chrome: Extensions | AutoHAR | chrome --auto-open-devtools-for-tabs | ...etc
Chrome: DevTools | Network | Filter | Fetch/XHR | https://music.youtube.com/youtubei/v1/browse/...etc... | (a) Save all as HAR with content, (b) (down-arrow near top-right) Export HAR...
(Idea) Headless chrome + puppeteer or playwright
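For the "Single File" route, the quoted-printable encoding can be undone with Python's standard library. This is a minimal sketch, assuming the saved file is an MHTML document, that is, a MIME multipart message whose HTML part is quoted-printable encoded. The function name `html_parts` is hypothetical.

```python
import email


def html_parts(mhtml_bytes):
    """Return the decoded body of every text/html part in an MHTML file."""
    message = email.message_from_bytes(mhtml_bytes)
    return [
        # decode=True reverses the Content-Transfer-Encoding,
        # e.g. quoted-printable '=3D' becomes '=' again.
        part.get_payload(decode=True)
        for part in message.walk()
        if part.get_content_type() == "text/html"
    ]
```

The decoded bytes can then be searched for playlist links in the same way as the output of the "Complete" save.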

Javascript snippet:

items = document.querySelectorAll("#items > ytmusic-two-row-item-renderer");
items.forEach((item) => {
    drill = item.querySelector("div.details.style-scope.ytmusic-two-row-item-renderer");
    span = drill.querySelector('span > yt-formatted-string > span:nth-child(3)');
    if (! span) { return };
    console.log(
        drill.querySelector('a').toString()
        + "    " + span.innerHTML
        + "    " + drill.querySelector('a').text
    );
});

Shorter snippet:

var output = '';
document.querySelectorAll("h3 > div > div > a").forEach((item) => { output += item.text + "\n"; });
console.log(output);
console.save(output);

Save data out of the console via the clipboard or by writing a file. The latter relies on a helper script that provides the console.save() command; it is not built into Chrome.

Case studies

naive-slack-scraper. Hypothetical code that cannot exist, as it potentially wouldn't follow terms of service. So don't look for it.
pokemon-data. jq examples.
moar jq examples

Discussion

If an archive of data is made, and that data cannot be viewed reasonably easily in a way similar to its original presentation by a person on the street, then it can be considered not to be viewable at all. It may as well not exist for public purposes. A possible retort is to assert "A viewer program could be built". But if that viewer program doesn't yet exist, then the data still can't be viewed. It's a Schroedinger's archive.
