Fully processed StumbleUpon data extracted from the Wayback Machine, for an article.
- `/data-parsed/`
  - `parsed-cleaned.csv`: Final deduplicated extracted data.
  - `parsed.csv`: Data before deduplication.
- `/data-raw/`: Output of `waybackpack`, organised by timestamp and URL.
- `/samples/`: Examples of the downloaded HTML, an individual StumbleUpon link, and the resulting CSV data.
- `/url-analysis/`: The raw URLs from `parsed-cleaned.csv`, plus their status codes using `vl`.
- `clean_stumbleupon_metadata.py`: Tool to deduplicate a CSV by its `id` field (converts `parsed.csv` into `parsed-cleaned.csv`; see the dedup sketch after this list).
- `extract_stumbleupon_metadata.py`: Tool to extract the contents of the downloaded StumbleUpon pages (converts the `data-raw` contents into `parsed.csv`; see the extraction sketch after this list).
- `analyse_stumbleupon_metadata.py`: Miscellaneous code to analyse the parsed data. This changes as required; the full scripts are available in the original article.
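Below is a minimal sketch of the extraction step, assuming the `waybackpack` snapshots sit under `data-raw/<timestamp>/...` and that each archived discover page lists items carrying an `id` attribute. The `li.listLi` selector and the `title` field are hypothetical stand-ins; the real script targets whatever markup the archived StumbleUpon pages actually use:

```python
# Sketch of extract_stumbleupon_metadata.py's job: data-raw -> parsed.csv.
import csv
from pathlib import Path

from bs4 import BeautifulSoup

RAW_DIR = Path("data-raw")   # waybackpack output, organised by timestamp and URL
OUT_CSV = Path("parsed.csv")


def parse_snapshot(html: str) -> list[dict]:
    """Pull per-link metadata out of one archived discover page."""
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for item in soup.select("li.listLi"):  # hypothetical selector
        rows.append({
            "id": item.get("id", ""),      # the id field the dedup step keys on
            "title": item.get_text(" ", strip=True),  # hypothetical field
        })
    return rows


def main() -> None:
    all_rows = []
    # Every snapshot repeats earlier links, hence the dedup pass afterwards.
    for page in sorted(p for p in RAW_DIR.rglob("*") if p.is_file()):
        all_rows.extend(parse_snapshot(page.read_text(errors="ignore")))
    with OUT_CSV.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title"])
        writer.writeheader()
        writer.writerows(all_rows)


if __name__ == "__main__":
    main()
```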
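The dedup step is little more than a pandas one-liner; this sketch assumes only what the listing above states, that `parsed.csv` carries an `id` column:

```python
# Sketch of clean_stumbleupon_metadata.py's job: parsed.csv -> parsed-cleaned.csv.
import pandas as pd

df = pd.read_csv("parsed.csv")
# Snapshots overlap heavily, so the same id appears many times; keep the first.
df.drop_duplicates(subset="id").to_csv("parsed-cleaned.csv", index=False)
```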
To recreate the final output (`parsed-cleaned.csv`):
- Install Python dependencies (`pip install beautifulsoup4 lxml pandas`)
- Run the Wayback Machine download script (`waybackpack http://www.stumbleupon.com/discover/toprated/ -d "/Projects/StumbleUpon-extract/data-raw"`)
- Run the parsing script (`python extract_stumbleupon_metadata.py`)
- Run the deduping script (`python clean_stumbleupon_metadata.py`)
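The steps above can also be chained with a small driver. This is just a convenience sketch, assuming `waybackpack` is on `PATH` and the two scripts sit in the working directory; the paths mirror the commands listed above:

```python
# Convenience driver for the recreation steps above; a sketch, not part of the repo.
import subprocess
import sys

# 1. Install Python dependencies into the current environment.
subprocess.run([sys.executable, "-m", "pip", "install",
                "beautifulsoup4", "lxml", "pandas"], check=True)

# 2. Pull every archived snapshot of the top-rated discover page.
subprocess.run(["waybackpack", "http://www.stumbleupon.com/discover/toprated/",
                "-d", "/Projects/StumbleUpon-extract/data-raw"], check=True)

# 3. Parse the downloaded HTML into parsed.csv, then dedupe into parsed-cleaned.csv.
subprocess.run([sys.executable, "extract_stumbleupon_metadata.py"], check=True)
subprocess.run([sys.executable, "clean_stumbleupon_metadata.py"], check=True)
```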