Fully processed StumbleUpon data extracted from the Wayback Machine, for an article.
- `/data-parsed/`
  - `parsed-cleaned.csv`: Final deduplicated extracted data.
  - `parsed.csv`: Data before deduplication.
- `/data-raw/`: Output of `waybackpack`, organised by timestamp and URL.
- `/samples/`: Examples of the downloaded HTML, an individual StumbleUpon link, and the resulting CSV data.
- `/url-analysis/`: The raw URLs from `parsed-cleaned.csv`, plus their status codes using `vl`.
- `clean_stumbleupon_metadata.py`: Tool to deduplicate a CSV by its `id` field (converts `parsed.csv` into `parsed-cleaned.csv`; see the dedup sketch after this list).
- `extract_stumbleupon_metadata.py`: Tool to extract the contents of the downloaded StumbleUpon pages (converts the `data-raw` contents into `parsed.csv`; see the extraction sketch after this list).
- `analyse_stumbleupon_metadata.py`: Miscellaneous code to analyse the parsed data. This changes as required; the full scripts are available in the original article.
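Below is a minimal sketch of the extraction step, assuming the `waybackpack` snapshots sit under `data-raw/<timestamp>/...` and that each archived discover page lists items carrying an `id` attribute. The `li.listLi` selector and the `title` field are hypothetical stand-ins; the real script targets whatever markup the archived StumbleUpon pages actually use:

```python
# Sketch of extract_stumbleupon_metadata.py's job: data-raw -> parsed.csv.
import csv
from pathlib import Path

from bs4 import BeautifulSoup

RAW_DIR = Path("data-raw")   # waybackpack output, organised by timestamp and URL
OUT_CSV = Path("parsed.csv")


def parse_snapshot(html: str) -> list[dict]:
    """Pull per-link metadata out of one archived discover page."""
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for item in soup.select("li.listLi"):  # hypothetical selector
        rows.append({
            "id": item.get("id", ""),      # the id field the dedup step keys on
            "title": item.get_text(" ", strip=True),  # hypothetical field
        })
    return rows


def main() -> None:
    all_rows = []
    # Every snapshot repeats earlier links, hence the dedup pass afterwards.
    for page in sorted(p for p in RAW_DIR.rglob("*") if p.is_file()):
        all_rows.extend(parse_snapshot(page.read_text(errors="ignore")))
    with OUT_CSV.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title"])
        writer.writeheader()
        writer.writerows(all_rows)


if __name__ == "__main__":
    main()
```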
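The dedup step is little more than a pandas one-liner; this sketch assumes only what the listing above states, that `parsed.csv` carries an `id` column:

```python
# Sketch of clean_stumbleupon_metadata.py's job: parsed.csv -> parsed-cleaned.csv.
import pandas as pd

df = pd.read_csv("parsed.csv")
# Snapshots overlap heavily, so the same id appears many times; keep the first.
df.drop_duplicates(subset="id").to_csv("parsed-cleaned.csv", index=False)
```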
To recreate the final output (`parsed-cleaned.csv`):
- Install Python dependencies (`pip install beautifulsoup4 lxml pandas`)
- Run the Wayback Machine download script (`waybackpack http://www.stumbleupon.com/discover/toprated/ -d "/Projects/StumbleUpon-extract/data-raw"`)
- Run the parsing script (`python extract_stumbleupon_metadata.py`)
- Run the deduping script (`python clean_stumbleupon_metadata.py`)
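The steps above can also be chained with a small driver. This is just a convenience sketch, assuming `waybackpack` is on `PATH` and the two scripts sit in the working directory; the paths mirror the commands listed above:

```python
# Convenience driver for the recreation steps above; a sketch, not part of the repo.
import subprocess
import sys

# 1. Install Python dependencies into the current environment.
subprocess.run([sys.executable, "-m", "pip", "install",
                "beautifulsoup4", "lxml", "pandas"], check=True)

# 2. Pull every archived snapshot of the top-rated discover page.
subprocess.run(["waybackpack", "http://www.stumbleupon.com/discover/toprated/",
                "-d", "/Projects/StumbleUpon-extract/data-raw"], check=True)

# 3. Parse the downloaded HTML into parsed.csv, then dedupe into parsed-cleaned.csv.
subprocess.run([sys.executable, "extract_stumbleupon_metadata.py"], check=True)
subprocess.run([sys.executable, "clean_stumbleupon_metadata.py"], check=True)
```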