This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

Home

Aaron Taylor edited this page Jul 1, 2014 · 14 revisions

Scraping Resources

Data Mining

Web Crawling

Nutch Apache Web Crawler: http://nutch.apache.org/index.html
Built on Hadoop; designed to scale to large crawls
Setup with MySQL: http://nlp.solutions.asia/?p=362
API Documentation: http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html
Crawling: http://wiki.apache.org/nutch/Nutch2Crawling
Data Mining Platform: http://www.slideshare.net/abial/nutch-as-a-web-data-mining-platform

Clustering

Carrot2 framework: http://project.carrot2.org/index.html
Groups search results into categories. Could be used to find which pages within an institutional website contain calendaring information, so they can be marked for analysis.
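
The idea of marking pages that contain calendaring information can be tried out without a clustering engine. The following is a minimal sketch of a much simpler keyword-and-date heuristic, not Carrot2 and not clustering. The keyword list, the date pattern, the threshold, and the function names `calendar_score` and `flag_calendar_pages` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list; tune it for the sites being crawled.
CALENDAR_KEYWORDS = ("calendar", "event", "schedule", "rsvp", "upcoming")

# Matches dates such as "July 1, 2014" or "2014-07-01" (an assumed, non-exhaustive pattern).
DATE_PATTERN = re.compile(
    r"\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?\s+\d{1,2},?\s+\d{4}\b"
    r"|\b\d{4}-\d{2}-\d{2}\b",
    re.IGNORECASE,
)


def calendar_score(text):
    """Count keyword occurrences plus date-like strings in a page's text."""
    lowered = text.lower()
    keyword_hits = sum(lowered.count(word) for word in CALENDAR_KEYWORDS)
    date_hits = len(DATE_PATTERN.findall(text))
    return keyword_hits + date_hits


def flag_calendar_pages(pages, threshold=3):
    """Return URLs whose text scores at or above the threshold, highest score first.

    `pages` maps a URL to the plain text already extracted from that page.
    """
    scored = [(calendar_score(text), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score >= threshold]
```

Pages flagged this way could then be handed to a crawler or parser for closer analysis. A real clustering tool would group pages by content rather than rely on a fixed keyword list.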

Content Analysis and Language Processing

Apache Tika: http://tika.apache.org
Getting started: http://tika.apache.org/1.5/gettingstarted.html
GATE: http://gate.ac.uk
User Guide: http://gate.ac.uk/sale/tao/split.html
Overview lecture: http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf
Includes ANNIE, a built-in information extraction system
Jet (Java Extraction Toolkit): http://cs.nyu.edu/grishman/jet/license.html
Language analysis tools; could be useful for phase 2 analysis

Neural Networking

OpenNN: http://www.intelnics.com/opennn
Wikipedia Page: http://en.wikipedia.org/wiki/OpenNN

Paid platform

RapidMiner: http://en.wikipedia.org/wiki/RapidMiner
LingPipe: http://alias-i.com/lingpipe/
Text processing

Machine Learning

Language-specific

Specific Applications

Scraping with Ruby

Intro to scraping: http://ruby.bastardsbook.com/chapters/web-scraping/
Nokogiri: http://ruby.bastardsbook.com/chapters/html-parsing/
Mechanize: http://readysteadycode.com/howto-scrape-websites-with-ruby-and-mechanize
Library Index: https://www.ruby-toolbox.com/categories/Web_Content_Scrapers
RSS feed scraper: https://github.com/feedjira/feedjira
Mechanize and Nokogiri Example: http://www.icicletech.com/blog/web-scraping-with-ruby-using-mechanize-and-nokogiri-gems
iCalendar
https://github.com/icalendar/icalendar
https://github.com/sam-github/vpim
https://github.com/rubyredrick/ri_cal

Scraping with Python
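
As a starting point for this section, here is a minimal sketch of link extraction using only the Python standard library (`html.parser` and `urllib.parse`). The class name `LinkExtractor`, the helper `extract_links`, and the sample markup in the usage note are hypothetical. It is a sketch of the general approach, not a recommendation over dedicated scraping libraries.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, link_text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the anchor currently being read, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/events/1">Spring Concert</a>', "http://example.edu/")` yields one pair containing the absolute URL and the link text. Anchors without an `href` attribute are skipped.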

Third Party Solutions

Diffbot scraping
Custom APIs: http://www.diffbot.com/products/custom/
Crawlbot: http://www.diffbot.com/products/crawlbot/
Contact them about a custom Calendar API, in line with their other services
Morph: https://morph.io
Source code repo: https://github.com/openaustralia/morph
Crawlera: http://crawlera.com
Import.io: https://import.io
HN comment thread: https://news.ycombinator.com/item?id=7582858
AlchemyAPI: http://www.alchemyapi.com
Text analysis and data gathering. May be beyond what we need for straight database input of events.

Data Mining Algorithms

Top 10 data mining algorithms (hosted at UVM): http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
Everything Algorithm: http://gigaom.com/2014/05/23/meet-the-algorithm-that-can-learn-everything-about-anything/
Machine Learning: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms
Data Mining Books: http://christonard.com/12-free-data-mining-books/
Machine Learning Basics Lectures: http://homepages.inf.ed.ac.uk/vlavrenk/iaml.html