This repository has been archived by the owner on Jun 30, 2023. It is now read-only.
Home
Aaron Taylor edited this page Jul 1, 2014 · 14 revisions
- Nutch Apache Web Crawler: http://nutch.apache.org/index.html
- built on Hadoop; very powerful and scalable
- Setup with MySQL: http://nlp.solutions.asia/?p=362
- API Documentation: http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html
- Crawling: http://wiki.apache.org/nutch/Nutch2Crawling
- Data Mining Platform: http://www.slideshare.net/abial/nutch-as-a-web-data-mining-platform
- Carrot2 framework: http://project.carrot2.org/index.html
- gathers search results into categories; could be used to find which pages within an institutional website contain calendaring information, so they can be marked for analysis
- Apache Tika: http://tika.apache.org
- getting started: http://tika.apache.org/1.5/gettingstarted.html
- GATE: http://gate.ac.uk
- User Guide: http://gate.ac.uk/sale/tao/split.html
- overview lecture: http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf
- built-in information extraction system: ANNIE
- Jet (Java Extraction Toolkit): http://cs.nyu.edu/grishman/jet/license.html
- language analysis tools; could be useful for phase 2 analysis
- OpenNN: http://www.intelnics.com/opennn
- Wikipedia Page: http://en.wikipedia.org/wiki/OpenNN
- RapidMiner: http://en.wikipedia.org/wiki/RapidMiner
- LingPipe: http://alias-i.com/lingpipe/
- Text processing
- Intro to scraping: http://ruby.bastardsbook.com/chapters/web-scraping/
- Nokogiri: http://ruby.bastardsbook.com/chapters/html-parsing/
- Mechanize: http://readysteadycode.com/howto-scrape-websites-with-ruby-and-mechanize
- Library Index: https://www.ruby-toolbox.com/categories/Web_Content_Scrapers
- RSS feed scraper: https://github.com/feedjira/feedjira
- Mechanize and Nokogiri Example: http://www.icicletech.com/blog/web-scraping-with-ruby-using-mechanize-and-nokogiri-gems
- iCalendar
- https://github.com/icalendar/icalendar
- https://github.com/sam-github/vpim
- https://github.com/rubyredrick/ri_cal
- Part 1: http://www.packtpub.com/article/web-scraping-with-python
- Part 2: http://www.packtpub.com/article/web-scraping-with-python-part-2
- Diffbot scraping
- Custom APIs: http://www.diffbot.com/products/custom/
- Crawlbot: http://www.diffbot.com/products/crawlbot/
- contact them about a custom Calendar API in line with their other services
- Morph: https://morph.io
- source code repo: https://github.com/openaustralia/morph
- Crawlera: http://crawlera.com
- Import.io: https://import.io
- HN comment thread: https://news.ycombinator.com/item?id=7582858
- AlchemyAPI: http://www.alchemyapi.com
- text analysis and data gathering; may be beyond what we need for straight database input of events
- Top 10 data mining algorithms (PDF hosted at UVM): http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
- Everything Algorithm: http://gigaom.com/2014/05/23/meet-the-algorithm-that-can-learn-everything-about-anything/
- Machine Learning: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms
- Data mining Books: http://christonard.com/12-free-data-mining-books/
- Machine Learning Basics Lectures: http://homepages.inf.ed.ac.uk/vlavrenk/iaml.html
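
The scraping and iCalendar links above suggest a two-step flow: pull event details out of a page, then write them out as calendar data. A minimal sketch of that flow, using only the Python standard library rather than any of the tools listed here. The markup it expects (`<li class="event" data-start="...">`) and the names `EventParser` and `to_ical` are assumptions made up for illustration; real institutional pages will need their own selectors, and a full iCalendar library would also handle time zones and line folding, which this omits.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Collect (title, start) pairs from hypothetical markup:
    <li class="event" data-start="2014-07-01T18:00">Title</li>
    """

    def __init__(self):
        super().__init__()
        self.events = []
        self._start = None  # data-start of the event <li> being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An event <li> without data-start is skipped (_start stays None).
        if tag == "li" and attrs.get("class") == "event":
            self._start = attrs.get("data-start")
            self._text = []

    def handle_data(self, data):
        if self._start is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._start is not None:
            title = "".join(self._text).strip()
            self.events.append((title, datetime.fromisoformat(self._start)))
            self._start = None


def escape_text(text):
    """Escape characters that are special in iCalendar TEXT values."""
    return (text.replace("\\", "\\\\").replace(";", "\\;")
                .replace(",", "\\,").replace("\n", "\\n"))


def to_ical(events, now=None):
    """Render events as an iCalendar string with floating (no time zone) start times."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # DTSTAMP is written in UTC
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0",
             "PRODID:-//example//scraper sketch//EN"]
    for i, (title, start) in enumerate(events):
        dtstart = start.strftime("%Y%m%dT%H%M%S")
        lines += ["BEGIN:VEVENT",
                  f"UID:{i}-{dtstart}@example.invalid",
                  f"DTSTAMP:{stamp}",
                  f"DTSTART:{dtstart}",
                  f"SUMMARY:{escape_text(title)}",
                  "END:VEVENT"]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end in CRLF


if __name__ == "__main__":
    parser = EventParser()
    parser.feed('<ul><li class="event" data-start="2014-07-01T18:00">'
                'Board meeting</li></ul>')
    print(to_ical(parser.events))
```

Keeping extraction and output as separate steps means the parser can later be swapped for Nokogiri, Mechanize, or a hosted service from the list without touching the calendar-writing side.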