- scrapy
Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Project Source: https://github.com/scrapy/scrapy
Project Homepage: http://scrapy.org/
- Pattern
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
Project Source: https://github.com/clips/pattern
Project Homepage: http://www.clips.ua.ac.be/pages/pattern
- portia
Portia is a tool for visually scraping web sites without any programming knowledge.
Project Source: https://github.com/scrapinghub/portia
- python-goose
HTML content and article extractor; a web scraping library in Python.
Project Source: https://github.com/grangier/python-goose
- newspaper
News extraction, article extraction, and content curation in Python.
Project Source: https://github.com/codelucas/newspaper
Project Homepage: http://newspaper.readthedocs.org/en/latest/
- gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Project Source: https://github.com/piskvorky/gensim
Project Homepage: http://radimrehurek.com/gensim/
- distribute_crawler
A distributed web crawler.
Project Source: https://github.com/gnemoug/distribute_crawler
- pyspider
A spider system in Python.
Project Source: https://github.com/binux/pyspider
- tagger
A Python module for extracting relevant tags from text documents.
Project Source: https://github.com/apresta/tagger
- cola
A distributed crawling framework.
Project Source: https://github.com/chineking/cola
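
Most of the scraping tools listed here build on the same basic step: parse a fetched HTML page and pull structured fields out of it. As a rough illustration of that step only, here is a minimal sketch using just the Python standard library. It does not use the API of any project in this list; the class name `LinkAndTitleParser`, the helper `extract`, and the sample HTML and URLs are all hypothetical, chosen for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and absolute link URLs from an HTML string.

    Hypothetical helper for illustration; not part of any listed library.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract(html, base_url):
    """Return a dict with the page title and the list of absolute links."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Inline sample page, so the sketch runs without network access.
SAMPLE_HTML = """
<html><head><title> Example page </title></head>
<body><a href="/about">About</a> <a href="https://example.org/x">X</a></body></html>
"""

result = extract(SAMPLE_HTML, "https://example.com/index.html")
print(result)
```

A real crawler adds fetching, politeness (robots.txt, rate limits), deduplication, and scheduling on top of this step, which is what frameworks such as those above provide.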