These Python modules allow for basic site crawling and ngram extraction. Ngram analysis can help identify content gaps and gauge a page's ability to rank for the keywords, topics, and content it contains.
The threaded crawler parses a starting page's HTML and extracts all internal links from on-page anchors. Those URLs are added to a queue for crawling and link extraction. Individual crawl workers pull the next uncrawled URL from the queue, request the page's HTML, extract links and content, and store the results.
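A rough sketch of that worker loop is shown below. Names like `crawl_queue`, `extract_links`, and `results` are illustrative, not the module's actual API, and this version queues every discovered link (scope filtering is covered in the next sketch):

```python
import queue
import threading
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

crawl_queue = queue.Queue()   # URLs waiting to be crawled
seen = set()                  # URLs already queued, to avoid duplicates
results = {}                  # url -> stored page content
lock = threading.Lock()

def extract_links(html, base_url):
    """Collect absolute URLs from on-page anchors."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def worker():
    while True:
        url = crawl_queue.get()           # pull the next uncrawled URL
        try:
            resp = requests.get(url, timeout=10)
            with lock:
                results[url] = resp.text  # store the page for later analysis
                for link in extract_links(resp.text, url):
                    if link not in seen:  # queue each newly discovered link
                        seen.add(link)
                        crawl_queue.put(link)
        except requests.RequestException:
            pass  # unreachable pages are simply skipped
        finally:
            crawl_queue.task_done()

# Start a few daemon workers, seed the queue, then crawl_queue.join() to drain.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
```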
The threaded crawler includes a URL parser that matches against sub-domains and other portions of a URL to determine whether it should be crawled.
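A simplified version of that check, using the standard library's `urllib.parse` (the allowed-host set here is a stand-in for the module's configurable matching rules):

```python
from urllib.parse import urlparse

# Assumed configuration: hosts (including sub-domains) that are in scope.
ALLOWED_HOSTS = {"example.com", "blog.example.com"}

def should_crawl(url: str) -> bool:
    """Return True only for http(s) URLs whose host is in scope."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS

# should_crawl("https://blog.example.com/post")   -> True
# should_crawl("https://cdn.example.net/img.png") -> False
```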
Redirects are not automatically followed; instead, each destination URL (if it's on the same domain) is added to the queue for later processing.
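With the `requests` library, that behavior might be sketched as follows (the `enqueue` callback is hypothetical):

```python
import requests
from urllib.parse import urljoin, urlparse

def fetch(url, enqueue):
    """Fetch a page without following redirects; same-domain targets are queued."""
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if resp.is_redirect or resp.is_permanent_redirect:
        target = urljoin(url, resp.headers.get("Location", ""))
        if urlparse(target).hostname == urlparse(url).hostname:
            enqueue(target)  # destination is crawled later, not now
        return None
    return resp.text
```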
The ngrammer uses common word boundaries (symbols, punctuation, linebreaks) and stop words to extract meaningful ngrams from source content (ngram length is configurable). Ngrams are generated without beginning or ending stop words but can contain internal stop words, which helps surface longer or more complex phrases.
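A minimal sketch of that extraction logic (the stop-word list is a small sample, and the module's own boundary and stop-word handling may differ):

```python
import re

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}  # sample only

def extract_ngrams(text, n=3):
    """Yield n-grams that neither begin nor end with a stop word."""
    # Symbols, punctuation, and linebreaks act as phrase boundaries,
    # so n-grams never span them; spaces just separate tokens.
    for segment in re.split(r"[^\w' ]+", text.lower()):
        tokens = segment.split()
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in STOP_WORDS and gram[-1] not in STOP_WORDS:
                yield " ".join(gram)

# Internal stop words survive: "ability to rank" is kept even though "to" is
# a stop word, because it doesn't begin or end the phrase.
print(list(extract_ngrams("a page's ability to rank for keywords", n=3)))
# -> ['ability to rank', 'rank for keywords']
```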