This is a Python script that crawls a website and saves the text content of each page to a text file. It also extracts every hyperlink on each page and follows those that stay within the same domain, continuing the crawl.
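As a rough illustration, the core crawl loop might look something like the sketch below. This is a minimal, hypothetical version that assumes the script uses requests and BeautifulSoup; the actual implementation, file naming, and error handling may differ.

```python
import os
from urllib.parse import urljoin, urlparse

import requests                    # assumed dependency
from bs4 import BeautifulSoup      # assumed dependency

domain = "example.com"             # placeholder domain
full_url = "https://example.com/"  # placeholder start URL

visited = set()
queue = [full_url]

# Output goes under text/<domain>/, mirroring the layout described below.
os.makedirs(os.path.join("text", domain), exist_ok=True)

while queue:
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Save the page's visible text; the real naming scheme may differ.
    filename = url.replace("https://", "").replace("/", "_") + ".txt"
    with open(os.path.join("text", domain, filename), "w", encoding="utf-8") as f:
        f.write(soup.get_text())

    # Follow only links that stay on the same domain.
    for link in soup.find_all("a", href=True):
        absolute = urljoin(url, link["href"])
        if urlparse(absolute).netloc == domain and absolute not in visited:
            queue.append(absolute)
```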
Python 3.x
Works on Linux, Windows, macOS, BSD
Install dependencies:
pip install -r requirements.txt
To use this script, set the domain and full_url variables to the domain and full URL of the website you want to crawl, then run the script in your Python environment.
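For example (the values below are placeholders, and crawler.py stands in for whatever the script file is actually named):

```python
# Placeholder values; substitute the site you want to crawl.
domain = "docs.example.com"
full_url = "https://docs.example.com/"
```

Then run it from a terminal with python crawler.py.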
The script creates a text directory next to the script itself; inside it is a subdirectory named after the crawled domain, containing one text file per page crawled.
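For instance, after crawling the hypothetical site above, the output might be laid out like this (the exact file names depend on how the script maps URLs to file names):

```
text/
└── docs.example.com/
    ├── docs.example.com_.txt
    ├── docs.example.com_getting-started.txt
    └── docs.example.com_api-reference.txt
```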
Note: It is recommended that you only crawl websites with permission from their owners.