This is a Python script that crawls a website and saves the text content of each page to a text file. It also extracts every hyperlink on each page and follows those that stay within the same domain, continuing the crawl.
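As a rough illustration, the core crawl loop might look something like the sketch below. This is a minimal, hypothetical version that assumes the script uses requests and BeautifulSoup; the actual implementation, file naming, and error handling may differ.

```python
import os
from urllib.parse import urljoin, urlparse

import requests                    # assumed dependency
from bs4 import BeautifulSoup      # assumed dependency

domain = "example.com"             # placeholder domain
full_url = "https://example.com/"  # placeholder start URL

visited = set()
queue = [full_url]

# Output goes under text/<domain>/, mirroring the layout described below.
os.makedirs(os.path.join("text", domain), exist_ok=True)

while queue:
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Save the page's visible text; the real naming scheme may differ.
    filename = url.replace("https://", "").replace("/", "_") + ".txt"
    with open(os.path.join("text", domain, filename), "w", encoding="utf-8") as f:
        f.write(soup.get_text())

    # Follow only links that stay on the same domain.
    for link in soup.find_all("a", href=True):
        absolute = urljoin(url, link["href"])
        if urlparse(absolute).netloc == domain and absolute not in visited:
            queue.append(absolute)
```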
Python 3.x
Works on Linux, Windows, macOS, BSD
Install dependencies:
pip install -r requirements.txt
To use this script, set the domain and full_url variables to the domain and full URL of the website you want to crawl, then run the script in your Python environment.
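For example (the values below are placeholders, and crawler.py stands in for whatever the script file is actually named):

```python
# Placeholder values; substitute the site you want to crawl.
domain = "docs.example.com"
full_url = "https://docs.example.com/"
```

Then run it from a terminal with python crawler.py.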
The script creates a text directory next to the script itself; inside it is a subdirectory named after the crawled domain, containing one text file per page crawled.
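For instance, after crawling the hypothetical site above, the output might be laid out like this (the exact file names depend on how the script maps URLs to file names):

```
text/
└── docs.example.com/
    ├── docs.example.com_.txt
    ├── docs.example.com_getting-started.txt
    └── docs.example.com_api-reference.txt
```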
Note: It is recommended that you only crawl websites with permission from their owners.