SpiderSel provides the following features:
- Crawling of HTTP and HTTPS websites for keywords via Selenium (native JS support)
- Spidering of new URLs found within the source code (adjustable depth, stays same-site)
- Filtering keywords by length and removing nonsense strings (paths, emails, protocol handlers, etc.)
- Storing keywords and ignored strings into a separate results directory (txt files)
Essentially similar to CeWL or CeWLeR, but with support for websites that require JavaScript.
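The core loop is simple: render each page in a real browser so JavaScript executes, harvest words from the rendered source, and queue same-site links up to the requested depth. Below is a minimal sketch of that flow, assuming headless Chrome via Selenium; the function names and regexes are illustrative, not SpiderSel's actual internals.

```python
import re
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver():
    opts = Options()
    opts.add_argument("--headless=new")  # render JS without a visible window
    return webdriver.Chrome(options=opts)

def crawl(url, depth=1, min_length=4):
    # Breadth-first crawl up to `depth` levels, staying on the same host.
    driver = make_driver()
    site = urlparse(url).netloc
    seen, queue, keywords = set(), [(url, 0)], set()
    while queue:
        current, level = queue.pop(0)
        if current in seen or level > depth:
            continue
        seen.add(current)
        driver.get(current)        # the browser executes any JavaScript
        html = driver.page_source  # post-render source, unlike a plain HTTP fetch
        # collect candidate keywords of at least `min_length` letters
        keywords.update(re.findall(r"[A-Za-z]{%d,}" % min_length, html))
        # queue same-site links found in the rendered source
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(current, href)
            if urlparse(link).netloc == site:
                queue.append((link, level + 1))
    driver.quit()
    return keywords
```

Error handling and the nonsense filter are omitted here for brevity.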
usage: spidersel.py [-h] --url URL [--depth DEPTH] [--min-length MIN_LENGTH] [--lowercase] [--include-emails]
Web Crawler and Keyword Extractor
options:
  -h, --help            show this help message and exit
  --url URL             URL of the website to crawl
  --depth DEPTH         Depth of subpage spidering (default: 1)
  --min-length MIN_LENGTH
                        Minimum keyword length (default: 4)
  --lowercase           Convert all keywords to lowercase
  --include-emails      Include emails as keywords
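For reference, a CLI like the one above maps naturally onto argparse. The snippet below is a hedged reconstruction from the help output; flag names and defaults follow it, but the actual wiring in spidersel.py may differ.

```python
import argparse

parser = argparse.ArgumentParser(description="Web Crawler and Keyword Extractor")
parser.add_argument("--url", required=True, help="URL of the website to crawl")
parser.add_argument("--depth", type=int, default=1,
                    help="Depth of subpage spidering (default: 1)")
parser.add_argument("--min-length", type=int, default=4,
                    help="Minimum keyword length (default: 4)")
parser.add_argument("--lowercase", action="store_true",
                    help="Convert all keywords to lowercase")
parser.add_argument("--include-emails", action="store_true",
                    help="Include emails as keywords")
args = parser.parse_args()
```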
docker run -v ${PWD}:/app/results --rm l4rm4nd/spidersel:latest --url https://www.apple.com --lowercase --include-emails
You will find your scan results in the current directory.
If you don't trust my image on Docker Hub, you can build it yourself:
git clone https://github.com/Haxxnet/SpiderSel && cd SpiderSel
docker build -t spidersel .
docker run -v ${PWD}:/app/results --rm spidersel --url https://www.apple.com --lowercase --include-emails
# clone repository and change directory
git clone https://github.com/Haxxnet/SpiderSel && cd SpiderSel
# optionally install Google Chrome if it is not already available
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
# install python dependencies; optionally use a virtual environment (e.g. virtualenv, pipenv, etc.)
pip3 install -r requirements.txt
python3 spidersel.py --url https://www.apple.com/ --lowercase --include-emails
The extracted keywords will be stored in an output file within the results folder.
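Conceptually, the storage step just writes the two result sets to text files under ./results. The sketch below shows one plausible way to do that; the filename scheme is an assumption, not SpiderSel's actual naming.

```python
from pathlib import Path
from urllib.parse import urlparse

def store_results(url, keywords, ignored):
    # Write keywords and ignored strings to separate txt files under ./results.
    # The <host>_*.txt naming is hypothetical; SpiderSel may name files differently.
    out_dir = Path("results")
    out_dir.mkdir(exist_ok=True)
    host = urlparse(url).netloc
    (out_dir / f"{host}_keywords.txt").write_text("\n".join(sorted(keywords)))
    (out_dir / f"{host}_ignored.txt").write_text("\n".join(sorted(ignored)))
```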