This project crawls a given URL or an entire domain and collects the desired information from the website.
- Python
- Web libraries and modules: BeautifulSoup, Selenium, requests, re (regular expressions), shutil, argparse, random, webdriver_manager.chrome, colorama
- The main script is `complete_web_crawler.py`.
- The remaining scripts are parts of the program, showing the individual functions used in the main script.
- `--url <URL>`: the URL to crawl
- `--depth <N>`: how many levels of links to follow
- `--emails <0|1>`: `1` scrapes email addresses as well; `0` skips email scraping
- `--headers <0|1>`: `1` collects headers as well; `0` skips them
- `--phoneno <0|1>`: `1` scrapes phone numbers as well; `0` skips them
- `--imagelinks <0|1>`: `1` collects image links as well; `0` skips them

If an option is not supplied, it defaults to `1`.
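As a rough illustration (not the project's actual code), the flags above could be parsed with `argparse` like this; the flag names match this README, and each toggle defaults to `1` as described:

```python
# Hypothetical sketch of the CLI parsing; flag names follow the README,
# toggles default to 1 (enabled) when not supplied.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Simple web crawler")
    parser.add_argument("--url", required=True, help="URL to crawl")
    parser.add_argument("--depth", type=int, default=1, help="crawl depth")
    # Each feature toggle accepts 0 or 1 and defaults to 1
    for flag in ("--emails", "--headers", "--phoneno", "--imagelinks"):
        parser.add_argument(flag, type=int, choices=(0, 1), default=1)
    return parser.parse_args(argv)

args = parse_args(["--url", "https://example.com", "--imagelinks", "0"])
print(args.depth, args.emails, args.imagelinks)  # → 1 1 0
```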
- Example: `python complete_web_crawler.py --url https://ctftime.org --depth 1 --headers 1 --phoneno 1 --imagelinks 0`
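The crawling logic itself is not shown in this README. As a minimal sketch (not the project's actual implementation), a depth-limited crawl with `requests` and `BeautifulSoup` might look like the following, with a simple email regex standing in for the information being collected; the function names here are hypothetical:

```python
# Minimal depth-limited crawl sketch using requests + BeautifulSoup.
# The email regex is a simplified example of the data being scraped.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(html):
    """Return the set of email-like strings found in the HTML."""
    return set(EMAIL_RE.findall(html))

def crawl(url, depth, seen=None):
    """Fetch `url`, collect emails, and follow links until `depth` runs out."""
    seen = seen if seen is not None else set()
    if depth < 0 or url in seen:
        return {}
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return {}
    results = {url: extract_emails(html)}
    # Resolve relative links against the current page and recurse one level deeper
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        results.update(crawl(urljoin(url, a["href"]), depth - 1, seen))
    return results
```

The `seen` set prevents revisiting pages when the site links back to itself, which is essential for any crawl depth above 0.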
- *** To make the screenshot function work properly, you need to install Google Chrome in your virtual machine.
- Commands are: