A small crawler that performs breadth-first crawling starting from a set of seed links.
Fetched pages whose URLs match the configured pattern are saved to the output directory.
Run the crawler with a configuration file:

python mini_spider.py -c spider.conf
[spider]
url_list_file: urls        # Path to the seed file
output_directory: output   # Directory where fetched pages are stored
max_depth: 1               # Maximum crawl depth (seed pages are level 0)
crawl_interval: 1          # Crawl interval, in seconds
crawl_timeout: 1           # Crawl timeout, in seconds
target_url: (https?|ftp|file)://[-A-Za-z0-9+&@#/%%?=~_|!:,.;]+[-A-Za-z0-9 +&@#/%%=~_|]   # Regular expression a URL must match for the page to be saved
thread_count: 8            # Number of crawler threads
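A minimal sketch of loading this file with Python's standard configparser module (an assumption; the project may use a different loader). Inline # comments are only stripped when inline_comment_prefixes is set, and with the default interpolation the %% in target_url reads back as a single % character:

```python
# config_example.py -- illustrative sketch, not the project's actual loader
import configparser

def load_config(path="spider.conf"):
    """Read the [spider] section and return its options with proper types."""
    # inline_comment_prefixes lets the trailing "# ..." comments be stripped from values.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read(path)
    section = parser["spider"]
    return {
        "url_list_file": section.get("url_list_file"),
        "output_directory": section.get("output_directory"),
        "max_depth": section.getint("max_depth"),
        "crawl_interval": section.getfloat("crawl_interval"),
        "crawl_timeout": section.getfloat("crawl_timeout"),
        # With default interpolation, "%%" in the file becomes a literal "%".
        "target_url": section.get("target_url"),
        "thread_count": section.getint("thread_count"),
    }

if __name__ == "__main__":
    print(load_config())
```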
Each line of the seed file contains one link, for example:
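Purely for illustration (the real seeds depend on what you want to crawl), a seed file could contain:

```
http://www.example.com
https://www.example.org/index.html
```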
Main features:

- Command-line argument handling, including -h (help), -v (version), and -c (configuration file); see the argparse sketch after this list.
- A failure to crawl or parse a single page does not terminate the program: the cause of the error is logged and crawling continues (failure handling is shown in the worker-thread sketch below).
- Exits gracefully once all crawling tasks are finished.
- Resolves both relative and absolute paths when extracting links from HTML (see the link-extraction sketch below).
- Handles pages in different character encodings, such as UTF-8 or GBK (see the decoding sketch below).
- Saves each fetched page to its own file, named after the URL with special characters escaped (see the file-naming sketch below).
- Supports multi-threaded parallel fetching (see the worker-thread sketch below).
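A minimal sketch of the command-line handling described above, assuming Python's standard argparse module and a placeholder version string; the actual option wiring in mini_spider.py may differ:

```python
# cli_example.py -- illustrative sketch of -h / -v / -c handling
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="mini spider: a small breadth-first web crawler")
    # -h/--help is added automatically by argparse.
    parser.add_argument("-v", "--version", action="version",
                        version="mini_spider 1.0")  # placeholder version string
    parser.add_argument("-c", "--config", default="spider.conf",
                        help="path to the configuration file")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print("using config file:", args.config)
```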
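One way to resolve relative and absolute links while parsing HTML, sketched with the standard html.parser and urllib.parse modules (the project may use a different parser):

```python
# links_example.py -- illustrative sketch of extracting and normalizing links
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href/src attributes and resolve them against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin leaves absolute URLs untouched and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))

if __name__ == "__main__":
    extractor = LinkExtractor("http://www.example.com/a/")
    extractor.feed('<a href="../b.html">b</a> <img src="http://www.example.org/c.png">')
    print(extractor.links)  # ['http://www.example.com/b.html', 'http://www.example.org/c.png']
```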
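A sketch of decoding a fetched page whose encoding may be UTF-8, GBK, or declared by the server; the fallback order used here is an assumption:

```python
# decode_example.py -- illustrative sketch of handling different page encodings
def decode_page(raw_bytes, declared_encoding=None):
    """Try the encoding declared by the server first, then common fallbacks."""
    candidates = [declared_encoding, "utf-8", "gbk"]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw_bytes.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(decode_page("中文".encode("gbk")))  # decoded via the GBK fallback
```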
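Deriving a file name from a URL can be done by percent-escaping characters that are unsafe in file names, for example with urllib.parse.quote; the choice of safe characters below is an assumption:

```python
# save_example.py -- illustrative sketch of URL-to-filename escaping
import os
from urllib.parse import quote

def save_page(url, content, output_directory="output"):
    """Write the page content to <output_directory>/<escaped-url>."""
    os.makedirs(output_directory, exist_ok=True)
    # quote() escapes '/', ':', '?' and other characters that are unsafe in file names.
    file_name = quote(url, safe="")
    path = os.path.join(output_directory, file_name)
    with open(path, "wb") as handle:
        handle.write(content)
    return path

if __name__ == "__main__":
    print(save_page("http://www.example.com/index.html?x=1", b"<html></html>"))
```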
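A sketch of the multi-threaded fetch loop: worker threads pull URLs from a shared queue, log failures without exiting, and the main thread waits for the queue to drain before shutting down. The fetch details (depth tracking, URL filtering, crawl interval) are simplified here:

```python
# workers_example.py -- illustrative sketch of multi-threaded crawling with graceful shutdown
import logging
import queue
import threading
import urllib.request

logging.basicConfig(level=logging.INFO)
task_queue = queue.Queue()

def worker(crawl_timeout=1.0):
    while True:
        url = task_queue.get()
        try:
            with urllib.request.urlopen(url, timeout=crawl_timeout) as response:
                logging.info("fetched %s (%d bytes)", url, len(response.read()))
        except Exception:
            # A single failed page is logged and skipped; the crawler keeps running.
            logging.exception("failed to fetch %s", url)
        finally:
            task_queue.task_done()

def crawl(seed_urls, thread_count=8):
    for _ in range(thread_count):
        threading.Thread(target=worker, daemon=True).start()
    for url in seed_urls:
        task_queue.put(url)
    # Block until every queued task is done, then return; daemon threads stop with the process.
    task_queue.join()

if __name__ == "__main__":
    crawl(["http://www.example.com"])
```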