A robust and efficient web crawler written in Go (Golang). This project aims to provide a powerful tool for crawling and scraping web pages, collecting data such as page titles, links, descriptions, keywords, and response codes.
- Multithreaded crawling
- Customizable crawling depth
- Respect for `robots.txt` (URL filtering and crawling delay)
- Configurable delay between requests
- Bulk saving of crawl results
- Export to JSON and CSV files
- and more...
You can use a pre-built binary for your OS from the releases page.
To install from the source code, you need to have Go installed on your machine. If you don't have Go installed, you can download it from the official website.
- Clone the repository:

  ```bash
  git clone https://github.com/demyanovs/urlcrawler.git
  cd urlcrawler
  ```

- Build the project:

  ```bash
  go build -o urlcrawler main.go
  ```
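If you need a binary for a different platform, Go's standard cross-compilation environment variables can be used. This is a general Go technique, not something specific to this project; adjust the target OS and architecture as needed:

```bash
# Example: build a Linux amd64 binary from any host using Go's
# standard GOOS/GOARCH cross-compilation variables.
GOOS=linux GOARCH=amd64 go build -o urlcrawler main.go
```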
This web crawler can be used directly from the command line after installation. It is configured through various flags that allow you to control its behavior.
The following are the primary command-line options available for the web crawler:
- `-u` (required): Specifies the starting URL for the crawler.
- `-depth`: Sets the maximum depth of crawling relative to the starting URL. Default is `0` (infinite).
- `-delay`: Determines the delay between requests in milliseconds to manage load on the server. Default is `1000`.
- `-output`: Specifies the output format for the crawl results. Supported formats are `csv` and `json`. Default is `csv`.
- `-output-file`: Specifies the file path to save the crawl results. Default is `results.csv`.
- `-limit`: Specifies the maximum number of pages to crawl. Default is `0` (unlimited).
- `-timeout`: Specifies the maximum time in milliseconds to wait for a response. Default is `5000`.
- `-bulk-size`: Specifies the number of pages to save in each bulk write operation. Default is `30`.
- `-q`: Quiet mode; suppresses all output except for errors. Default is `false`.
- `-ignore-robots`: Ignores `robots.txt` rules. Default is `false`.
- `-queue-len`: Specifies the number of parallel workers to use. Default is `50`.
Basic usage:

```bash
./urlcrawler -u=https://example.com
```
With depth 2 and limit of 10 URLs:
```bash
./urlcrawler -u=https://example.com -depth=2 -limit=10
```
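The flags listed above can be combined freely. As an illustrative example (not a prescribed workflow), the following exports results as JSON to a custom file, halves the request delay, and suppresses non-error output:

```bash
# Illustrative combination of the documented flags:
# JSON output, custom output file, 500 ms delay, quiet mode.
./urlcrawler -u=https://example.com -delay=500 -output=json -output-file=results.json -q
```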
For help, run:

```bash
./urlcrawler -h
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
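Assuming the repository follows the usual Go project layout, the test suite can typically be run with the standard Go tooling:

```bash
# Run all tests in the module (standard Go tooling).
go test ./...
```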