A concurrent Go crawler. It depends on the following packages:
- coop:
go get github.com/rakyll/coop
- boltdb:
go get github.com/boltdb/bolt
- progress-bar:
go get github.com/cheggaaa/pb
- go-command-line-tool:
go get github.com/codegangsta/cli
The crawler reads the Alexa top 1 million (alexa-1m) website list, crawls each site, and saves the results in a BoltDB database. A cron job runs the crawler daily to collect the data. The steps to run the crawler are given below.
- Load the sites to crawl from the .csv.gz file:
cd crawler/go/src/goalexa/
go run main.go load.go goalexa.go cache
- Start crawling. By default the crawler uses 100 parallel jobs, but you can set a different number with -j JOBS (up to 256):
go run main.go load.go goalexa.go start -j 100