A web crawler frontier implementation in TypeScript backed by Redis. MicroFrontier is a scalable and distributed frontier implemented through Redis Queues.
- Fast Ingestion & High throughput
- Easy-to-use HTTP microservice or JavaScript client
- Multiple configurable priority queues
- Customizable stochastic function for priority queue picking
- Politeness Policy: Per-Hostname crawl rate limit or default fallback delay
- Multi-processing & concurrency support
- Prioritization Strategy: breadth-first crawl, depth-first crawl, PageRank, etc. - TODO
- URL Re-visit policy - TODO
- URL canonicalization and Bloom filtering - TODO
- URL Selection Policy - TODO
MicroFrontier is inspired by the Mercator frontier [1].
The frontier essentially answers a simple question: "What URL should I crawl next?". This seems like a simple problem until you realize that you have to consider a lot of factors:
- Multiple crawlers should be able to work concurrently without overlapping
- You have to be polite with websites (DDoSing a website isn't fun)
- You have to visit a web page just once, or once in a while
- Some pages are important enough to be crawled early on, while others are just spider traps
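The politeness factor in the list above can be illustrated with a small in-memory sketch. This is not MicroFrontier's Redis-backed implementation: `PolitenessGate`, `tryAcquire`, and the per-host delay map are hypothetical names introduced only for illustration, and the current time is passed in explicitly so the behaviour is easy to follow.

```typescript
// Hypothetical helper for illustration only; not part of MicroFrontier's API.
// It tracks, per hostname, the earliest time at which the next request is allowed.
class PolitenessGate {
  private nextAllowed = new Map<string, number>();
  private defaultDelayMs: number;
  private perHostDelayMs: Record<string, number>;

  constructor(defaultDelayMs: number, perHostDelayMs: Record<string, number> = {}) {
    this.defaultDelayMs = defaultDelayMs;
    this.perHostDelayMs = perHostDelayMs;
  }

  // Returns true (and reserves the slot) if the URL's host may be crawled at `now` (ms).
  tryAcquire(url: string, now: number): boolean {
    const host = new URL(url).hostname;
    const readyAt = this.nextAllowed.get(host) ?? 0;
    if (now < readyAt) return false; // host is still cooling down

    // Use the per-host delay when one is configured, else the default fallback.
    const delay = this.perHostDelayMs[host] ?? this.defaultDelayMs;
    this.nextAllowed.set(host, now + delay);
    return true;
  }
}
```

A crawler would call `tryAcquire(url, Date.now())` before fetching and skip or re-queue the URL when it returns `false`. A distributed frontier has to keep this state in shared storage rather than in process memory, which is where Redis comes in.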
Since I couldn't find a lightweight multipurpose frontier implementation, I made MicroFrontier, hoping it could help researchers in the field.
MicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker instance working as a microservice.
Install microfrontier with:
npm i -g microfrontier
Run microfrontier:
microfrontier --host localhost --port 8090 --redis:host localhost --redis:port 6379
# See the configuration section below for additional parameters
npm i microfrontier
# or
yarn add microfrontier
See the examples below for using the JavaScript client.
docker pull adileo/microfrontier
You can configure the Docker instance with the environment variables described below.
ENV VAR | CLI PARAMS | Description |
---|---|---|
host | --host | Host name the microservice HTTP server listens on. Default value: 127.0.0.1 |
port | --port | Port the microservice HTTP server listens on. Default value: 8090 |
redis_host | --redis:host | Redis server host. Default value: 127.0.0.1 |
redis_port | --redis:port | Redis server port. Default value: 6379 |
redis_* | --redis:* | Parameters are interpreted by nconf and passed to ioredis as the client config. |
config_frontierName | --config:frontierName | Prefix used for Redis keys. |
config_* | --config:* | Parameters are interpreted by nconf; an example of the default values is shown below. |
{
frontierName: 'frontier', // Example ENV: config_frontierName=frontier
priorities: { // Example ENV: config_priorities={"high":{"probability":0.6},...}
'high': {probability: 0.6},
'normal': {probability: 0.3},
'low': {probability: 0.1},
},
defaultCrawlDelay: 1000 // Example ENV: config_defaultCrawlDelay=1000
}
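To make the `probability` values above more concrete, here is a minimal sketch of how a stochastic pick between priority queues could work. It is an assumption about the general technique (weighted random selection), not a copy of MicroFrontier's internal function; `pickPriority` and the injectable `rand` parameter are hypothetical names.

```typescript
// Shape of the `priorities` object from the configuration example.
type Priorities = Record<string, { probability: number }>;

// Hypothetical helper: picks a queue name with likelihood proportional to its probability.
// `rand` is injectable so the behaviour can be checked deterministically.
function pickPriority(priorities: Priorities, rand: () => number = Math.random): string {
  const entries = Object.entries(priorities);
  if (entries.length === 0) throw new Error("no priorities configured");

  // Normalise by the total so the weights do not have to sum exactly to 1.
  const total = entries.reduce((sum, [, p]) => sum + p.probability, 0);
  let threshold = rand() * total;

  let chosen = "";
  for (const [name, { probability }] of entries) {
    chosen = name; // keeps the last name as a fallback against floating-point rounding
    threshold -= probability;
    if (threshold < 0) break;
  }
  return chosen;
}
```

With the defaults shown above, roughly 60% of the picks would land on `high`, 30% on `normal`, and 10% on `low`, so low-priority URLs are still served occasionally instead of being starved.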
Via HTTP
curl --location --request POST 'http://127.0.0.1:8090/frontier' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "http://www.example.com",
"priority": "normal",
"meta": {
"foo": "bar"
}
}'
Via SDK
import { URLFrontier } from "microfrontier"
const frontier = new URLFrontier(config)
frontier.add("http://www.example.com", "normal", {"foo": "bar"}).then(() => {
console.log('URL added')
})
curl --location --request GET 'http://127.0.0.1:8090/frontier'
import { URLFrontier } from "microfrontier"
const frontier = new URLFrontier(config)
frontier.get().then((item) => {
// {url: "http://www.example.com", meta: {"foo":"bar"}}
})
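The `add` and `get` calls shown above can be combined into a simple worker loop. The sketch below is only an illustration: `FrontierLike` is a hypothetical interface that mirrors the two SDK calls from the examples, `drainFrontier` and `handle` are made-up names, and it assumes that `get()` resolves to a falsy value when no URL is ready, which the examples above do not confirm.

```typescript
// Item shape taken from the get() example: {url, meta}.
interface FrontierItem {
  url: string;
  meta?: Record<string, unknown>;
}

// Hypothetical interface mirroring the add/get calls used in the SDK examples.
interface FrontierLike {
  add(url: string, priority: string, meta?: Record<string, unknown>): Promise<unknown>;
  get(): Promise<FrontierItem | null | undefined>;
}

// Pulls up to `maxItems` URLs, hands each to `handle`, and feeds discovered links back.
async function drainFrontier(
  frontier: FrontierLike,
  handle: (item: FrontierItem) => Promise<string[]>,
  maxItems: number,
): Promise<number> {
  let processed = 0;
  while (processed < maxItems) {
    const item = await frontier.get();
    if (!item) break; // assumption: a falsy result means nothing is ready right now

    // `handle` fetches/parses the page and returns the links it discovered.
    const discovered = await handle(item);
    for (const url of discovered) {
      await frontier.add(url, "normal", { parent: item.url });
    }
    processed++;
  }
  return processed;
}
```

Several workers could run this loop against the same frontier concurrently, since deciding which URL comes next is left entirely to the frontier.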
Implemented, documentation WIP
[1]: High-Performance Web Crawling - Marc Najork, Allan Heydon