A lightweight crawler frontier implementation in TypeScript using Redis.

adileo/MicroFrontier

MicroFrontier (License: GPL v3)

MicroFrontier is a scalable, distributed web crawler frontier written in TypeScript and backed by Redis queues.

  • Fast Ingestion & High throughput
  • Easy-to-use HTTP microservice or JavaScript client
  • Multiple configurable priority queues
  • Customizable stochastic function for priority queue picking
  • Politeness Policy: Per-Hostname crawl rate limit or default fallback delay
  • Multi-processing & concurrency support
  • Prioritization Strategy: breadth-first crawl, depth-first crawl, PageRank, etc. - TODO
  • URL Re-visit policy - TODO
  • URL canonicalization and Bloom filtering - TODO
  • URL Selection Policy - TODO

MicroFrontier is inspired by the Mercator frontier [1].

[Figure: Queue]

Why do you need MicroFrontier?

The frontier essentially answers a simple question: "What URL should I crawl next?" This seems like a simple problem until you realize how many factors you have to consider:

  • Multiple crawlers should be able to work concurrently without overlapping
  • You have to be polite to websites (DDoSing a website isn't fun)
  • You have to visit each web page just once, or only once in a while
  • Some pages are important enough to be crawled early on, while others are just spider traps
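
To make the politeness point concrete, here is a minimal in-memory sketch of a per-hostname crawl delay. It is an illustration only, not MicroFrontier's Redis-backed implementation: the class name `PolitenessTracker`, its method `tryAcquire`, and the per-host delay map are hypothetical names chosen for this example.

```typescript
// Illustrative only: a single-process tracker, not MicroFrontier's implementation.
class PolitenessTracker {
  // hostname -> earliest timestamp (ms) at which the host may be fetched again
  private nextAllowed = new Map<string, number>();
  private defaultDelayMs: number;
  private perHostDelayMs: Record<string, number>;

  constructor(defaultDelayMs: number, perHostDelayMs: Record<string, number> = {}) {
    this.defaultDelayMs = defaultDelayMs;
    this.perHostDelayMs = perHostDelayMs;
  }

  // Returns true and reserves the next slot if the URL's host may be crawled at `now`.
  // Returns false if the host was fetched too recently.
  tryAcquire(url: string, now: number = Date.now()): boolean {
    const host = new URL(url).hostname;
    const allowedAt = this.nextAllowed.get(host) ?? 0;
    if (now < allowedAt) return false;
    // Use the host-specific delay if one is configured, else the default fallback.
    const delay = this.perHostDelayMs[host] ?? this.defaultDelayMs;
    this.nextAllowed.set(host, now + delay);
    return true;
  }
}
```

A distributed frontier has to keep this state somewhere all workers can see it, which is one reason a shared store such as Redis is a natural fit.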

Since I couldn't find a lightweight, multipurpose frontier implementation, I made MicroFrontier, hoping it could help researchers in the field.

Usage

MicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker instance working as a microservice.

Command line usage

Install microfrontier with:

npm i -g microfrontier

Run microfrontier:

microfrontier --host localhost --port 8090 --redis:host localhost --redis:port 6379
# See the Configuration section below for additional parameters

As a JavaScript library

npm i microfrontier

# or

yarn add microfrontier

See the examples below for using the JavaScript client.

Docker

docker pull adileo/microfrontier

You can configure the Docker instance with the environment variables described below.

Configuration

ENV VAR | CLI PARAM | Description
host | --host | Host name on which the microservice HTTP server listens. Default value: 127.0.0.1
port | --port | Port on which the microservice HTTP server listens. Default value: 8090
redis_host | --redis:host | Redis server host. Default value: 127.0.0.1
redis_port | --redis:port | Redis server port. Default value: 6379
redis_* | --redis:* | Parameters are interpreted by nconf and passed to ioredis as the client config.
config_frontierName | --config:frontierName | Prefix used for Redis keys.
config_* | --config:* | Parameters are interpreted by nconf. An example of the default values is shown below.

{
    frontierName: 'frontier', // Example ENV: config_frontierName=frontier
    priorities: { // Example ENV: config_priorities={"high":{"probability":0.6},...}
        'high':     {probability: 0.6},
        'normal':   {probability: 0.3},
        'low':      {probability: 0.1},
    },
    defaultCrawlDelay: 1000 // Example ENV: config_defaultCrawlDelay=1000
}
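
The `probability` values above suggest that each pick of a priority queue is a weighted random draw. The sketch below shows one common way to do such a draw. It is an assumption about the general technique, not MicroFrontier's actual picking function; `pickPriority` and the injectable `rand` parameter are hypothetical names, and details such as skipping empty queues are not modeled.

```typescript
type Priorities = Record<string, { probability: number }>;

// Picks a priority name with a chance proportional to its probability.
// `rand` is injectable so the draw can be made deterministic in tests.
function pickPriority(priorities: Priorities, rand: () => number = Math.random): string {
  const entries = Object.entries(priorities);
  if (entries.length === 0) throw new Error("no priorities configured");

  // Scale by the total so the weights do not have to sum to exactly 1.
  const total = entries.reduce((sum, [, p]) => sum + p.probability, 0);
  let threshold = rand() * total;

  let chosen = "";
  for (const [name, p] of entries) {
    chosen = name;
    threshold -= p.probability;
    if (threshold < 0) break;
  }
  // If rounding left a tiny remainder, `chosen` is simply the last queue.
  return chosen;
}

// Same weights as the default configuration shown above.
const priorities: Priorities = {
  high: { probability: 0.6 },
  normal: { probability: 0.3 },
  low: { probability: 0.1 },
};

console.log(pickPriority(priorities));
```

With these weights, roughly six out of ten picks should land on the `high` queue over a long run, while `low` is still visited occasionally and so does not starve.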

How to

Adding a URL to the frontier

Via HTTP

curl --location --request POST 'http://127.0.0.1:8090/frontier' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "http://www.example.com",
    "priority": "normal",
    "meta": {
        "foo": "bar"
    }
}'
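
The same request can be issued from TypeScript without the SDK. The sketch below mirrors the curl call above: the endpoint path, header, and body fields come from that call, while the helper names (`buildAddRequest`, `addUrl`, `FetchLike`) are hypothetical. The response body format is not documented here, so the sketch only checks the HTTP status.

```typescript
interface AddRequest {
  url: string;
  priority: string;
  meta?: Record<string, unknown>;
}

interface RequestInitLike {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal subset of fetch() that this sketch relies on.
type FetchLike = (url: string, init: RequestInitLike) => Promise<{ ok: boolean; status: number }>;

// Builds the arguments for POST /frontier, mirroring the curl call.
function buildAddRequest(baseUrl: string, item: AddRequest): { endpoint: string; init: RequestInitLike } {
  return {
    // Strip trailing slashes so the path is joined cleanly.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/frontier`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(item),
    },
  };
}

// Sends the request with the supplied fetch implementation and
// rejects on any non-2xx status.
async function addUrl(baseUrl: string, item: AddRequest, fetchFn: FetchLike): Promise<void> {
  const { endpoint, init } = buildAddRequest(baseUrl, item);
  const res = await fetchFn(endpoint, init);
  if (!res.ok) throw new Error(`frontier responded with HTTP ${res.status}`);
}
```

In an environment with a global `fetch` (for example Node 18+), it can be passed in directly: `addUrl("http://127.0.0.1:8090", { url: "http://www.example.com", priority: "normal", meta: { foo: "bar" } }, fetch)`.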

Via SDK

import { URLFrontier } from "microfrontier"

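// `config` is assumed to be defined elsewhere; see the Configuration section for the available options
const frontier = new URLFrontier(config)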

frontier.add("http://www.example.com", "normal", {"foo": "bar"}).then(() => {
    console.log('URL added')
})

Getting a URL from the frontier

Via HTTP

curl --location --request GET 'http://127.0.0.1:8090/frontier'

Via SDK

import { URLFrontier } from "microfrontier"

const frontier = new URLFrontier(config)

frontier.get().then((item) => {
    // {url: "http://www.example.com", meta: {"foo":"bar"}}
})
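
Combining `get()` and `add()` gives the basic shape of a crawler worker. The loop below is a hedged sketch rather than an official usage pattern. It is written against a small `FrontierLike` interface (a hypothetical name) so that it does not depend on the exact SDK types, and it assumes that `get()` resolves to `null` or `undefined` when no URL is currently available, which is not confirmed here.

```typescript
interface FrontierItem {
  url: string;
  meta?: Record<string, unknown>;
}

// Minimal shape of the client used here; only add() and get() are shown in this README.
interface FrontierLike {
  add(url: string, priority: string, meta?: Record<string, unknown>): Promise<unknown>;
  // ASSUMPTION: resolves to null/undefined when nothing is ready to crawl.
  get(): Promise<FrontierItem | null | undefined>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Pulls URLs until `maxItems` have been processed or the frontier has been
// idle for `maxIdleRounds` consecutive polls. `handle` fetches one page and
// returns the links it discovered, which are re-enqueued at "normal" priority.
async function crawlLoop(
  frontier: FrontierLike,
  handle: (item: FrontierItem) => Promise<string[]>,
  opts: { maxItems: number; idleMs: number; maxIdleRounds: number },
): Promise<number> {
  let processed = 0;
  let idleRounds = 0;
  while (processed < opts.maxItems && idleRounds < opts.maxIdleRounds) {
    const item = await frontier.get();
    if (!item) {
      // Nothing available right now: back off briefly and poll again.
      idleRounds++;
      await sleep(opts.idleMs);
      continue;
    }
    idleRounds = 0;
    const links = await handle(item);
    for (const link of links) {
      await frontier.add(link, "normal");
    }
    processed++;
  }
  return processed;
}
```

Several such loops can run in separate processes against the same frontier, since deciding which worker receives which URL is the frontier's job.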

Per Hostname Rate-Limit

Implemented, documentation WIP

Scaling the frontend queue workers

Implemented, documentation WIP

Getting the number of enqueued URLs (for a hostname)

Implemented, documentation WIP


Citations

[1]: High-Performance Web Crawling - Marc Najork, Allan Heydon