This is a CLI tool for recursively crawling websites.
It:
- discovers links and recursively follows them
- adds crawled pages (URL, content) to a MySQL database for further processing
Features:
- recursive, suitable for crawling/spidering an entire website/domain
- respects the robots.txt standard (https://en.wikipedia.org/wiki/Robots_exclusion_standard)
- keeps an in-memory LRU cache of discovered links so the database is not hit for every link (see the sketch after this list)
- auto-restarts the crawl session to work around memory leaks
- uses roughly 240 MB of RAM per typical crawl session
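To make the LRU idea concrete, here is a minimal sketch of how such a link cache can gate database lookups. It is illustrative only: the class, its default size, and the example URL are assumptions, not this repository's actual code (the real cache size is configured in lib/constants.js).

```js
// Minimal sketch of the idea behind the in-memory LRU for discovered links.
// Illustrative only; the crawler's real cache lives in its own code.
class LinkCache {
  constructor(maxSize = 5000) {
    this.maxSize = maxSize;
    this.map = new Map(); // Map keeps insertion order, so the first key is the oldest
  }

  // Returns true if the URL was already seen (and refreshes its recency).
  has(url) {
    if (!this.map.has(url)) return false;
    this.map.delete(url);
    this.map.set(url, true); // move to the "most recently used" end
    return true;
  }

  add(url) {
    if (this.map.has(url)) this.map.delete(url);
    this.map.set(url, true);
    if (this.map.size > this.maxSize) {
      // evict the least recently used entry
      this.map.delete(this.map.keys().next().value);
    }
  }
}

// Usage: only touch MySQL when the URL is not already in the cache.
const seen = new LinkCache();
if (!seen.has('https://example.com/page')) {
  seen.add('https://example.com/page');
  // ... check/insert the URL in the database here
}
```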
Clone this repository and follow these simple steps:
First, create an empty database (the crawler will create the tables automatically):
echo "CREATE DATABASE your_db_name" |mysql
Then edit lib/db/connect.js to match your MySQL setup (user, password, and database name).
Example connection string: mysql://user:password@localhost/your_db_name
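For orientation, the step above amounts to a connection module along these lines; the actual contents of lib/db/connect.js in this repository may differ, and the use of the mysql2 package here is an assumption.

```js
// lib/db/connect.js -- illustrative sketch only; the real file may use a
// different MySQL client or structure (mysql2/promise is an assumption).
const mysql = require('mysql2/promise');

// Export a shared connection pool matching the example connection string above.
module.exports = mysql.createPool({
  host: 'localhost',
  user: 'user',
  password: 'password',
  database: 'your_db_name',
});
```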
Now install the dependencies:
npm i
or
yarn
You're set!
Now run the crawler with:
npm run demo
You can hit Ctrl+C to stop crawling; give the script about 2 seconds to finish its exit routines.
Running npm run demo once more will resume the crawl.
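The exit routine boils down to a signal handler that finishes in-flight work before the process quits. The sketch below shows the general pattern; it is not this repository's actual shutdown code, and flushPendingWrites is a hypothetical placeholder.

```js
// Generic graceful-shutdown pattern; not this crawler's actual exit code.
async function flushPendingWrites() {
  // hypothetical: write any pages that were fetched but not yet stored in MySQL
}

process.on('SIGINT', async () => {
  console.log('Stopping crawl, finishing pending work...');
  await flushPendingWrites();
  process.exit(0); // a later `npm run demo` resumes from the URLs already in the DB
});
```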
The script will auto-restart itself every 100 URLs to work around a memory leak in cheerio.
See lib/constants.js for more settings: crawl delay, in-memory LRU cache size, user agent, and others.
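As a rough idea of what those settings look like, lib/constants.js could export something like the following; the key names and values here are assumptions, so check the actual file for the real ones.

```js
// Illustrative only -- the real key names and defaults live in lib/constants.js.
module.exports = {
  CRAWL_DELAY_MS: 1000,          // pause between requests
  LRU_CACHE_SIZE: 5000,          // max discovered links kept in memory
  USER_AGENT: 'my-crawler/1.0',  // sent with every request
  RESTART_EVERY_N_URLS: 100,     // auto-restart threshold mentioned above
};
```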
Please don't abuse the demo configuration; write your own (for example ./config/my_config.js) and run it with
node runner.js --preset my_config
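A custom preset might look something like this; the exact keys a preset must export depend on this repository, so treat the field names below as assumptions and use the demo config as a reference.

```js
// ./config/my_config.js -- illustrative sketch; field names are assumptions.
module.exports = {
  startUrl: 'https://example.com/',                        // where the crawl begins
  crawlDelayMs: 2000,                                      // be polite to the target site
  userAgent: 'my-crawler/1.0 (contact: you@example.com)',  // identify yourself
};
```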