Scrapy-Redis

Redis-based components for Scrapy.

Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
Documentation: https://github.com/rmax/scrapy-redis/wiki
Release: https://github.com/rmax/scrapy-redis/wiki/History
Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
LICENSE: MIT license

Features

Distributed crawling/scraping

You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

Distributed post-processing

Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue.

Scrapy plug-and-play components

Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

In this forked version: added support for JSON data in Redis

The data contains a url, `meta`, and other optional parameters. `meta` is nested JSON that holds sub-data. This feature extracts the data and sends a FormRequest with the url, the meta, and the additional parameters as formdata.

For example:
```
{ "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }
```
This data can be accessed in a Scrapy spider through the response, for example via `request.url`, `request.meta`, and `request.cookies`.
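To make the description above concrete, here is a minimal sketch of how such a JSON entry could be split into the three parts a FormRequest needs. The helper name `split_payload` is hypothetical and is not taken from the fork's source; it only illustrates the idea of treating `url` and `meta` specially and passing every remaining key as form data.

```python
import json


def split_payload(raw):
    """Split a JSON entry into (url, meta, formdata).

    Hypothetical helper for illustration; not the fork's actual code.
    """
    data = json.loads(raw)
    url = data.pop("url")          # required key
    meta = data.pop("meta", {})    # optional nested object
    # Everything left over is treated as form data; form values are sent as strings.
    formdata = {key: str(value) for key, value in data.items()}
    return url, meta, formdata


# The example entry from this README, as it would be stored in Redis.
raw = '{ "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }'
url, meta, formdata = split_payload(raw)
print(url, meta, formdata)
```

Inside a spider, the three values could then be passed along as `scrapy.FormRequest(url=url, meta=meta, formdata=formdata)`.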

Note

These features cover the basic case of distributing the workload across multiple workers. If you need more features such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
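As a rough illustration of the distributed post-processing feature, a worker might drain the shared items queue as sketched below. The key name `myspider:items` and the `InMemoryQueue` class are assumptions made for this example: the stand-in class only mimics the `blpop` call of a Redis client so that the sketch runs without a server.

```python
import json
from collections import deque


def drain_items(client, key, timeout=5):
    """Pop JSON-encoded items from a list until none arrives within `timeout` seconds."""
    items = []
    while True:
        # blpop returns a (key, value) pair, or None when the timeout expires.
        popped = client.blpop(key, timeout=timeout)
        if popped is None:
            break
        _key, raw = popped
        items.append(json.loads(raw))
    return items


class InMemoryQueue:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, values):
        self._values = deque(values)

    def blpop(self, key, timeout=0):
        if not self._values:
            return None
        return key, self._values.popleft()


# "myspider:items" is an assumed key name; use the key your item pipeline writes to.
fake = InMemoryQueue([b'{"title": "a"}', b'{"title": "b"}'])
items = drain_items(fake, "myspider:items")
print(items)
```

With a real server, a redis-py client such as `redis.Redis.from_url("redis://localhost:6379")` could be passed in place of the stand-in. Because each pop removes the item from the list, several workers can run this loop side by side without processing the same item twice.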

Requirements

Python 3.7+
Redis >= 5.0
Scrapy >= 2.0
redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install

Note

To use the JSON data feature, make sure you have not installed scrapy-redis through pip. If you already have, uninstall it first:

pip uninstall scrapy-redis
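Once the package is installed, the components listed under Features are usually switched on in a project's `settings.py`. The fragment below is a minimal sketch; the Redis URL is an assumption for a local server, and the pipeline priority `300` is an arbitrary choice. See the Usage page linked at the top for the full list of settings.

```python
# settings.py (sketch)

# Store the request queue in Redis so that several spider instances can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into a Redis list for separate post-processing workers.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumed connection string for a local Redis server; adjust for your deployment.
REDIS_URL = "redis://localhost:6379"
```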

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.
