
Added maximum idle waiting time MAX_IDLE_TIME_BEFORE_CLOSE. #193

Merged: 1 commit into rmax:master on May 8, 2021

Conversation

nieweiming (Contributor)

Added a maximum idle waiting time, MAX_IDLE_TIME_BEFORE_CLOSE.
Set MAX_IDLE_TIME_BEFORE_CLOSE in the settings to the maximum number of seconds to wait while idle.
If it is not set, or set to 0, the spider will wait forever.
MAX_IDLE_TIME_BEFORE_CLOSE does not affect the use of SCHEDULER_IDLE_BEFORE_CLOSE.
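
For illustration, a minimal settings.py sketch (the 300-second value is made up for the example; the SCHEDULER line is the standard scrapy-redis setting shown for context):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Close the spider after 5 minutes without new requests from Redis.
# Unset or 0 keeps the previous behaviour: wait forever.
MAX_IDLE_TIME_BEFORE_CLOSE = 300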

rmax (Owner) commented Apr 26, 2021

Thanks for taking the time to send the PR.

How is this different from using the SCHEDULER_IDLE_BEFORE_CLOSE setting? See https://github.com/rmax/scrapy-redis#usage

That feature uses a blocking Redis operation to wait for the next request:

block_pop_timeout = self.idle_before_close
request = self.queue.pop(block_pop_timeout)
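
For context, here is roughly what that pop does with the timeout, paraphrased from scrapy_redis.queue.FifoQueue (a sketch, not the verbatim source):

def pop(self, timeout=0):
    """Pop a request, blocking up to `timeout` seconds if it is positive."""
    if timeout > 0:
        # BLPOP blocks until an element arrives or the timeout expires.
        data = self.server.blpop(self.key, timeout)
        if isinstance(data, tuple):
            data = data[1]  # blpop returns a (key, value) pair
    else:
        data = self.server.lpop(self.key)  # non-blocking
    if data:
        return self._decode_request(data)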

nieweiming (Author) replied:

class RedisMixin(object):
    def setup_redis(self, crawler=None):
        ...
        self.server = connection.from_settings(crawler.settings)
        # The idle signal is called when the spider has no requests left,
        # that's when we will schedule new requests from redis queue
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        # TODO: While there is capacity, schedule a batch of redis requests.
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()
        raise DontCloseSpider

SCHEDULER_IDLE_BEFORE_CLOSE will not stop the crawler, because spider_idle always raises DontCloseSpider. So I would like the spider to be able to shut itself down after the queue has been idle for a period of time. Otherwise a crawl whose work is finished stays in the running state and keeps occupying resources.
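
A sketch of the approach (illustrative only, not the exact merged diff; the names max_idle_time and spider_idle_start_time are used here for illustration):

import time

from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class RedisMixin(object):
    def setup_redis(self, crawler=None):
        ...
        # 0 (the default) preserves the old behaviour of waiting forever.
        self.max_idle_time = crawler.settings.getint('MAX_IDLE_TIME_BEFORE_CLOSE', 0)
        self.spider_idle_start_time = time.time()
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)
            # New work arrived, so reset the idle clock.
            self.spider_idle_start_time = time.time()

    def spider_idle(self):
        """Schedules a request if available, otherwise waits up to
        max_idle_time seconds before letting the spider close."""
        self.schedule_next_requests()
        idle_time = time.time() - self.spider_idle_start_time
        if self.max_idle_time and idle_time >= self.max_idle_time:
            # Returning without raising DontCloseSpider lets Scrapy
            # close the spider normally.
            return
        raise DontCloseSpider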

rmax (Owner) left a review:

Thanks for the contribution. Just a few points on time_ns() and non-English comments.

Could you add tests to ensure further changes do not break this feature?

Here is a reference: https://github.com/rmax/scrapy-redis/blob/master/tests/test_spiders.py#L113
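
A hypothetical sketch of such a test (it assumes the PR stores the setting on a max_idle_time attribute, and mocks out the Redis connection so no live server is needed):

from unittest import mock

from scrapy.utils.test import get_crawler

from scrapy_redis.spiders import RedisSpider


def test_max_idle_time_before_close_setting():
    crawler = get_crawler(RedisSpider, {'MAX_IDLE_TIME_BEFORE_CLOSE': 10})
    # Avoid a real Redis connection during setup_redis().
    with mock.patch('scrapy_redis.connection.from_settings'):
        spider = RedisSpider.from_crawler(crawler, name='myspider')
    # The setting should end up on the spider (attribute name assumed).
    assert spider.max_idle_time == 10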

Two inline review comments on src/scrapy_redis/spiders.py (outdated, resolved).
rmax (Owner) commented Apr 26, 2021

Oh, please update the README too with this new setting 🚀

nieweiming reopened this on May 8, 2021 and requested a review from rmax at 10:08.
rmax (Owner) left a review:

Looks good to me! Thanks 🙏🏽

rmax merged commit 6067555 into rmax:master on May 8, 2021.