Make the errback handling method configurable #156
Conversation
The errback should never have been defaulted to the spider's `parse` method. Doing so contradicts what the Scrapy docs say, and there is no documentation on the Scrapy site saying that exceptions get sent to the `parse` method. This was discovered because the error handling in the `process_spider_exception` middleware hook was never getting called as the Scrapy docs said it should be. The workaround to restore the previous behavior with the `parse` method is to add `&errback=parse` to the request.
This will allow users to turn off the non-standard behavior of sending exceptions to the spider's `parse` method without introducing a breaking change to scrapyrt. It also adds some documentation of the existing behavior.
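For illustration, here is what opting into the legacy behavior looks like from a client — a minimal sketch assuming a scrapyrt instance on its default port and a spider named `example` (both placeholders):

```python
import requests

# Hypothetical local scrapyrt instance; spider_name and url are placeholders.
# Passing errback=parse explicitly restores the old behavior of routing
# download errors to the spider's parse method.
response = requests.get(
    "http://localhost:9080/crawl.json",
    params={
        "spider_name": "example",
        "url": "http://example.com/page",
        "errback": "parse",
    },
)
print(response.json())
```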
Looks good to me, and I agree that `parse` should never have been the default, but I don't see what this has to do with `process_spider_exception`. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.
If your goal is to catch download exceptions, you might want to look into `process_exception` from downloader middlewares instead of `process_spider_exception` from spider middlewares.
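For reference, a minimal sketch contrasting the two hooks (the signatures follow Scrapy's middleware interface; the logging bodies are illustrative):

```python
import logging

logger = logging.getLogger(__name__)


class MyDownloaderMiddleware:
    def process_exception(self, request, exception, spider):
        # Called when the download handler or another downloader middleware
        # raises (e.g. connection errors) -- the same class of errors an
        # errback receives.
        logger.warning("Download failed for %s: %r", request.url, exception)
        return None  # let other middlewares / the errback handle it


class MySpiderMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider callback (or an earlier spider middleware)
        # raises while processing a downloaded response.
        logger.warning("Callback failed for %s: %r", response.url, exception)
        return []  # swallow the exception and yield no results
```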
scrapyrt/core.py (outdated)

```python
if self.errback_name:
    errback = getattr(self.crawler.spider, self.errback_name)
    assert callable(errback), 'Invalid errback'
```
The problem with this is that, on bad parameters for the errback, it will fail silently on the API side and only write errors to the logs when the errback is not callable or the spider has no attribute matching `errback_name`. We could possibly handle this better by adding and validating the errback on CrawlManager init, and then wrapping line 254 in `run_crawl` to return HTTP 400 for certain types of exceptions, for example when an errback was provided but is not callable.
The line `getattr(self.crawler.spider, self.errback_name)` can also fail if the errback is not an attribute of the spider, so in that case I would also try to return a 400 status code, as this is a user error.
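A sketch of the suggested up-front validation, using a hypothetical `InvalidUserParameter` exception that the API layer would map to HTTP 400 (the names are illustrative, not scrapyrt's actual API):

```python
class InvalidUserParameter(Exception):
    """Hypothetical marker exception mapped to HTTP 400 by the API layer."""


def validate_errback(spider, errback_name):
    # Validate at CrawlManager init time instead of failing silently
    # later inside the crawl thread.
    if errback_name is None:
        return None
    errback = getattr(spider, errback_name, None)
    if errback is None:
        raise InvalidUserParameter(
            "Spider has no errback named %r" % errback_name)
    if not callable(errback):
        raise InvalidUserParameter(
            "Errback %r is not callable" % errback_name)
    return errback
```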
I'm not really sure we should keep backward compatibility for behavior that seems to be a bug. If `parse` should not be the default errback, then keeping it as the default is not helping people. We can add a warning about this to the new release.
It also needs a unit test, so please add one.
Thank you for providing this fix. I'm happy to merge it and release it this week.
Here is a cleanroom project where you can test it out. The exception is thrown in the middleware.
Currently the application does not report to the user when they provide an invalid errback or callback method. The scheduling of the request and the validation of the spider callback and errback happen in a different thread from the one handling the API request, so we need a mechanism other than simply raising the exception to communicate with the API request thread. We already do this for other errors and responses by adding properties to the CrawlManager object, so it seems best to also communicate this exception to the API request via a `user_error` property on the CrawlManager. The exception can then be raised in the context of the API request.
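A minimal sketch of that cross-thread handoff, reusing the hypothetical `validate_errback` and `InvalidUserParameter` from the sketch above (the real CrawlManager is more involved):

```python
class CrawlManager:
    def __init__(self, errback_name=None):
        self.errback_name = errback_name
        # Set from the crawl thread, read from the API request thread.
        self.user_error = None

    def spider_idle(self, spider):
        # We cannot raise across threads, so stash the error instead.
        try:
            validate_errback(spider, self.errback_name)
        except InvalidUserParameter as exc:
            self.user_error = exc

    def get_result(self):
        # Called in the API request thread; re-raise here so it can be
        # turned into an HTTP 400 response.
        if self.user_error is not None:
            raise self.user_error
```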
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@pawelmhm I added some code to report user errors that occur in the `spider_idle` method back to the API request, added a unit test, and modified existing unit tests. I'm also not sure whether we should keep backward compatibility for behavior that seems to be a bug, but I'll let you make that call (in that case we can just change the default from 'parse' to None). Let me know what you think.
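A unit test along these lines might look like the following — again exercising the hypothetical `validate_errback` helper from the earlier sketch, not scrapyrt's real test suite:

```python
import unittest


class TestErrbackValidation(unittest.TestCase):
    def test_missing_errback_raises_user_error(self):
        class Spider:
            def parse(self, response):
                pass

        with self.assertRaises(InvalidUserParameter):
            validate_errback(Spider(), "no_such_method")

    def test_none_errback_is_allowed(self):
        class Spider:
            pass

        self.assertIsNone(validate_errback(Spider(), None))


if __name__ == "__main__":
    unittest.main()
```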
Thanks @radostyle, let's go with a default errback of None. I'll check and release this today.
I made changes to default to None, added some more docs, and also cleaned up the error message a bit so that it doesn't return the full traceback and instead logs the details to file: #158
This patch makes the errback method that is called on the spider configurable. By setting it to None in the configuration, the user can restore Scrapy's default exception handling. Currently scrapyrt has an undocumented bug: exceptions are sent to the spider's 'parse' method.
Here is the commit where this was added:
75d2b3e
The errback should never have been defaulted to the spider's 'parse' method. Doing so contradicts what the Scrapy docs say, and there is no documentation on the Scrapy site saying that exceptions get sent to the `parse` method. This was discovered because the error handling in the `process_spider_exception` middleware hook was never getting called as the Scrapy docs said it should be. The original author was adding a feature so that the errback could be passed as a GET parameter, as in the case of the "callback".
It seems they copy-pasted the "callback" line to get it working, not realizing that 'parse' was a bad default for an errback. The correct default is to let the exception propagate to Scrapy's existing handling.
For backwards compatibility with anyone who now relies on exceptions being sent to their 'parse' method, this patch keeps the bug but adds some documentation, and allows users who want unmodified Scrapy exception handling to get it back.
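For illustration, the opt-out might look like this in a scrapyrt settings module (the setting name here is hypothetical; check the documentation added alongside #158 for the actual name):

```python
# scrapyrt settings module -- DEFAULT_ERRBACK_NAME is a hypothetical name
# for the new setting. None restores Scrapy's own exception handling;
# 'parse' keeps the legacy scrapyrt behavior described above.
DEFAULT_ERRBACK_NAME = None
```

Per request, the legacy behavior can still be selected explicitly with the `errback=parse` GET parameter shown earlier.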