
feat: Provide option to retry request on rateLimitExceeded errors #809

Closed
jimfulton opened this issue Jul 25, 2021 · 13 comments
Labels: api: bigquery, type: feature request

Comments

@jimfulton
Contributor

This is mainly needed for sqlalchemy-bigquery compliance tests, but I can imagine it being useful for other use cases. :)

SQLAlchemy dialect compliance tests create tables at the beginning of a suite of tests and reuse them for the suite.

If a test fails due to rateLimitExceeded errors, the tables are left in an unexpected state. Retrying the test may not work.
There's an apparent option to recreate tables for each test, but it's not implemented correctly. I could probably fix it for current SQLAlchemy releases (1.4), but I'd probably have to monkey-patch 1.3.

I propose two new query-job config settings (sketched below):

  • retry_on_rate_limit_exceeded: int, the maximum number of retries when the client gets an "Exceeded rate limits" error; defaults to 0.
  • retry_delay_on_rate_limit_exceeded: int, the number of seconds to wait before retrying when the client gets an "Exceeded rate limits" error; defaults to 60.
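For illustration, a minimal usage sketch assuming the two settings above landed on QueryJobConfig under the proposed names (neither exists today; both attribute names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical: neither setting exists yet; the names match the proposal above.
config = bigquery.QueryJobConfig()
config.retry_on_rate_limit_exceeded = 5         # give up after 5 retries
config.retry_delay_on_rate_limit_exceeded = 60  # wait 60 seconds between attempts

job = client.query("SELECT 1", job_config=config)
rows = job.result()
```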
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jul 25, 2021
@jimfulton jimfulton self-assigned this Jul 25, 2021
@jimfulton
Contributor Author

@tswast Are you good with this?

@tswast
Contributor

tswast commented Jul 26, 2021

I think this would have to be done at the DB-API layer -- and I'd actually like to retry queries by default there. Although maybe it could work on the query job.

It's a bit trickier with the query job, since we can't retry if the user has set a job ID. Also, we'd need to track whether the user explicitly set a destination table or whether the destination table was set by the API (in which case we need to unset it before retrying the query).

@tswast
Contributor

tswast commented Jul 26, 2021

I'd actually expect to configure this with a retry object from the API core library. Obviously, it'd have to use a different predicate than our normal "retry this API request" defaults.
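For illustration, a sketch of what such a retry object could look like with google.api_core. The predicate here (checking for a "rateLimitExceeded" reason on a 403 Forbidden) is an assumption about how the error surfaces, not the library's shipped default:

```python
from google.api_core import exceptions, retry


def _is_rate_limit_exceeded(exc):
    # Assumption: rateLimitExceeded surfaces as a 403 Forbidden whose
    # errors list carries a "rateLimitExceeded" reason.
    return isinstance(exc, exceptions.Forbidden) and any(
        error.get("reason") == "rateLimitExceeded"
        for error in getattr(exc, "errors", ())
    )


rate_limit_retry = retry.Retry(
    predicate=_is_rate_limit_exceeded,
    initial=1.0,     # first backoff, in seconds
    maximum=60.0,    # cap on backoff between attempts
    multiplier=2.0,  # exponential growth factor
    deadline=600.0,  # give up after 10 minutes overall
)
```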

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jul 26, 2021
@tswast
Contributor

tswast commented Jul 26, 2021

As discussed offline, we can make the query job save the original request so that we don't have to worry about anything except for resetting the job ID before retrying.
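A minimal sketch of that idea, assuming the saved request is re-issued through client.query(); the wrapper below is illustrative, not library code, and only the job ID changes between attempts:

```python
import uuid

from google.cloud import bigquery

client = bigquery.Client()


def query_with_fresh_job_id(sql, job_config=None):
    # Re-issuing the identical jobs.insert request (same job ID) fails with
    # "Already Exists", so each attempt must carry a brand-new job ID.
    return client.query(sql, job_config=job_config, job_id=str(uuid.uuid4()))
```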

@tswast
Contributor

tswast commented Jul 26, 2021

Closing in favor of #539

@jimfulton
Contributor Author

Reopened because the scope here is a little different and I'd like to discuss without spamming Jake Summers.

@jimfulton
Contributor Author

I want to bring up a couple of issues.

Let's start with result().

@tswast is your thought that result() would retry by re-executing the original request if there's an error?

@tswast
Contributor

tswast commented Jul 27, 2021

@tswast is your thought that result() would retry by re-executing the original request if there's an error?

Yes, I think that's the way it'd have to work for at least some jobs.

Though as you found out, it seems we can do some detection of failed-but-retryable jobs as early as the call to begin. Maybe we should start there, since it's less surprising to re-issue a job right away.

With result(), the job could have been created quite a while ago. For example, if the job object was created by a call to get_job(), I wouldn't expect result() to re-issue the job.

@jimfulton
Contributor Author

Though as you found out, it seems we can do some detection of failed-but-retryable jobs as early as the call to begin. Maybe we should start there, since it's less surprising to re-issue a job right away.

With begin we can retry at the API level. That's what the current code does when an API call raises an exception. In the SQLAlchemy tests, this is where most of the errors seem to happen, which is good.

We want to retry at a higher level if we get an error from get_job within the client's query method, or when calling QueryJob.result.
In those cases, we want to avoid retrying at the API level, as it's a waste of time.

@jimfulton
Contributor Author

Next, retry predicates. Retry is based on exceptions, but BigQuery client API calls often don't use exceptions to report errors.

This makes predicates hard to express. You either need a new kind of Retry with a predicate that's applied to non-error results, or you need a flag that turns error responses (status code 200, but with an error status in the response data) into exceptions.
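A sketch of the second option, built around a hypothetical helper that inspects the job resource; the reason-to-status mapping is illustrative only:

```python
from google.api_core import exceptions

# Illustrative mapping only; real code would cover more reasons.
_REASON_TO_HTTP_STATUS = {"rateLimitExceeded": 403, "backendError": 500}


def raise_if_job_errored(job_resource):
    # Hypothetical helper: jobs.insert/jobs.get can return HTTP 200 while
    # reporting a failed job in status.errorResult. Raising here lets an
    # ordinary Retry predicate see the failure as an exception.
    error = job_resource.get("status", {}).get("errorResult")
    if error is not None:
        status = _REASON_TO_HTTP_STATUS.get(error.get("reason"), 500)
        raise exceptions.from_http_status(
            status, error.get("message", ""), errors=[error]
        )
```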

As I discovered in our call today, we'd want to retry at the API level for begin, but not for result because we don't want to retry API calls on failed jobs.

Maybe (thinking out loud):

  • We want Client._call_api to always raise exceptions for error responses (those with 200 status codes), and have the retry machinery retry them using the retry passed to query, but
  • after begin, we don't want to pass user retry objects to get_job and PollingFuture.result. If we use retries for those, the predicates should be narrow, so as not to retry API calls for failed jobs. Rather, we should apply the user-provided retry (and our sensible-default retry) at the level of Client.query and QueryJob.result (see the sketch after this list).
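To make the second bullet concrete, a sketch of applying retry at the Client.query level rather than per API call. query_and_wait and its behavior are assumptions, not the library's API, and it reuses the hypothetical _is_rate_limit_exceeded predicate sketched earlier:

```python
from google.api_core import retry

# Reuses the hypothetical _is_rate_limit_exceeded predicate sketched above.
job_level_retry = retry.Retry(predicate=_is_rate_limit_exceeded, deadline=600.0)


def query_and_wait(client, sql):
    # Hypothetical wrapper: each attempt issues a brand-new job (no explicit
    # job_id, so the client generates a fresh one) and waits on it, so the
    # whole issue-and-poll sequence is retried, not a single HTTP request.
    def attempt():
        job = client.query(sql)
        return job.result()

    return job_level_retry(attempt)()
```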

@jimfulton
Contributor Author

@tswast I'm confused about whether API requests should be retried after begin. I think I heard you and Seth say that any error response (a 200 response with a JSON body containing an errorResult status) indicates a failed job, not a failed API request. Is that right?

In fact, maybe there's a deeper semantic I'm missing about this kind of response.

@tswast
Contributor

tswast commented Jul 27, 2021

That's correct. The job was created, so retrying exactly the same request will fail. But retrying with a fresh job ID might succeed.

@tswast
Contributor

tswast commented Jul 27, 2021

We want Client._call_api to always raise exceptions for error responses (those with 200 status codes), and have the retry machinery retry them using the retry passed to query, but

That's unlikely to work because of re-used job IDs.
