
Retry temporary failures for Bigtable scans #2255

Closed

sduskis opened this issue Sep 7, 2016 · 15 comments
Assignees
Labels
api: bigtable Issues related to the Bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@sduskis
Contributor

sduskis commented Sep 7, 2016

Cloud Bigtable can return DEADLINE_EXCEEDED after 5 minutes on a scan request, even if there are more responses to be returned. In that case, as well as for INTERNAL, UNAVAILABLE, or ABORTED, the scan should be retried starting from the key immediately after the last one returned.
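A minimal sketch of the requested behavior, with all names hypothetical (the real client would drive this from its gRPC stream rather than a plain callable):

```python
# Hypothetical sketch: resume a scan after a transient failure from the
# key immediately after the last row successfully returned.

RETRYABLE = {"DEADLINE_EXCEEDED", "INTERNAL", "UNAVAILABLE", "ABORTED"}


class TransientError(Exception):
    """Stand-in for a gRPC error carrying a status-code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def scan_with_retries(read_rows, start_key=b""):
    """Yield (row_key, row) pairs from read_rows(start_key), restarting
    just past the last delivered key on any RETRYABLE failure."""
    last_key = None
    while True:
        try:
            # b'\x00' is the smallest suffix, so last_key + b'\x00' is
            # the first key strictly after last_key.
            resume = start_key if last_key is None else last_key + b"\x00"
            for row_key, row in read_rows(resume):
                last_key = row_key
                yield row_key, row
            return
        except TransientError as exc:
            if exc.code not in RETRYABLE:
                raise
```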

@dhermes dhermes added the api: bigtable Issues related to the Bigtable API. label Sep 7, 2016
@dhermes
Contributor

dhermes commented Sep 7, 2016

@sduskis You mean during ReadRows, instead of a ReadRowsResponse popping off the stream, the client encounters one of the errors listed:

  1. DEADLINE_EXCEEDED
  2. ABORTED
  3. INTERNAL
  4. UNAVAILABLE

So we could limit the change to the place we consume the next item from the stream?

@sduskis
Contributor Author

sduskis commented Sep 7, 2016

Taking a step back, these 4 errors can be recovered from in most RPC calls. MutateRow, MutateRows, SampleRowKeys, and checkAndMutate can all be retried, but with slightly different nuances. That said, ReadRows was the first reported issue where retries are most likely to fix the problem.

In terms of ReadRows, I think you are correct about where the code fix ought to be.

@garye
Contributor

garye commented Mar 24, 2017

@dhermes @tseaver, @calpeyser will be taking a look at implementing this.

Note that it's not enough to just resend the same request after a failure; we need to invoke ReadRows with a start_key that is immediately after the last row key we have successfully read. It looks like this will require some refactoring, since PartialRowsData doesn't actually invoke the API call currently.

@calpeyser
Contributor

calpeyser commented Mar 29, 2017

Hey guys - I've had a look, here's a proposal:

(1) Implementing Retries - we can probably do this by combining "retryable" in gax-python with the "response iterator" convention currently used in table#read_rows.

Basically, we can wrap the GRPC stream in a callable class that does the following:

  • Keeps track of a set of row keys that still need to be read as state in the class
  • When the class is called, pop an element off the stream.
  • If that element is a valid row, remove its key from the set of keys to be read, and return it.
  • If that element is a GRPC error, propagate the Exception up the call stack.

We then take the GRPC wrapper, and wrap it again in a gax-python retryable. This way, if an idempotent Exception is propagated up from a call, the call will be retried.
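A rough sketch of that wrapper, with hypothetical names, tracking state as the set of keys still to be read per the bullets above:

```python
# Rough sketch of the proposed stream wrapper (all names hypothetical).
# It pops items off an underlying iterator, tracks which keys remain to
# be read, and lets errors propagate so an outer retry layer (e.g. a
# gax-python "retryable") can simply call it again.

class RetryableReadRows:
    def __init__(self, make_stream, keys_to_read):
        self._make_stream = make_stream  # factory: keys -> iterator of (key, row)
        self._keys_to_read = set(keys_to_read)
        self._stream = None

    def __call__(self):
        """Return the next (key, row), or None when the scan is done.
        A transient error propagates up; on the next call, the stream is
        re-created covering only the keys not yet read."""
        if self._stream is None:
            self._stream = self._make_stream(sorted(self._keys_to_read))
        try:
            key, row = next(self._stream)
        except StopIteration:
            return None
        except Exception:
            self._stream = None  # rebuild on the retried call
            raise
        self._keys_to_read.discard(key)
        return key, row
```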

(2) Testing - we need some new functionality over the current unit testing infrastructure, since those tests make assertions about the RPCs made by the client, and do not take responses into account. We'll need to add an integration testing framework that will spin up a mock server and make read_rows calls against it.

@garye's test script should allow us to model errors sent by the server and the expectation of retries. So, we'll need to add

  • A mock server. I have access to a Go dummy server, but how do we ensure developers can run tests on all platforms? e.g. if we publish a Linux binary, Mac developers won't be able to run the unit tests. Do I need to re-implement the server in Python?
  • Python tests that spin up a Client instance connecting to the server. This is tricky because the Client API does not seem to support the custom configuration values we need: letting the client run without credentials, and providing a host and port that points at the locally running server. That is, we need the moral equivalent of BigtableOptionsFactory.

@garye
Contributor

garye commented Mar 30, 2017

(1) With the current interface that only takes a row range, you don't know the row keys that still need to be read. You do know the last key that was read before an error (they come back in lexicographical order), which tells you which row range you need to send in the retried request.
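That resumed range's start can be computed as the lexicographic successor of the last key read; a one-liner sketch (name hypothetical):

```python
def resume_start_key(last_read_key: bytes) -> bytes:
    # Appending a zero byte gives the smallest byte string strictly
    # greater than last_read_key, so the retried scan skips exactly the
    # rows already delivered and nothing else.
    return last_read_key + b"\x00"
```

An API that accepts an exclusive start bound (e.g. a start_key_open field) can take last_read_key directly instead.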

(2) I don't think re-implementing the mock server in Python is a great use of time. We can easily build Linux, Mac, and Windows binaries of the mock server and do something like pull one down from a public GCS bucket when the systests run, depending on the current platform. I think the systests already install the Bigtable emulator via gcloud, which is along the same lines.

The shortcut here is to use the BIGTABLE_EMULATOR_HOST environment variable as described here. The python library has support for it and should solve this problem for us. The mock retry server is essentially the emulator with different guts.
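For reference, wiring a test client to a local emulator or mock server via that variable looks like this (the address is an example):

```shell
# When BIGTABLE_EMULATOR_HOST is set, the client library connects to the
# given address without credentials instead of the production endpoint.
export BIGTABLE_EMULATOR_HOST=localhost:8086
```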

@lukesneeringer lukesneeringer added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Mar 31, 2017
@calpeyser
Contributor

I gave the above a whirl - please see #3279

@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!

@mbrukman
Contributor

This issue should be re-opened; this is a critical feature for the Bigtable client library that needs to be tracked until it is completed.

My understanding of the history of this issue: PR #3279 was the first attempt, which was superseded by PR #3324, which was submitted but then reverted in PR #3642, so this has not yet been merged. There's a branch pr-3324-to-fix that was kept around with the relevant changes; it needs to be updated to pass the tests, and then the re-merge can be attempted.

Please re-open this issue so that we can track its progress.

@chemelnucfin
Contributor

@jonparrott @zakons Do you know if retry and the recent work in bigtable would have fixed this? If not, that's ok, I'll follow up with this later.
Thanks.

@chemelnucfin
Contributor

chemelnucfin commented Jan 22, 2018

Hello, feature requests will now be tracked in the Feature Requests project. I will close this issue now; please feel free to continue to address any issues/concerns here.

@alugowski

Can someone with permissions please add this to that Feature Requests page?

This issue means that nearly all online examples of how to iterate over Bigtable are misleading; a simple for-in loop will unexpectedly crash at some point. The default iteration behavior should be at least somewhat retry-tolerant.

@tseaver
Contributor

tseaver commented Jul 17, 2018

@alugowski, @sduskis Since the merge of PR #5178, the new low-level GAPIC clients already handle retries, based on the relevant *_client_config modules.

@sduskis
Contributor Author

sduskis commented Jul 17, 2018

The low level GAPIC retries don't work for streaming RPCs like Cloud Bigtable Read Rows. Default retries just retry the original request until it completes. Retries for streaming RPCs need to pick up from where they previously left off, which requires custom logic to modify the request. We actually put in a fix for this 4 months ago with PR #4882.
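A toy contrast between the two retry shapes, with hypothetical names; the resuming variant rewrites the request's start key between attempts, which is what a streaming-aware fix has to do:

```python
# Toy contrast: re-sending the original streaming request duplicates
# rows already handed to the caller, while a resuming retry rewrites the
# request to start just past the last delivered key.

def retry_plain(call, request):
    """Retry the identical request; rows seen before a failure repeat."""
    delivered = []
    while True:
        try:
            for item in call(request):
                delivered.append(item)  # handed to the caller as it arrives
            return delivered
        except ConnectionError:
            continue


def retry_resuming(call, request):
    """Rewrite start_key after each delivered row before any retry."""
    delivered = []
    while True:
        try:
            for key, row in call(request):
                delivered.append((key, row))
                request = dict(request, start_key=key + b"\x00")
            return delivered
        except ConnectionError:
            continue
```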

@tseaver
Contributor

tseaver commented Jul 17, 2018

@sduskis OK, looks like #4882 is not part of a release, so it will be included in the one we are driving toward this month.

@alugowski

Thanks!

Sounds like this kludge will no longer be necessary:

while True:
    try:
        for row in table.yield_rows(start_key=start_key):
            row_key = row.row_key.decode('utf-8')
            yield (...)

            # Resume just past this row if iteration raises; restarting
            # at row_key itself would yield it a second time.
            start_key = row_key + '\x00'
        break
    except Exception as e:
        logging.warning("Exception while iterating over Bigtable. Will continue retrying. {}".format(e))

When the iteration takes on the order of hours, the logs fill up with this:

<_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Error while reading table 'blahblahblah'"
	debug_error_string = "{"created":"@1531838276.260045000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1083,"grpc_message":"Error while reading table 'blahblahblah'","grpc_status":4}"

Projects
None yet
Development

No branches or pull requests

9 participants