
Retry temporary failures for Bigtable scans #2255

Closed

sduskis opened this issue Sep 7, 2016 · 15 comments
Assignees
Labels
api: bigtable Issues related to the Bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@sduskis
Contributor

sduskis commented Sep 7, 2016

Cloud Bigtable can return DEADLINE_EXCEEDED after 5 minutes on a scan request, even if there are more responses to be returned. In that case, as well as for INTERNAL, UNAVAILABLE, or ABORTED, the scan should be retried starting from the key immediately after the last one returned.
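A minimal sketch of the requested behavior, with all names hypothetical (the real client would drive this from its gRPC stream rather than a plain callable):

```python
# Hypothetical sketch: resume a scan after a transient failure from the
# key immediately after the last row successfully returned.

RETRYABLE = {"DEADLINE_EXCEEDED", "INTERNAL", "UNAVAILABLE", "ABORTED"}


class TransientError(Exception):
    """Stand-in for a gRPC error carrying a status-code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def scan_with_retries(read_rows, start_key=b""):
    """Yield (row_key, row) pairs from read_rows(start_key), restarting
    just past the last delivered key on any RETRYABLE failure."""
    last_key = None
    while True:
        try:
            # b'\x00' is the smallest suffix, so last_key + b'\x00' is
            # the first key strictly after last_key.
            resume = start_key if last_key is None else last_key + b"\x00"
            for row_key, row in read_rows(resume):
                last_key = row_key
                yield row_key, row
            return
        except TransientError as exc:
            if exc.code not in RETRYABLE:
                raise
```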

@dhermes dhermes added the api: bigtable Issues related to the Bigtable API. label Sep 7, 2016
@dhermes
Contributor

dhermes commented Sep 7, 2016

@sduskis You mean during ReadRows, instead of a ReadRowsResponse popping off the stream, the client encounters one of the errors listed:

  1. DEADLINE_EXCEEDED
  2. ABORTED
  3. INTERNAL
  4. UNAVAILABLE

So we could limit the change to the place we consume the next item from the stream?

@sduskis
Contributor Author

sduskis commented Sep 7, 2016

Taking a step back, these 4 errors can be recovered from in most RPC calls. MutateRow, MutateRows, SampleRowKeys, and checkAndMutate can all be retried, but with slightly different nuances. That said, ReadRows was the first reported issue where retries are most likely to fix the problem.

In terms of ReadRows, I think you are correct about where the code fix ought to be.

@garye
Contributor

garye commented Mar 24, 2017

@dhermes @tseaver, @calpeyser will be taking a look at implementing this.

Note that it's not enough to just resend the same request after a failure; we need to invoke ReadRows with a start_key that is immediately after the last row key we have successfully read. It looks like this will require some refactoring, since PartialRowsData doesn't actually invoke the API call currently.

@calpeyser
Contributor

calpeyser commented Mar 29, 2017

Hey guys - I've had a look, here's a proposal:

(1) Implementing Retries - we can probably do this by combining "retryable" in gax-python with the "response iterator" convention currently used in table#read_rows.

Basically, we can wrap the GRPC stream in a callable class that does the following:

  • Keeps track of a set of row keys that still need to be read as state in the class
  • When the class is called, pop an element off the stream.
  • If that element is a valid row, remove its key from the set of keys to be read, and return it.
  • If that element is a GRPC error, propagate the Exception up the call stack.

We then take the GRPC wrapper, and wrap it again in a gax-python retryable. This way, if an idempotent Exception is propagated up from a call, the call will be retried.
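A rough sketch of that wrapper, with hypothetical names, tracking state as the set of keys still to be read per the bullets above:

```python
# Rough sketch of the proposed stream wrapper (all names hypothetical).
# It pops items off an underlying iterator, tracks which keys remain to
# be read, and lets errors propagate so an outer retry layer (e.g. a
# gax-python "retryable") can simply call it again.

class RetryableReadRows:
    def __init__(self, make_stream, keys_to_read):
        self._make_stream = make_stream  # factory: keys -> iterator of (key, row)
        self._keys_to_read = set(keys_to_read)
        self._stream = None

    def __call__(self):
        """Return the next (key, row), or None when the scan is done.
        A transient error propagates up; on the next call, the stream is
        re-created covering only the keys not yet read."""
        if self._stream is None:
            self._stream = self._make_stream(sorted(self._keys_to_read))
        try:
            key, row = next(self._stream)
        except StopIteration:
            return None
        except Exception:
            self._stream = None  # rebuild on the retried call
            raise
        self._keys_to_read.discard(key)
        return key, row
```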

(2) Testing - we need some new functionality over the current unit testing infrastructure, since those tests make assertions about the RPCs made by the client, and do not take responses into account. We'll need to add an integration testing framework that will spin up a mock server and make read_rows calls against it.

@garye's test script should allow us to model errors sent by the server and the expectation of retries. So, we'll need to add

  • A mock server. I have access to a Go dummy server, but how do we ensure developers can run tests on all platforms? e.g. if we publish a Linux binary, Mac developers won't be able to run the unit tests. Do I need to re-implement the server in Python?
  • Python tests that spin up a Client instance connecting to the server. This is tricky because the Client API does not seem to support the custom configuration values we need: letting the client run without credentials, and providing a host and port that points at the locally running server. That is, we need the moral equivalent of BigtableOptionsFactory.

@garye
Contributor

garye commented Mar 30, 2017

(1) With the current interface that only takes a row range, you don't know the row keys that still need to be read. You do know the last key that was read before an error (they come back in lexicographical order), which tells you which row range you need to send in the retried request.
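That resumed range's start can be computed as the lexicographic successor of the last key read; a one-liner sketch (name hypothetical):

```python
def resume_start_key(last_read_key: bytes) -> bytes:
    # Appending a zero byte gives the smallest byte string strictly
    # greater than last_read_key, so the retried scan skips exactly the
    # rows already delivered and nothing else.
    return last_read_key + b"\x00"
```

An API that accepts an exclusive start bound (e.g. a start_key_open field) can take last_read_key directly instead.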

(2) I don't think re-implementing the mock server in Python is a great use of time. We can easily build Linux, Mac, and Windows binaries of the mock server and do something like pull one down from a public GCS bucket when the systests run, depending on the current platform. I think the systests already install the Bigtable emulator via gcloud, which is along the same lines.

The shortcut here is to use the BIGTABLE_EMULATOR_HOST environment variable as described here. The python library has support for it and should solve this problem for us. The mock retry server is essentially the emulator with different guts.
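For reference, wiring a test client to a local emulator or mock server via that variable looks like this (the address is an example):

```shell
# When BIGTABLE_EMULATOR_HOST is set, the client library connects to the
# given address without credentials instead of the production endpoint.
export BIGTABLE_EMULATOR_HOST=localhost:8086
```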

@lukesneeringer lukesneeringer added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Mar 31, 2017
@calpeyser
Contributor

I gave the above a whirl - please see #3279

@lukesneeringer
Contributor

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!

@mbrukman
Contributor

This issue should be re-opened; this is a critical feature for the Bigtable client library that needs to be tracked until it is completed.

My understanding of the history of this issue: PR #3279 was the first attempt, which was superseded by PR #3324, which was submitted but then reverted in PR #3642, so this has not yet been merged. There's a branch pr-3324-to-fix that was kept around with the relevant changes; it needs to be updated to pass the tests, and then the re-merge can be attempted.

Please re-open this issue so that we can track its progress.

@chemelnucfin
Contributor

@jonparrott @zakons Do you know if retry and the recent work in bigtable would have fixed this? If not, that's ok, I'll follow up with this later.
Thanks.

@chemelnucfin
Contributor

chemelnucfin commented Jan 22, 2018

Hello, feature requests will now be tracked in the Feature Requests project. I will close this issue now; please feel free to continue to address any issues/concerns here.

@alugowski

Can someone with permissions please add this to that Feature Requests page?

This issue means that nearly all online examples of how to iterate over Bigtable are misleading; a simple for-in loop will unexpectedly crash at some point. The default iteration behavior should be at least somewhat retry-tolerant.

@tseaver
Contributor

tseaver commented Jul 17, 2018

@alugowski, @sduskis Since the merge of PR #5178, the new low-level GAPIC clients already handle retries, based on the relevant *_client_config modules.

@sduskis
Contributor Author

sduskis commented Jul 17, 2018

The low level GAPIC retries don't work for streaming RPCs like Cloud Bigtable Read Rows. Default retries just retry the original request until it completes. Retries for streaming RPCs need to pick up from where they previously left off, which requires custom logic to modify the request. We actually put in a fix for this 4 months ago with PR #4882.
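A toy contrast between the two retry shapes, with hypothetical names; the resuming variant rewrites the request's start key between attempts, which is what a streaming-aware fix has to do:

```python
# Toy contrast: re-sending the original streaming request duplicates
# rows already handed to the caller, while a resuming retry rewrites the
# request to start just past the last delivered key.

def retry_plain(call, request):
    """Retry the identical request; rows seen before a failure repeat."""
    delivered = []
    while True:
        try:
            for item in call(request):
                delivered.append(item)  # handed to the caller as it arrives
            return delivered
        except ConnectionError:
            continue


def retry_resuming(call, request):
    """Rewrite start_key after each delivered row before any retry."""
    delivered = []
    while True:
        try:
            for key, row in call(request):
                delivered.append((key, row))
                request = dict(request, start_key=key + b"\x00")
            return delivered
        except ConnectionError:
            continue
```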

@tseaver
Contributor

tseaver commented Jul 17, 2018

@sduskis OK, looks like #4882 is not part of a release, so it will be included in the one we are driving toward this month.

@alugowski

Thanks!

Sounds like this kludge will no longer be necessary:

while True:
    try:
        for row in table.yield_rows(start_key=start_key):
            row_key = row.row_key.decode('utf-8')
            yield (...)

            # Resume just past this row if iteration raises; restarting
            # at row_key itself would yield it a second time.
            start_key = row_key + '\x00'
        break
    except Exception as e:
        logging.warning("Exception while iterating over Bigtable. Will continue retrying. {}".format(e))

When the iteration takes on the order of hours, the logs fill up with this:

<_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Error while reading table 'blahblahblah'"
	debug_error_string = "{"created":"@1531838276.260045000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1083,"grpc_message":"Error while reading table 'blahblahblah'","grpc_status":4}"

Projects
None yet
Development

No branches or pull requests

9 participants