
Bigtable: read_rows: no deadline causing stuck clients; no way to change it #6

Closed
evanj opened this issue May 29, 2019 · 6 comments · Fixed by googleapis/python-api-core#20 or #16
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

evanj commented May 29, 2019

Table.read_rows does not set any deadline, so it can hang forever if the Bigtable server connection hangs. We see this happening once every week or two when running inside GCP, which causes our server to get stuck indefinitely. There appears to be no way in the API to set a deadline, even though the documentation says that the retry parameter should do this. Due to a bug, it does not.

Details:

We are calling Table.read_rows to read ~2 rows from Bigtable. Using pyflame on a stuck process, both worker threads were waiting on Bigtable, with the stack trace below. I believe the bug is the following:

  1. Call Table.read_rows. This calls PartialRowsData, passing the retry argument, which defaults to DEFAULT_RETRY_READ_ROWS. The default misleadingly sets deadline=60.0. It also passes read_method=self._instance._client.table_data_client.transport.read_rows to PartialRowsData, which is a method on BigtableGrpcTransport.
  2. PartialRowsData.__init__ calls read_method(); this is actually the raw gRPC _UnaryStreamMultiCallable, not the gapic BigtableClient.read_rows, which, AFAICS, is never called. Hence, this gRPC streaming call is started without any deadline.
  3. PartialRowsData.__iter__ calls self._read_next_response, which calls return self.retry(self._read_next, on_error=self._on_error)(). This gives the impression that retry is used, but if I understand gRPC streams correctly, I'm not sure that even makes sense. I think even if the gRPC stream returns some error, calling next won't actually retry the RPC; it will just immediately raise the same exception. To retry, I believe you need to actually restart the stream by calling read_rows again.
  4. If the Bigtable server now "hangs", the client hangs forever (see the sketch after this list).
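
To make the failure mode concrete, here is a condensed sketch of the code path described in steps 1-3. The class is a hypothetical paraphrase (simplified names, no row-merging logic), not the actual library source:

class PartialRowsDataSketch:
    """Hypothetical paraphrase of row_data.PartialRowsData (steps 1-3)."""

    def __init__(self, read_method, request, retry):
        # read_method is the raw gRPC _UnaryStreamMultiCallable; no timeout
        # is passed, so the stream is opened without a gRPC deadline (step 2).
        self.read_method = read_method
        self.request = request
        self.retry = retry
        self.response_iterator = read_method(request)

    def _read_next(self):
        # Blocks on the channel; if the server hangs, this never returns.
        return next(self.response_iterator)

    def _read_next_response(self):
        # The Retry only wraps next() on the already-open stream (step 3);
        # its deadline is checked between attempts, so a single blocked
        # next() call is never interrupted (step 4).
        return self.retry(self._read_next, on_error=None)()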

Possible fix:

Change Table.read_rows to call the gapic BigtableClient.read_rows with the retry parameter, and change PartialRowsData.__init__ to take this response iterator and not take a retry parameter at all. This would at least allow setting the gRPC streaming call deadline, although I don't think it will make retrying work (since I think the gRPC streaming client just immediately returns an iterator without actually waiting for a response from the server?).
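
For comparison, a hedged sketch of what going through the gapic layer looks like from user code; the import path and parameter names follow the generated surface of google-cloud-bigtable 0.33 and are assumptions, not a tested patch:

from google.cloud import bigtable_v2

gapic_client = bigtable_v2.BigtableClient()
responses = gapic_client.read_rows(
    table_name='projects/testing/instances/emulator/tables/emulator_table',
    timeout=5.0,  # propagated by the gapic layer to the gRPC call as its deadline
)
for response in responses:
    for chunk in response.chunks:
        print(chunk.row_key)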

I haven't actually tried implementing this to see if it works. For now, we will probably just make a raw gRPC read_rows call so we can set an appropriate timeout.
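
A minimal sketch of that workaround, assuming the transport layout of google-cloud-bigtable 0.33 (the same _stubs['bigtable_stub'] path used in the monkey patch later in this thread):

from google.cloud.bigtable_v2.proto import bigtable_pb2

def read_rows_with_timeout(table, timeout_seconds):
    transport = table._instance._client.table_data_client.transport
    request = bigtable_pb2.ReadRowsRequest(table_name=table.name)
    # timeout becomes the gRPC deadline for the entire streaming call
    return transport._stubs['bigtable_stub'].ReadRows(request, timeout=timeout_seconds)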

Environment details

OS: Linux, ContainerOS (GKE), Container is Debian9 (using distroless)
Python: 3.5.3
API: google-cloud-bigtable 0.33.0

Steps to reproduce

This program loads the Bigtable emulator with 1000 rows, calls read_rows(retry=DEFAULT_RETRY_READ_ROWS.with_deadline(5.0)), then sends SIGSTOP to pause the emulator. This SHOULD cause a DeadlineExceeded exception to be raised after 5 seconds. Instead, it hangs forever.

  1. Start the Bigtable emulator: gcloud beta emulators bigtable start
  2. Find the PID: ps ax | grep cbtemulator
  3. Run the following program with BIGTABLE_EMULATOR_HOST=localhost:8086 python3 bug.py $PID
from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
import os
import signal
import sys

COLUMN_FAMILY_ID = 'column_family_id'

def main():
    emulator_pid = int(sys.argv[1])
    client = bigtable.Client(project="testing", admin=True)
    instance = client.instance("emulator")

    # create/open a table
    table = instance.table("emulator_table")
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # write a bunch of data
    for i in range(1000):
        k = 'some_key_{:04d}'.format(i)
        print(k)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None

    print('starting read')
    rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    r1 = next(rows_iter)
    print('read', r1)
    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop')
    for r in rows_iter:
        print(r)
    print('done')


if __name__ == '__main__':
    main()

Stack trace of the hung server (using a slightly older version of the google-cloud-bigtable library):

/usr/local/lib/python2.7/threading.py:wait:340
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:_next:348
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:next:366
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next:426
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_target:179
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_wrapped_func:270
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next_response:430
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:__iter__:441
@evanj evanj changed the title Bigtable: read_rows: 1 hour deadline causing stuck clients; no way to change it Bigtable: read_rows: no deadline causing stuck clients; no way to change it May 29, 2019
ahill00 commented Jun 10, 2019

Any update on this?

mf2199 commented Jul 4, 2019

We have tried to reproduce the issue, starting the emulator on localhost:8086 and following the steps listed in the description. The statement os.kill(emulator_pid, signal.SIGSTOP) in the script tries to stop the emulator to check for the deadline error. Instead, the script hangs with no error messages or statements, right after executing that line. However, when we open another terminal and run the script from there using the same emulator, we receive a 504 Deadline Exceeded error. This happens after about 10s, though the deadline parameter is set to 5s. At this point, it is unclear whether the problem is with the client or the emulator, since the same code run in two instances of the same terminal produces inconsistent outcomes.

evanj commented Jul 10, 2019

@mf2199 That is exactly the bug: the Python script hangs forever, despite attempting to set a deadline (5 seconds in the repro script) as specified by the documentation. The following alternative version "monkey patches" the library to set the deadline on the gRPC call. You will notice this one eventually fails with DeadlineExceeded:

from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
import os
import google.oauth2.credentials
import signal
import sys
from google.cloud.bigtable_v2.gapic.transports import bigtable_grpc_transport

COLUMN_FAMILY_ID = 'column_family_id'
EMULATOR_ENVVAR = 'BIGTABLE_EMULATOR_HOST'
NUM_ROWS = 1000

def patched_read_rows_property(bigtable_transport, request):
    real_read_rows = bigtable_transport._stubs['bigtable_stub'].ReadRows
    print('patched_read_rows_property calling {} with timeout=0.5'.format(repr(real_read_rows)))
    return real_read_rows(request, timeout=0.5)


def main():
    # don't require real Google credentials to run this test
    fake_credentials = google.oauth2.credentials.Credentials('fake_token')

    if os.environ.get(EMULATOR_ENVVAR, '') == '':
        raise ValueError('Must specify ' + EMULATOR_ENVVAR)

    emulator_pid = int(sys.argv[1])
    client = bigtable.Client(project="testing", admin=True, credentials=fake_credentials)
    instance = client.instance("emulator")


    # create/open a table
    table = instance.table("emulator_table")
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # write a bunch of data
    for i in range(NUM_ROWS):
        k = 'some_key_{:04d}'.format(i)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None
    print('wrote {:d} rows'.format(NUM_ROWS))

    print('patching BigtableGrpcTransport ...')
    bigtable_grpc_transport.BigtableGrpcTransport.read_rows = patched_read_rows_property

    print('calling read_rows with deadline ...')
    rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    print('calling next(...) to start the read operation ...')
    r1 = next(rows_iter)

    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop; iterating through rows (will get stuck) ...')

    count = 0
    for r in rows_iter:
        count += 1
        print('read {:d} rows'.format(count))
    print('done')


if __name__ == '__main__':
    main()

evanj commented Jul 10, 2019

Actually, when I read your comment again, it sounds like you are getting some other result. The result I get when executing the above test on a Linux machine, with the two patching lines (lines 53 and 54 of the script) commented out, is the following:

$ BIGTABLE_EMULATOR_HOST=localhost:8086 venv3/bin/python script.py 11025
wrote 1000 rows
calling read_rows with deadline ...
calling next(...) to start the read operation ...
sent sigstop; iterating through rows (will get stuck) ...
read 1 rows
... (326 similar lines removed) ...
read 328 rows

After this point it will hang forever, or until we resume the Bigtable emulator with kill -CONT 11025. The version with the patch will eventually fail with a DeadlineExceeded exception, after it gets retried a few times.

The good news: this shows that I misunderstood the retry logic because I didn't understand the _on_error handler. The stream request is retried according to the Retry that is passed in. The part that is missing is propagating the deadline to the gRPC call, because this code calls the raw gRPC stub instead of BigtableClient.read_rows.
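
For reference, a condensed, hypothetical paraphrase of that _on_error handler (not the actual library source), showing why retries do work while the deadline is still not propagated:

def _on_error(self, exc):
    # Resume after the last row already read, or reuse the original request.
    retry_request = self._create_retry_request() if self.last_scanned_row_key else self.request
    # Reopen the stream. The new call again carries no gRPC deadline, which
    # is exactly the missing piece described above.
    self.response_iterator = self.read_method(retry_request)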

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020
@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/python-bigtable API. label Jan 31, 2020
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Feb 3, 2020
@crwilcox crwilcox added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 4, 2020
@frankyn frankyn removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Feb 4, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Feb 4, 2020
@kolea2 kolea2 added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 13, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Feb 13, 2020
kolea2 commented Mar 3, 2020

Hi @mf2199! Are you able to take a look at this?

mf2199 commented Mar 3, 2020

@kolea2 Hi there. Yes, I'm on it. The claim about the gapic BigtableClient.read_rows being bypassed in favor of the gRPC BigtableStub seems to be true. A 'quick-and-dirty' fix could be, as the author suggests, to reroute the calls through gapic and see whether that solves the problem. The results are different, however, if the code is run in a Windows environment: execution is immediately terminated with a "grpc_message":"Stream removed" type of error thrown by grpc/_channel/_MultiThreadedRendezvous, with no sign of hanging or waiting for a deadline whatsoever. Why the same code, which ought to work cross-platform, behaves differently remains to be seen. I'm setting up a Linux VM again to try to reproduce the error, which may give better insight.

On a side note, there might be an inconsistency in the documentation where it says "set the BIGTABLE_EMULATOR_HOST environment variable to the host and port where the Cloud Bigtable emulator is running". The Client.__init__() constructor, however, looks for BIGTABLE_EMULATOR and so fails to enter the emulator mode unless the variable name is corrected.
