
Bigtable: read_rows: no deadline causing stuck clients; no way to change it #6

Closed
evanj opened this issue May 29, 2019 · 6 comments · Fixed by googleapis/python-api-core#20 or #16
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

evanj commented May 29, 2019

Table.read_rows does not set any deadline, so it can hang forever if the Bigtable server connection hangs. We see this happening once every week or two when running inside GCP, which causes our server to get stuck indefinitely. There appears to be no way in the API to set a deadline, even though the documentation says that the retry parameter should do this. Due to a bug, it does not.

Details:

We are calling Table.read_rows to read ~2 rows from Bigtable. Using pyflame on a stuck process, both worker threads were waiting on Bigtable, with the stack trace below. I believe the bug is the following:

  1. Call Table.read_rows. This calls PartialRowsData, passing the retry argument, which defaults to DEFAULT_RETRY_READ_ROWS. The default misleadingly sets deadline=60.0. It also passes read_method=self._instance._client.table_data_client.transport.read_rows to PartialRowsData, which is a method on BigtableGrpcTransport.
  2. PartialRowsData.__init__ calls read_method(); this is actually the raw gRPC _UnaryStreamMultiCallable, not the gapic BigtableClient.read_rows, which, AFAICS, is never called. Hence, this gRPC streaming call is started without any deadline.
  3. PartialRowsData.__iter__ calls self._read_next_response, which calls return self.retry(self._read_next, on_error=self._on_error)(). This gives the impression that retry is used, but if I understand gRPC streams correctly, I'm not sure that even makes sense. I think even if the gRPC stream returns some error, calling next won't actually retry the RPC; it will just immediately raise the same exception. To retry, I believe you need to actually restart the stream by calling read_rows again.
  4. If the Bigtable server now "hangs", the client hangs forever (see the sketch after this list).
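
To make the failure mode concrete, here is a condensed sketch of the code path described in steps 1-3. The class is a hypothetical paraphrase (simplified names, no row-merging logic), not the actual library source:

class PartialRowsDataSketch:
    """Hypothetical paraphrase of row_data.PartialRowsData (steps 1-3)."""

    def __init__(self, read_method, request, retry):
        # read_method is the raw gRPC _UnaryStreamMultiCallable; no timeout
        # is passed, so the stream is opened without a gRPC deadline (step 2).
        self.read_method = read_method
        self.request = request
        self.retry = retry
        self.response_iterator = read_method(request)

    def _read_next(self):
        # Blocks on the channel; if the server hangs, this never returns.
        return next(self.response_iterator)

    def _read_next_response(self):
        # The Retry only wraps next() on the already-open stream (step 3);
        # its deadline is checked between attempts, so a single blocked
        # next() call is never interrupted (step 4).
        return self.retry(self._read_next, on_error=None)()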

Possible fix:

Change Table.read_rows to call the gapic BigtableClient.read_rows with the retry parameter, and change PartialRowsData.__init__ to take this response iterator and not take a retry parameter at all. This would at least allow setting the gRPC streaming call deadline, although I don't think it will make retrying work (since I think the gRPC streaming client just immediately returns an iterator without actually waiting for a response from the server?).
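
For comparison, a hedged sketch of what going through the gapic layer looks like from user code; the import path and parameter names follow the generated surface of google-cloud-bigtable 0.33 and are assumptions, not a tested patch:

from google.cloud import bigtable_v2

gapic_client = bigtable_v2.BigtableClient()
responses = gapic_client.read_rows(
    table_name='projects/testing/instances/emulator/tables/emulator_table',
    timeout=5.0,  # propagated by the gapic layer to the gRPC call as its deadline
)
for response in responses:
    for chunk in response.chunks:
        print(chunk.row_key)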

I haven't actually tried implementing this to see if it works. For now, we will probably just make a raw gRPC read_rows call so we can set an appropriate timeout.
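
A minimal sketch of that workaround, assuming the transport layout of google-cloud-bigtable 0.33 (the same _stubs['bigtable_stub'] path used in the monkey patch later in this thread):

from google.cloud.bigtable_v2.proto import bigtable_pb2

def read_rows_with_timeout(table, timeout_seconds):
    transport = table._instance._client.table_data_client.transport
    request = bigtable_pb2.ReadRowsRequest(table_name=table.name)
    # timeout becomes the gRPC deadline for the entire streaming call
    return transport._stubs['bigtable_stub'].ReadRows(request, timeout=timeout_seconds)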

Environment details

OS: Linux, ContainerOS (GKE), Container is Debian9 (using distroless)
Python: 3.5.3
API: google-cloud-bigtable 0.33.0

Steps to reproduce

This program loads the Bigtable emulator with 1000 rows, calls read_rows(retry=DEFAULT_RETRY_READ_ROWS.with_deadline(5.0)), then sends SIGSTOP to pause the emulator. This SHOULD cause a DeadlineExceeded exception to be raised after 5 seconds. Instead, it hangs forever.

  1. Start the Bigtable emulator: gcloud beta emulators bigtable start
  2. Find the PID: ps ax | grep cbtemulator
  3. Run the following program with BIGTABLE_EMULATOR_HOST=localhost:8086 python3 bug.py $PID
from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
import os
import signal
import sys

COLUMN_FAMILY_ID = 'column_family_id'

def main():
    emulator_pid = int(sys.argv[1])
    client = bigtable.Client(project="testing", admin=True)
    instance = client.instance("emulator")

    # create/open a table
    table = instance.table("emulator_table")
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # write a bunch of data
    for i in range(1000):
        k = 'some_key_{:04d}'.format(i)
        print(k)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None

    print('starting read')
    rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    r1 = next(rows_iter)
    print('read', r1)
    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop')
    for r in rows_iter:
        print(r)
    print('done')


if __name__ == '__main__':
    main()

Stack trace of the hung server (using a slightly older version of the google-cloud-bigtable library):

/usr/local/lib/python2.7/threading.py:wait:340
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:_next:348
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:next:366
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next:426
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_target:179
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_wrapped_func:270
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next_response:430
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:__iter__:441
@evanj evanj changed the title Bigtable: read_rows: 1 hour deadline causing stuck clients; no way to change it Bigtable: read_rows: no deadline causing stuck clients; no way to change it May 29, 2019
ahill00 commented Jun 10, 2019

Any update on this?

mf2199 commented Jul 4, 2019

We have tried to reproduce the issue, starting the emulator on localhost:8086 and following the steps listed in the description. The statement os.kill(emulator_pid, signal.SIGSTOP) in the script tries to stop the emulator to check for the deadline error. Instead, the script hangs with no error messages or statements, right after executing that line. However, when we open another terminal and run the script from there using the same emulator, we receive a 504 Deadline Exceeded error. This happens after about 10s, though the deadline parameter is set to 5s. At this point, it is unclear whether the problem is with the client or the emulator, since the same code run in two instances of the same terminal produces inconsistent outcomes.

evanj commented Jul 10, 2019

@mf2199 That is exactly the bug: the Python script hangs forever, despite attempting to set a deadline (5 seconds in the repro script) as specified by the documentation. The following alternative version "monkey patches" the library to set the deadline on the gRPC call. You will notice this one eventually fails with DeadlineExceeded:

from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
import os
import google.oauth2.credentials
import signal
import sys
from google.cloud.bigtable_v2.gapic.transports import bigtable_grpc_transport

COLUMN_FAMILY_ID = 'column_family_id'
EMULATOR_ENVVAR = 'BIGTABLE_EMULATOR_HOST'
NUM_ROWS = 1000

def patched_read_rows_property(bigtable_transport, request):
    real_read_rows = bigtable_transport._stubs['bigtable_stub'].ReadRows
    print('patched_read_rows_property calling {} with timeout=0.5'.format(repr(real_read_rows)))
    return real_read_rows(request, timeout=0.5)


def main():
    # don't require real Google credentials to run this test
    fake_credentials = google.oauth2.credentials.Credentials('fake_token')

    if os.environ.get(EMULATOR_ENVVAR, '') == '':
        raise ValueError('Must specify ' + EMULATOR_ENVVAR)

    emulator_pid = int(sys.argv[1])
    client = bigtable.Client(project="testing", admin=True, credentials=fake_credentials)
    instance = client.instance("emulator")


    # create/open a table
    table = instance.table("emulator_table")
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # write a bunch of data
    for i in range(NUM_ROWS):
        k = 'some_key_{:04d}'.format(i)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None
    print('wrote {:d} rows'.format(NUM_ROWS))

    print('patching BigtableGrpcTransport ...')
    bigtable_grpc_transport.BigtableGrpcTransport.read_rows = patched_read_rows_property

    print('calling read_rows with deadline ...')
    rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    print('calling next(...) to start the read operation ...')
    r1 = next(rows_iter)

    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop; iterating through rows (will get stuck) ...')

    count = 0
    for r in rows_iter:
        count += 1
        print('read {:d} rows'.format(count))
    print('done')


if __name__ == '__main__':
    main()

evanj commented Jul 10, 2019

Actually, when I read your comment again, it sounds like you are getting some other result. The result I get when executing the above test on a Linux machine, with the two patching lines (lines 53 and 54 of the script) commented out, is the following:

$ BIGTABLE_EMULATOR_HOST=localhost:8086 venv3/bin/python script.py 11025
wrote 1000 rows
calling read_rows with deadline ...
calling next(...) to start the read operation ...
sent sigstop; iterating through rows (will get stuck) ...
read 1 rows
... (326 similar lines removed) ...
read 328 rows

After this point it will hang forever, or until we resume the Bigtable emulator with kill -CONT 11025. The version with the patch will eventually fail with a DeadlineExceeded exception, after it gets retried a few times.

The good news: this shows that I misunderstood the retry logic because I didn't understand the _on_error handler. The stream request is retried according to the Retry that is passed in. The part that is missing is propagating the deadline to the gRPC call, because this code calls the raw gRPC stub instead of BigtableClient.read_rows.
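
For reference, a condensed, hypothetical paraphrase of that _on_error handler (not the actual library source), showing why retries do work while the deadline is still not propagated:

def _on_error(self, exc):
    # Resume after the last row already read, or reuse the original request.
    retry_request = self._create_retry_request() if self.last_scanned_row_key else self.request
    # Reopen the stream. The new call again carries no gRPC deadline, which
    # is exactly the missing piece described above.
    self.response_iterator = self.read_method(retry_request)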

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020
@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/python-bigtable API. label Jan 31, 2020
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Feb 3, 2020
@crwilcox crwilcox added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 4, 2020
@frankyn frankyn removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Feb 4, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Feb 4, 2020
@kolea2 kolea2 added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 13, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Feb 13, 2020
kolea2 commented Mar 3, 2020

Hi @mf2199! Are you able to take a look at this?

mf2199 commented Mar 3, 2020

@kolea2 Hi there. Yes, I'm on it. The claim about the gapic BigtableClient.read_rows being bypassed in favor of the gRPC BigtableStub seems to be true. A 'quick-and-dirty' fix could be, as the author suggests, to reroute the calls through gapic and see whether that solves the problem. The results are different, however, if the code is run in a Windows environment: execution is immediately terminated with a "grpc_message":"Stream removed" type of error thrown by grpc/_channel/_MultiThreadedRendezvous, with no sign of hanging or waiting for a deadline whatsoever. Why the same code, which ought to work cross-platform, behaves differently remains to be seen. I'm setting up a Linux VM again to try to reproduce the error, which may give better insight.

On a side note, there might be an inconsistency in the documentation where it says "set the BIGTABLE_EMULATOR_HOST environment variable to the host and port where the Cloud Bigtable emulator is running". The Client.__init__() constructor, however, looks for BIGTABLE_EMULATOR and so fails to enter the emulator mode unless the variable name is corrected.
