Bigtable: read_rows: no deadline causing stuck clients; no way to change it #6
Any update on this?
We have tried to reproduce the issue, starting the emulator on localhost:8086 and following the steps listed in the description. The statement `os.kill(emulator_pid, signal.SIGSTOP)` in the script tries to stop the emulator to check for the deadline error. Instead, the script hangs with no error messages or output, right after executing that line. However, when we open another terminal and run the script from there against the same emulator, we receive a 504 Deadline Exceeded error. This happens after about 10s, even though the deadline parameter is set to 5s. At this point it is unclear whether the problem is with the client or the emulator, since the same code run in two instances of the same terminal produces inconsistent outcomes.
@mf2199 That is exactly the bug: the Python script hangs forever, despite attempting to set a deadline of 5 seconds, as specified by the documentation. The following alternative version "monkey patches" the library to set the deadline on the gRPC call. You will notice this one eventually fails with `DeadlineExceeded`:

```python
from google.api_core import exceptions
from google.cloud import bigtable
from google.cloud.bigtable_v2.gapic.transports import bigtable_grpc_transport
from google.rpc import code_pb2
import google.oauth2.credentials
import os
import signal
import sys

COLUMN_FAMILY_ID = 'column_family_id'
EMULATOR_ENVVAR = 'BIGTABLE_EMULATOR_HOST'
NUM_ROWS = 1000


def patched_read_rows_property(bigtable_transport, request):
    # Call the underlying gRPC stub directly so we can pass a timeout.
    real_read_rows = bigtable_transport._stubs['bigtable_stub'].ReadRows
    print('patched_read_rows_property calling {} with timeout=0.5'.format(
        repr(real_read_rows)))
    return real_read_rows(request, timeout=0.5)


def main():
    # Don't require real Google credentials to run this test.
    fake_credentials = google.oauth2.credentials.Credentials('fake_token')
    if os.environ.get(EMULATOR_ENVVAR, '') == '':
        raise ValueError('Must specify ' + EMULATOR_ENVVAR)
    emulator_pid = int(sys.argv[1])

    client = bigtable.Client(project='testing', admin=True,
                             credentials=fake_credentials)
    instance = client.instance('emulator')

    # Create/open a table.
    table = instance.table('emulator_table')
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # Write a bunch of data.
    for i in range(NUM_ROWS):
        k = 'some_key_{:04d}'.format(i)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None
    print('wrote {:d} rows'.format(NUM_ROWS))

    print('patching BigtableGrpcTransport ...')
    bigtable_grpc_transport.BigtableGrpcTransport.read_rows = patched_read_rows_property

    print('calling read_rows with deadline ...')
    rows = table.read_rows(
        retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    print('calling next(...) to start the read operation ...')
    r1 = next(rows_iter)

    # Pause the emulator so the stream stalls mid-read.
    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop; iterating through rows (will get stuck) ...')
    count = 0
    for r in rows_iter:
        count += 1
    print('read {:d} rows'.format(count))
    print('done')


if __name__ == '__main__':
    main()
```
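Note the mechanics of the patch: it bypasses `PartialRowsData` and its `retry` wrapper entirely and passes `timeout=0.5` straight to the gRPC stub's `ReadRows` call, which is why this version eventually surfaces `DeadlineExceeded` while the unpatched script hangs.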
Actually, when I read your comment again, it sounds like you are getting a different result. The result I get when executing the above test is that the script reads the first row and then gets stuck iterating. After this point it hangs forever, or until we resume the Bigtable emulator with `SIGCONT`. The good news: this shows that I had misunderstood part of the retry logic.
Hi @mf2199! Are you able to take a look at this?
@kolea2 Hi there. Yes, I'm onto it, starting with the claim about the gapic `BigtableClient.read_rows`. On a side note, there might be an inconsistency in the documentation where it says to set the deadline via the `retry` parameter.
`Table.read_rows` does not set any deadline, so it can hang forever if the Bigtable server connection hangs. We see this happening once every week or two when running inside GCP, which causes our server to get stuck indefinitely. There appears to be no way in the API to set a deadline, even though the documentation says that the `retry` parameter should do this. Due to a bug, it does not.
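For reference, a minimal sketch of the call pattern the documentation implies should bound the read (assuming `table` is an already-opened `Table`; the 5-second value is illustrative):

```python
from google.cloud.bigtable.table import DEFAULT_RETRY_READ_ROWS

# Per the docs, this deadline should cap the whole read_rows operation,
# but it never reaches the underlying gRPC streaming call.
rows = table.read_rows(retry=DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
for row in rows:
    pass  # can hang here indefinitely if the server connection stalls
```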
Details:

We are calling `Table.read_rows` to read ~2 rows from Bigtable. Using pyflame on a stuck process, both worker threads were waiting on Bigtable, with the stack trace below. I believe the bug is the following:

1. `Table.read_rows` calls `PartialRowsData`, passing the `retry` argument, which defaults to `DEFAULT_RETRY_READ_ROWS`. The default misleadingly sets `deadline=60.0`. It also passes `read_method=self._instance._client.table_data_client.transport.read_rows` to `PartialRowsData`, which is a method on `BigtableGrpcTransport`.
2. `PartialRowsData.__init__` calls `read_method()`; this is actually the raw gRPC `_UnaryStreamMultiCallable`, not the gapic `BigtableClient.read_rows`, which AFAICS is never called. Hence, this gRPC streaming call is started without any deadline.
3. `PartialRowsData.__iter__` calls `self._read_next_response`, which calls `return self.retry(self._read_next, on_error=self._on_error)()`. This gives the impression that `retry` is used, but if I understand gRPC streams correctly, I'm not sure that even makes sense. I think even if the gRPC stream returns an error, calling `next` won't actually retry the RPC; it will just immediately raise the same exception (see the sketch after this list). To retry, I believe you need to actually restart the stream by calling `read_rows` again.
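A minimal sketch of point 3, using a hypothetical `DeadStream` class to stand in for a failed gRPC response stream (the class and values are illustrative, not the library's code): once the stream has terminated with an error, every `next()` re-raises it immediately, so wrapping `next()` in a `Retry` only re-raises until the retry deadline expires instead of re-issuing the RPC.

```python
from google.api_core import exceptions, retry


class DeadStream:
    """Hypothetical stand-in for a gRPC response stream that has failed."""

    def __iter__(self):
        return self

    def __next__(self):
        # Once a streaming call has terminated, its iterator keeps
        # raising the original error; it never re-opens the stream.
        raise exceptions.DeadlineExceeded('stream already failed')


stream = DeadStream()
wrapped = retry.Retry(
    predicate=retry.if_exception_type(exceptions.DeadlineExceeded),
    deadline=5.0,
)(lambda: next(stream))

# Retrying next() does not restart the RPC: it keeps polling the same
# dead iterator until the retry deadline expires, then raises RetryError.
wrapped()
```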
Possible fix:

Change `Table.read_rows` to call the gapic `BigtableClient.read_rows` with the `retry` parameter, and change `PartialRowsData.__init__` to take this response iterator and not take a `retry` parameter at all. This would at least allow setting the gRPC streaming call deadline, although I don't think it will make retrying work (since I think the gRPC streaming client just immediately returns an iterator without actually waiting for a response from the server?).

I haven't actually tried implementing this to see if it works. For now, we will probably just make a raw gRPC read_rows call so we can set an appropriate timeout, along the lines of the sketch below.
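A rough sketch of that workaround, reusing the client's existing transport and stub (the helper name is hypothetical; `table` is an open `Table`; module paths match the 0.33.0-era generated code and may differ in other versions, and the private attributes are the same ones the library itself uses, so this is brittle by design):

```python
from google.cloud.bigtable_v2.proto import bigtable_pb2


def read_rows_with_timeout(table, timeout=5.0):
    """Hypothetical helper: call ReadRows on the raw gRPC stub with a timeout."""
    transport = table._instance._client.table_data_client.transport
    request = bigtable_pb2.ReadRowsRequest(table_name=table.name)
    # timeout= becomes the gRPC deadline for the entire streaming call.
    return transport._stubs['bigtable_stub'].ReadRows(request, timeout=timeout)


for response in read_rows_with_timeout(table):
    # Each response carries raw cell chunks (no row merging, no retry);
    # a stalled server now raises DeadlineExceeded instead of hanging.
    print(response)
```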
Environment details
- OS: Linux, ContainerOS (GKE); container is Debian 9 (using distroless)
- Python: 3.5.3
- API: google-cloud-bigtable 0.33.0
Steps to reproduce
This program loads the Bigtable emulator with 1000 rows, calls `read_rows(retry=DEFAULT.with_deadline(5.0))`, then sends `SIGSTOP` to pause the emulator. This SHOULD cause a `DeadlineExceeded` exception to be raised after 5 seconds. Instead, it hangs forever.

```
gcloud beta emulators bigtable start
ps ax | grep cbtemulator
BIGTABLE_EMULATOR_HOST=localhost:8086 python3 bug.py $PID
```
Stack trace of hung server (using a slightly older version of the google-cloud-bigtable library):