Pub/Sub: publish message hangs waiting for previous publish to timeout #8036

Closed
asnr opened this issue May 20, 2019 · 2 comments · Fixed by #8234
Labels: api: pubsub, triaged for GA, type: question

Comments

asnr commented May 20, 2019

Environment details

OS type and version: 4.9.125-linuxkit GNU/Linux
Python version and virtual environment information: Python 2.7.16
google-cloud-pubsub package version: 0.41.0

Steps to reproduce

  1. Publish a (first) message to Pub/Sub that fails.
  2. Time out on the future object returned by the call to publish(), before the gRPC publish call in the batch thread returns.
  3. Publish a (second) message to the same topic. This call hangs until the previous gRPC publish call in the batch thread returns. As the default timeout is currently 10 minutes (!), this can take 10 minutes to return.
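
Step 2 depends on the fact that timing out while waiting on a future only abandons the wait; the underlying work keeps running. A minimal sketch of that behaviour, using the standard library's `concurrent.futures` rather than the Pub/Sub client. The `slow_rpc` function and the durations are hypothetical stand-ins for the stuck gRPC publish call and its timeout:

```python
import concurrent.futures
import threading
import time

rpc_finished = threading.Event()


def slow_rpc():
    # Hypothetical stand-in for a gRPC publish call that is stuck on the network.
    time.sleep(0.5)
    rpc_finished.set()
    return "message-id"


executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_rpc)

timed_out = False
try:
    # The caller gives up waiting after 0.1 s ...
    future.result(timeout=0.1)
except concurrent.futures.TimeoutError:
    timed_out = True

# ... but the call itself has not been cancelled and is still in flight.
still_running = not rpc_finished.is_set()

# Waiting again without a timeout eventually yields the result.
result_after_wait = future.result()
executor.shutdown()

print(timed_out, still_running, result_after_wait)
```

So after the timeout in step 2, whatever the background thread was doing (and whatever it was holding) is still in play when step 3 runs.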

Hanging for 10 minutes is surprising behaviour for an asynchronous API.

Code example

See this gist for code and instructions on how to reproduce this issue.

Wot I think is going on here

This is wild conjecture that I have no supporting evidence for.

That being said, I think the issue starts when the batch thread gets stuck in this call to grpc publish. At this point it is holding the lock _state_lock and will continue to hold it for 10 minutes, until the call to grpc publish times out.

When the client application calls publish() in the main thread for the second time, it will try to acquire the same lock _state_lock. As this lock is already being held by the batch thread, the main thread hangs and doesn't return from the call to publish().
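
The suspected mechanism can be modelled without Pub/Sub at all. The sketch below is not the library's code: `ToyBatch`, its methods, and the 0.5 s sleep are hypothetical stand-ins, with only the lock name `_state_lock` borrowed from the description above. It shows a second `publish()` blocking for as long as the commit thread holds the lock across a slow call:

```python
import threading
import time


class ToyBatch:
    """Hypothetical model of a batch that holds its lock across a slow call."""

    def __init__(self, rpc_seconds):
        self._state_lock = threading.Lock()
        self._messages = []
        self._rpc_seconds = rpc_seconds
        self.commit_started = threading.Event()

    def publish(self, message):
        # Called from the application (main) thread.
        with self._state_lock:
            self._messages.append(message)

    def commit(self):
        # Runs in the batch thread. The lock stays held for the whole
        # duration of the simulated RPC, which is the suspected problem.
        with self._state_lock:
            self.commit_started.set()
            time.sleep(self._rpc_seconds)  # stand-in for the stuck gRPC call
            self._messages = []


def time_second_publish(batch):
    batch.publish(b"first")
    worker = threading.Thread(target=batch.commit)
    worker.start()
    batch.commit_started.wait()  # the batch thread now owns the lock

    start = time.monotonic()
    batch.publish(b"second")  # blocks until commit() releases the lock
    blocked_for = time.monotonic() - start
    worker.join()
    return blocked_for


blocked_for = time_second_publish(ToyBatch(rpc_seconds=0.5))
print("second publish blocked for %.2f s" % blocked_for)
```

With the simulated RPC taking 0.5 s the second call blocks for roughly that long; with a 10-minute gRPC timeout the same pattern would block for 10 minutes.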

sduskis added the api: pubsub, priority: p1, triaged for GA, and type: bug labels on May 20, 2019
sduskis added the type: question label and removed the priority: p1, triaged for GA, and type: bug labels on May 20, 2019

plamut (Contributor) commented Jun 5, 2019

@asnr Thank you for the effort and the detailed steps to reproduce the issue.

I can confirm that the issue is reproducible, either by using the linked Docker application, or by simply disabling the internet connection and running the test publisher script (without creating the topic and subscription, that is).

The cause of the long delay is that the lock in the underlying batch (an object that batches publish requests) is held for too long. It also turned out that the fix for it is essentially the same as #7686.
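
A common way to avoid holding a lock for too long is to take it only while swapping out shared state and to make the slow call after releasing it. The sketch below illustrates that general pattern; it is not the actual patch from #7686 or the follow-up PR, and `ToyBatch`, its attributes, and the 0.5 s sleep are hypothetical:

```python
import threading
import time


class ToyBatch:
    """Hypothetical batch that releases its lock before the slow call."""

    def __init__(self, rpc_seconds):
        self._state_lock = threading.Lock()
        self._messages = []
        self._rpc_seconds = rpc_seconds
        self.rpc_started = threading.Event()
        self.sent = []

    def publish(self, message):
        with self._state_lock:
            self._messages.append(message)

    def commit(self):
        # Hold the lock only long enough to take ownership of the pending
        # messages and reset the shared list.
        with self._state_lock:
            pending, self._messages = self._messages, []
        # The slow call (stand-in for gRPC publish) runs outside the lock.
        self.rpc_started.set()
        time.sleep(self._rpc_seconds)
        self.sent.extend(pending)


batch = ToyBatch(rpc_seconds=0.5)
batch.publish(b"first")
worker = threading.Thread(target=batch.commit)
worker.start()
batch.rpc_started.wait()  # the simulated RPC is now in flight

start = time.monotonic()
batch.publish(b"second")  # returns promptly: the lock is free
blocked_for = time.monotonic() - start
worker.join()

print("second publish blocked for %.4f s" % blocked_for)
```

In this toy model the second message simply waits in the shared list for a later commit; a real batch would also need to decide whether late messages join the in-flight batch or start a new one.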

I will open a follow-up PR that also includes tests, and mention the creators of the original PR as co-authors.

plamut (Contributor) commented Jun 6, 2019

@Dan4London Just FYI, the pull request for this issue that you reported in the other thread has been created.
