Core: Race condition in bidi.BackgroundConsumer? #4
Comments
@plamut I'm in agreement with your analysis here: the stdlib's docs highlight the problem clearly.
The "obvious" fix:
--- a/api_core/google/api_core/bidi.py
+++ b/api_core/google/api_core/bidi.py
@@ -642,14 +642,15 @@ class BackgroundConsumer(object):
# In the future, we could use `Condition.wait_for` if we drop
# Python 2.7.
with self._wake:
- if self._paused:
+ while self._paused:
_LOGGER.debug("paused, waiting for waking.")
self._wake.wait()
_LOGGER.debug("woken.")
- _LOGGER.debug("waiting for recv.")
- response = self._bidi_rpc.recv()
- _LOGGER.debug("recved response.")
+ _LOGGER.debug("waiting for recv.")
+ response = self._bidi_rpc.recv()
+ _LOGGER.debug("recved response.")
+
self._on_response(response)
except exceptions.GoogleAPICallError as exc:
passes all unit tests in
$ .nox/system-3-7/bin/py.test --verbose tests/system.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-5.0.1, py-1.8.0, pluggy-0.12.0 -- /.../firestore/.nox/system-3-7/bin/python3.7
cachedir: .pytest_cache
rootdir: /.../firestore
collected 25 items
tests/system.py::test_collections PASSED [ 4%]
tests/system.py::test_create_document PASSED [ 8%]
tests/system.py::test_create_document_w_subcollection PASSED [ 12%]
tests/system.py::test_cannot_use_foreign_key PASSED [ 16%]
tests/system.py::test_no_document PASSED [ 20%]
tests/system.py::test_document_set PASSED [ 24%]
tests/system.py::test_document_integer_field PASSED [ 28%]
tests/system.py::test_document_set_merge PASSED [ 32%]
tests/system.py::test_document_set_w_int_field PASSED [ 36%]
tests/system.py::test_document_update_w_int_field PASSED [ 40%]
tests/system.py::test_update_document PASSED [ 44%]
tests/system.py::test_document_get PASSED [ 48%]
tests/system.py::test_document_delete PASSED [ 52%]
tests/system.py::test_collection_add PASSED [ 56%]
tests/system.py::test_query_stream PASSED [ 60%]
tests/system.py::test_query_unary PASSED [ 64%]
tests/system.py::test_collection_group_queries PASSED [ 68%]
tests/system.py::test_collection_group_queries_startat_endat PASSED [ 72%]
tests/system.py::test_collection_group_queries_filters PASSED [ 76%]
tests/system.py::test_get_all PASSED [ 80%]
tests/system.py::test_batch PASSED [ 84%]
tests/system.py::test_watch_document PASSED [ 88%]^C
...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/Python-3.7.1/lib/python3.7/threading.py:241: KeyboardInterrupt
and
$ .nox/system-3-7/bin/py.test --verbose tests/system.py tests/system/
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-5.0.1, py-1.8.0, pluggy-0.12.0 -- /.../pubsub/.nox/system-3-7/bin/python3.7
cachedir: .pytest_cache
rootdir: /.../pubsub
collected 12 items
tests/system.py::test_publish_messages PASSED [ 8%]
tests/system.py::test_subscribe_to_messages PASSED [ 16%]
tests/system.py::test_subscribe_to_messages_async_callbacks ^C
...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/Python-3.7.1/lib/python3.7/threading.py:241: KeyboardInterrupt
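The comment in the diff above mentions `Condition.wait_for` as a future option once Python 2.7 is dropped. A minimal sketch of that form follows, assuming Python 3 only. `PausableGate` and its method names are hypothetical and not part of `google.api_core.bidi`; they only mirror the `_wake` / `_paused` pair from the diff. `wait_for()` re-evaluates its predicate after every wakeup, so it behaves like the `while self._paused` loop.

```python
import threading


class PausableGate:
    """Hypothetical stand-in for the pause/resume state shown in the diff."""

    def __init__(self):
        self._wake = threading.Condition()
        self._paused = False

    def pause(self):
        with self._wake:
            self._paused = True

    def resume(self):
        with self._wake:
            self._paused = False
            self._wake.notify_all()

    def wait_until_resumed(self, timeout=None):
        # wait_for() re-checks the predicate after every wakeup, which is
        # equivalent to `while self._paused: self._wake.wait()`.
        # It returns the last predicate value, so False means "timed out".
        with self._wake:
            return self._wake.wait_for(lambda: not self._paused, timeout=timeout)


gate = PausableGate()
gate.pause()

# While paused, the wait times out and reports that the gate is still closed.
still_paused = not gate.wait_until_resumed(timeout=0.05)

# Resume from another thread shortly afterwards; the wait then succeeds.
threading.Timer(0.05, gate.resume).start()
resumed = gate.wait_until_resumed(timeout=5)
```

This only covers the wait itself. It does not close the window between leaving the `with` block and calling `recv()`, which the diff addresses by moving `recv()` inside the block.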
Did a quick check (for Update:
@plamut believes that googleapis/google-cloud-python#8883 did not completely resolve the race, so I updated it to avoid closing this issue.
PR googleapis/google-cloud-python#9337 deals with the race / hang by timing out the call to join the daemonized worker thread, but it doesn't really fix the issue here. I've filed grpc/grpc#20562 to ask that the
UPDATE: PR googleapis/google-cloud-python#9337 was merged, but it's not immediately clear whether grpc/grpc#20562 was addressed. Pinging this issue so we consider it in future planning.
Pinged grpc/grpc#20562 to see whether that was implemented, which would unblock us from implementing the rest of the solution here.
It looks like grpc/grpc#20562 won't happen this quarter at least. We're waiting on that.
Environment details
Any OS, any Python version, google-api-core==1.9.0.
Steps to reproduce
No actual steps, spotted this in bidi.py while working on a different issue. The "Am I paused?" snippet in BackgroundConsumer._thread_main() checks self._paused with a single if statement and, if it is set, calls self._wake.wait().
The _paused state can be set / cleared by the pause() and resume() methods, respectively.
If paused, that snippet blocks at the _wake.wait() call, and is unblocked some time after _wake.notifyAll() is invoked in the resume() method. When the resume() method notifies the waiting threads and releases the internal lock held by the self._wake condition, _wake.wait() tries to re-obtain that lock.
Now, suppose that some other thread invokes the pause() method in the meantime, and the latter obtains self._wake's lock before _wake.wait() can grab it. The _paused flag will again be set to True, and when _wake.wait() finally acquires the lock and resumes, the code will invoke _bidi_rpc.recv() in the paused state.
For this reason the self._paused condition must be checked in a loop, and not in a single if statement. As the docs on threading.Condition explain, wait() can return after an arbitrarily long time, by which point the condition that prompted the notify() call may no longer hold.
How much of a problem is this, i.e. invoking recv() in a paused state?
Edit:
The same can happen when exiting the with self._wake block. The lock gets released, pause() can acquire it and set self._paused to True, but _bidi_rpc.recv() will nevertheless be called, because it is placed outside the with block.
Code example
I created a demo script that demonstrates why if <condition> is not sufficient and is subject to race conditions (requires Python 3.6+). The "worker" thread waits until a shared global number is set to 10, while five "changer" threads randomly change that number, and notify other threads if they change it to 10.
Sometimes the "worker" is woken too late, after another "changer" thread has changed 10 to something else, causing the "worker" to continue running when the target condition is no longer true. Replacing if not <condition> with while not <condition> gets rid of the bug.
The expected behavior is that the worker thread always prints out "I see the number 10" (the desired condition), and never "I see <non 10>". In other words, it should only proceed when the condition number == 10 is currently fulfilled.
In the bidi.py case, this would translate to BackgroundConsumer._thread_main() only resuming when self._paused == False, and never resuming when self._paused == True.
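The demo script itself is not reproduced in this text. What follows is a rough, deterministic sketch of the same idea, not the original script: all names (`run_worker`, `state`, `wait_once`) are hypothetical. Instead of five random changers, a single changer sets the number to 10, notifies, and changes it away again within one lock hold. That is assumed to be the same interleaving the worker would observe if a second changer won the lock before the worker reacquired it.

```python
import threading

TARGET = 10


def run_worker(use_loop):
    """Run one worker/changer pair; return the number the worker proceeds with."""
    cond = threading.Condition()
    state = {"number": 0, "waiting": False, "wakeups": 0}
    seen = []

    def wait_once():
        cond.wait()
        state["wakeups"] += 1
        cond.notify_all()  # let the changer know the worker has woken up

    def worker():
        with cond:
            state["waiting"] = True
            cond.notify_all()
            if use_loop:
                while state["number"] != TARGET:  # re-checked after every wakeup
                    wait_once()
            elif state["number"] != TARGET:  # checked only once
                wait_once()
            seen.append(state["number"])

    def changer():
        with cond:
            while not state["waiting"]:
                cond.wait()
            state["number"] = TARGET
            cond.notify_all()
            # Stands in for another changer winning the lock before the worker.
            state["number"] = 3
        with cond:
            # Wait until the worker has woken at least once, then hit the target.
            while state["wakeups"] < 1:
                cond.wait()
            state["number"] = TARGET
            cond.notify_all()

    threads = [threading.Thread(target=worker), threading.Thread(target=changer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return seen[0]


for use_loop in (False, True):
    label = "while" if use_loop else "if"
    print(f"{label}: worker proceeds with number {run_worker(use_loop)}")
```

With the single `if`, the worker proceeds with 3 even though it was woken for 10. With the `while` loop, it goes back to waiting and only proceeds once the number is 10 again.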