Subscription streaming pull requests are disconnecting after 15 minutes #1135
We are still seeing this issue, with the addition of the following being logged to console:
Yikes... so the new behaviour is that it just sort of silently stops receiving messages? Do you happen to have grpc debug output? You can get that by setting the environment variables:
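For anyone trying to capture that output: the variables in question are presumably the standard gRPC tracing pair, GRPC_VERBOSITY and GRPC_TRACE (grpc-js honors the same names, per the comment further down). A minimal sketch of a subscriber to run under those settings might look like the following; the subscription name is a placeholder, not from this thread:

```typescript
// Minimal sketch of a subscriber to run with gRPC debug logging enabled.
// Set the variables in the environment before starting the process, e.g.:
//
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node subscriber.js
//
// ('my-subscription' is a placeholder name.)
import {PubSub} from '@google-cloud/pubsub';

const subscription = new PubSub().subscription('my-subscription');

subscription.on('message', message => {
  console.log(`received ${message.id} at ${new Date().toISOString()}`);
  message.ack();
});

// Surface stream-level problems so a silent stall is easier to spot in the trace.
subscription.on('error', err => console.error('subscription error:', err));
subscription.on('close', () => console.warn('subscription closed'));
```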
And one more question: does your use case involve Docker (GKE or whatnot)? I've found that it seems to be a common factor in this problem.
I think this is what we have been experiencing for the past few years, but it went from …

@feywind this issue is happening with the real PubSub service as well, same conditions. It appears that a quiet connection is closed. It does not happen in production for us (at least not yet) because it stays busy.
My vague recollection of debugging one of the linked issues is that I only saw this happening (and it was really reproducible) when there was a Docker network boundary involved. So emulator in a container, client on GKE, or something like that. I worked with the grpc team to try to figure out why, and I don't think we ever found anything super useful in the grpc traces either. :( We thought it was at least worked around by way of papering over the reconnects, but it sounds like that's not happening either. So I see two issues:
1. For this one, I figured that letting it retry when disconnected would at least roll us back to where we were. It seems like maybe that's not working, though...
2. This one I'm less sure about, but I can get my repro case going again and bug the grpc team to see if they can find anything else here. I still suspect weirdness between the Node HTTP/2 stack and the Docker iptables rules used to do container isolation, but that's admittedly just a hunch so far.
Thanks! We can also work out a non-GitHub-public-issue way to get the logs over, if that helps.
@murgatroid99 might know the …
grpc-js uses the same environment variables, and …
If you want to narrow the trace output, the output format from grpc-js is …
Grasping at straws here, but I wonder if this is related: https://www.hostedgraphite.com/blog/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug
(Still marked …)

Linked to the meta-issue about transport problems: b/242894947
To add a data point here: we're seeing something similar happen in our setup.

Normal operations

The system performs as expected for hours on end; processing keeps up with publishing.

Backlog accumulating

However, zooming out, we observe longer periods of time where the subscriber pool doesn't keep up.

Throughput RCA

We used to have CPU bottlenecks in our Postgres instance, but we successfully solved those by fronting it with Redis. Redis was never a bottleneck, but we did implement some in-memory caching to alleviate the load on Redis as well. And to close the loop, our GKE pool is autoscaling successfully, so no CPU bottlenecks there either. All this to say that we're pretty sure the bottleneck lies elsewhere.

Subscribers stop receiving messages

We added some custom logging to our subscribers, which accumulates counts of how many messages are processed in 10-second intervals. When operating normally, the counts look healthy. We started noticing that when a backlog accumulates, multiple subscribers are not processing any messages. This brings down the processing rate of the subscriber pool, causing the backlog to grow. We added listeners for …

Workaround

As a hail-mary, we decided to implement the workaround mentioned here - basically turning it off and on again 🎉. If a subscriber is detected to not be receiving any events, we close and re-open the subscription (see the sketch after this comment). This has been very successful so far. It has not fixed the problem 100%: the fix was deployed yesterday, Nov 23, and we still observe some backlog spikes today (Nov 24), but the system does seem to recover much more quickly than before the fix.

Theory

I'm not familiar with the inner workings of Pub/Sub, but I have one theory of what might cause this: …

Local debugging

To debug this, I ran one subscriber instance on my local machine. Note: in contrast to the production GKE deployment, this local Node.js process did not run within Docker. I observed this subscriber receiving messages, then dropping down to 0 messages over 10-second intervals. That triggered the restart a couple of times, and at some point it did start receiving messages again. This pattern repeated a couple of times. If my theory is half-correct, I could imagine this subscriber connecting to an empty shard a couple of times, until at some point the restart allowed it to connect to a live shard.

Hope this helps. FWIW, we have a running support contract with GCP, and I'm open to hopping on a call and poking at our system with somebody from GCP.
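For reference, here is a minimal sketch of the kind of "turn it off and back on again" watchdog described in the Workaround section above, assuming the Node.js @google-cloud/pubsub client; the subscription name, thresholds, and helper names are illustrative, not the commenter's actual code:

```typescript
import {PubSub, Subscription, Message} from '@google-cloud/pubsub';

const STALL_THRESHOLD_MS = 60_000; // restart if no message arrives for 60s (illustrative)
const CHECK_INTERVAL_MS = 10_000;  // matches the 10-second counting interval above

const pubsub = new PubSub();
let subscription: Subscription;
let lastMessageAt = Date.now();

function handleMessage(message: Message): void {
  lastMessageAt = Date.now();
  // ...application-specific processing goes here...
  message.ack();
}

function openSubscription(): void {
  subscription = pubsub.subscription('my-subscription'); // placeholder name
  subscription.on('message', handleMessage);
  subscription.on('error', err => console.error('subscription error:', err));
}

async function restartSubscription(): Promise<void> {
  console.warn('no messages received recently; restarting the streaming pull');
  await subscription.close(); // tears down the underlying streaming pull connections
  openSubscription();
  lastMessageAt = Date.now();
}

openSubscription();

// Watchdog: if the stream has gone quiet for too long, close and re-open it.
setInterval(() => {
  if (Date.now() - lastMessageAt > STALL_THRESHOLD_MS) {
    restartSubscription().catch(err => console.error('restart failed:', err));
  }
}, CHECK_INTERVAL_MS);
```

Note that on a genuinely quiet subscription this restarts needlessly, so the threshold has to be tuned to expected traffic; that is part of why this only makes sense as a workaround.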
@edorivai Thanks for the really detailed comment! I think we're seeing this in situations other than just ordering keys, but let me tag in @kamalaboulhosn in regards to the service-side speculation. (Kamal: look for the Theory heading above) Most of the related team is off for US holidays right now, so it may be a bit. We still haven't been able to find the fire to go with the smoke on this issue, so I am starting to think maybe we should just temporarily implement the workaround everyone is using anyway ("have you tried turning off and back on again?" :) I don't think it's something we should encourage for regular use or leave in there indefinitely, but I also don't like users having to deal with it for so long.
This isn't exactly how Pub/Sub's sharding of ordering keys works on the subscribe side. The only trigger that would cause changing the association of ordering keys with a subscriber would be the addition of new subscriber clients that need to receive messages. That would result in no messages being sent out on some set of keys until the outstanding messages for those keys are acknowledged. However, this is not tied to the publishing of messages, so the shard does not need to be empty. It is possible for subscribers to be assigned a set of keys that have no publishes as we don't balance keys based on load to subscribers. This would depend greatly on the number of subscriber clients and the diversity of ordering keys used.
@kamalaboulhosn thank you for that context! We explicitly tested whether auto-scaling (changing the number of subscribers) caused these throughput issues. We basically ran our GKE workload on a fixed number (12) of pods. Even under those constraints, we saw a lot of periods where many subscribers (more than 50% of the pool) would receive no messages. Additionally, in our case, the number of ordering keys is fairly constant over time.
This issue has covered a lot of different causes and investigations that are not entirely related, including issues around ordered, unordered, and exactly-once subscriptions. Going forward, if anyone is still experiencing issues, please open a support case. Thanks!
Why was this closed? I mean, closing and reopening the subscription every 15 minutes does work, but only as a workaround - I am very skeptical that this is good practice.
I don't really get why this was closed. This is still an issue; I'm getting the same error using …

I have all the listeners properly implemented ('error', 'close'), yet when the error happens it is not handled by any of them, and the client just silently stops listening. I think turning it off and on again is not a very clean solution for this, especially in production.
I did a lot of investigation into this issue and decided to open up a follow-up issue with my findings: #1885
This is also an issue with the .NET client (PubSub.V1), but only with OrderingKey (we have only a few) - so it has to be a core issue related to that. I think this needs to be escalated to the core Pub/Sub service, not the client libs.
@philvmx1 we also use the .NET client, but with no ordering keys; same issue for us there. At least in .NET we can detect it and reconnect.
@jeffijoe can you share a snippet of how you are doing that please?
Looking at the source code, I wonder if the server is hanging up, causing IsRpcCancellation and thus breaking the while loop.
WDYT?
I just wrap the subscribe call in a …
Any updates? I've been searching for months for a solution. The Pub/Sub client just stops listening. I've tried multiple service modules; I have a barebones test going, and it drops within 6-8 hours every time. Nothing in the logs or syslog - it just simply stops listening. I don't have many more moves left before we have to abandon Pub/Sub. Worst fear with these services: it just simply doesn't work, with no clues or solutions.
I also experienced random drops after a while. |
In case this helps someone: I was finally able to fix this by creating a separate package.json with only the packages needed for my services. Somewhere in the dependencies of dependencies, there was a package causing it to stop working.
@lando2319, very interesting! Would you be able to share a …
This started out of this issue: #979
I just reverted the change that exposed the stream cancellations, but they are still happening under the covers. I think it would behoove us to figure out why. The commonality that I've seen so far is that it generally involves crossing a Docker networking boundary. A working reproduction is available on that issue above, though it only seems to happen on grpc-js, so it might also have something to do with the Node HTTP/2 stack.
Several question marks here, and I just don't want to lose track of that investigation, even if we fix the user breakage for now.