Long tasks lead to duplicate messages #593
Thanks for filing an issue! I will see if someone has time to look through your code to see where your issue might be coming from, but in the meantime, there's a recently published blog post tackling this subject that includes some options for identifying where your duplicates are coming from, as well as options for resolving them: https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
@meredithslota thanks for the link! I went ahead and implemented a similar deduplication strategy. I used Firestore to monitor duplicate messages. I forgot to mention my duplicates all have the same message IDs, so it looks like my workers are not extending the ack deadline. Could this be because my subprocesses are CPU-intensive, and pubsub isn't able to send the extension request?
Here are more notes and observations. My subscription acknowledgement deadline is set to 600 seconds. I just observed a duplicate message being sent out. The original message was acknowledged 659.52 seconds after it was received. Immediately after acknowledgement (24 msec), a duplicate message was resent. Is there a case where a subscriber would send a bad ack signal back to pubsub that would cause pubsub to resend the message? Thanks,
@meredithslota just following up on this to see if anyone has taken a look. I'm also happy to add logging and/or troubleshoot to see what the issue is. My main concern is that pubsub isn't behaving correctly on a machine under heavy CPU load. Specifically, if the lease extension fails to reach the pubsub service, is there a retry? Or does one failed attempt immediately issue a new message? Thanks
The library will keep modacking the message as long as the maximum lease duration hasn't passed. If a duplicate was given to your callback immediately after you acked the message, it's likely that that duplicate was already queued in the client library while the previous message was being processed. Your flow control settings allow just one message at a time, so this is possible. You can try playing with the `min_duration_per_lease_extension` setting. But to mitigate duplicates, you need to either use the brand-new exactly-once delivery feature, in preview, or keep a look-aside DB like you're doing with Firestore.
@pradn thanks for the tips. I'll test out both `min_duration_per_lease_extension` and the new exactly-once feature. Will report back after running it for a bit.
@jtressle, have you had a chance to test with the new settings?
@pradn I'm about to start testing on our servers. I'll update this in the next couple of days. Thanks and sorry for the delay.
Ok, let us know how it goes!
Hi @pradn, I ended up changing the subscriber initialization to:
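A minimal sketch of what such an initialization plausibly looks like, given the `min_duration_per_lease_extension` suggestion above and the one-message-at-a-time flow control mentioned below (the project and subscription names are placeholders):

```python
from google.cloud import pubsub_v1

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    message.ack()  # stand-in for the real processing

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Request lease extensions of at least 600 seconds, and pull
# only one message at a time.
flow_control = pubsub_v1.types.FlowControl(
    max_messages=1,
    min_duration_per_lease_extension=600,
)

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
```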
I also changed pubsub to only issue exactly one message. I did receive the following error on one of my runs. It happened after an `ack()`. I'll continue to test to see if it repeats. Thanks
@jtressle, this is a bug that has been fixed but hasn't been released yet. We can continue when that happens. I'll let you know.
@pradn, what is the correct way of handling this so it doesn't kill the subscriber and instead just issues a warning? My code is above. I'm using PubSub in Kubernetes workers, and this bug kills my workers and therefore stops my pipeline. If we can't ignore the error, is this an issue with the exactly-once delivery implementation? Or with `min_duration_per_lease_extension`? I can roll back my changes until the fix is released, or until there's a branch I can test. Thanks for your help.
We released version 2.12.0 just now. Please try with that version. This will fix the bug in #593 (comment). Did you enable exactly-once delivery on your subscription?
@pradn, I just started testing now and will update here after the weekend. I enabled exactly-once delivery on the subscription. So far, I've had none of the `NoneType` errors. Thanks for your help.
@pradn I wanted to report back. Version 2.12.0 reduced the number of `NoneType` errors, but I did get three identical errors, all occurring at the same place; the frequency has been reduced.
Error:
My python packages are:
I'm also still getting the same number of duplicates as I had before turning on exactly-once delivery. I had one message run 3 times and another message run 5 times. The min and max lease extensions were set to 600, but I did see one message was sent twice 428 seconds apart. Is there anything else I can change? Also, is there a configuration change, or an earlier pubsub version I can use to isolate the errors above? I have another pub-sub system that has been running for years without this issue, but those tasks are a lot shorter. The pubsub errors unfortunately error out my long-running tasks and also make the Kubernetes workers non-responsive because pubsub stops subscribing. Thanks for your help,
Hi @pradn, I wanted to update you regarding this issue. After disabling exactly-once delivery, all the `AcknowledgeStatus.INVALID_ACK_ID` issues have gone away. Is this expected? Is there a way to have my subscriber immediately ACK the message and then perform my work in a closure block? Once my work is done, I'll turn the subscriber back on to wait for the next message. This would alleviate the issue of running pubsub alongside CPU-intensive, long-running tasks. Thanks and much appreciated,
Hi @jtressle, please take a look at the sample EOD code here: https://cloud.google.com/pubsub/docs/samples/pubsub-subscriber-exactly-once
I believe this should help you with gracefully handling `INVALID_ACK_ID` errors.
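The linked sample handles ack results roughly like this (a sketch based on that sample; the project/subscription names and the timeout are placeholders):

```python
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber import exceptions as sub_exceptions

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # With exactly-once delivery enabled, ack_with_response() returns a
    # future that reports whether the ack actually succeeded.
    ack_future = message.ack_with_response()
    try:
        ack_future.result(timeout=60)
        print(f"Ack for message {message.message_id} succeeded.")
    except sub_exceptions.AcknowledgeError as e:
        # e.error_code is an AcknowledgeStatus such as INVALID_ACK_ID;
        # log it and return instead of letting the exception propagate.
        print(f"Ack for message {message.message_id} failed: {e.error_code}")

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
```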
Some more context around why we want you to use `ack_with_response`:
UPDATE: I'm doing further testing and I've received 11 `INVALID_ACK_ID` errors.
ORIGINAL: Many of the errors happen while I'm processing data, which returned errors to our users. The reason for this is that the data processing is self-contained in the callback function. For context, I have another app that has been using Pub-Sub for years without issue in a GKE environment. Its Pub-Sub failure rate is basically 0%, as I have the exactly-once delivery feature disabled. So I haven't had to add communication between Pub-Sub and my other processes. I have a few questions:
Thanks
@pradn @acocuzzo I wanted to provide more insight based on my testing. When running on a compute instance, I see no `INVALID_ACK_ID` errors. I've updated my subscriber to ignore the invalid errors, but I'm getting multiple retries and multiple runs of the same message. Next steps are to verify GKE resources are configured correctly. If they are, I'll go back to my original configuration and work on starting/stopping Pubsub. Thanks,
In the update in this comment, are you saying that your program crashes when this error occurs? If you look in the code, we call `future.result()` and then catch the resulting exception (an `AcknowledgeError`). When caught, we just log the error. So if your program is crashing / seeing an unexpected exception bubble up, that wasn't the intention and should be investigated.
Cloud Pub/Sub by default always delivers messages until they're acked. An ack is best-effort, so the system might lose the ack; then you get the message again. Moreover, if a message isn't acked within the ack deadline, an ack expiration and re-delivery occur. Note your ack might race with the ack expiration time. Ack status for a message is stored in memory and periodically synced to disk on the server side, so server restarts may lose message ack statuses. So, there are several reasons for re-delivery. When a DLQ is enabled, we keep failure counts for when ack expirations occur and explicit nacks are sent for a message. If the failure count goes above a threshold, we move the message to the DLQ topic. If a DLQ is not enabled, we keep re-delivering messages until the message expires (based on the subscription settings, defaulting to 7 days).
It's up to you what you do with the exceptions returned by `ack_with_response`.
It's possible there are environmental differences that account for that - maybe network latency. But you need to be able to deal with these errors occurring in any environment.
This is expected. The "exactly-once" guarantee is that if you successfully ack the message, it won't come back. These invalid ack errors indicate an unsuccessful ack.
Ok, sounds good. Keep us posted.
@pradn please see below:
What happens when I get the `AcknowledgeError` warning inside my `def worker(...)` callback?
I updated my code to what was specified just above; a sketch of the updated callback follows. Is this the correct way of handling the warning without stopping the callback?
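Presumably something along these lines, with `do_work` standing in for the long-running processing:

```python
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber import exceptions as sub_exceptions

def do_work(data: bytes) -> None:
    ...  # placeholder for the long-running, CPU-heavy processing

def worker(message: pubsub_v1.subscriber.message.Message) -> None:
    do_work(message.data)
    try:
        # Wait for the service to confirm the ack.
        message.ack_with_response().result(timeout=60)
    except sub_exceptions.AcknowledgeError as e:
        # Log a warning instead of letting the exception kill the worker.
        print(f"WARNING: ack for {message.message_id} failed: {e.error_code}")
```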
My subscriber is set as:
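Presumably with flow control along these lines, given the 600-second minimum and 7200-second maximum lease discussed next (`subscriber`, `subscription_path`, and `worker` as in the sketches above):

```python
flow_control = pubsub_v1.types.FlowControl(
    max_messages=1,                        # one message at a time
    min_duration_per_lease_extension=600,  # each modack extends the lease by >= 600 s
    max_lease_duration=7200,               # hold the lease for up to 2 hours
)
streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=worker, flow_control=flow_control
)
```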
My understanding is that I should never get a failure or warning before 600 seconds, and the client should hold the lease for up to 7200 seconds. However, I'm getting duplicates (when a DLQ is not enabled) and `AcknowledgeError` warnings well before 7200 seconds, and even before the 600-second minimum. How can I configure PubSub to adhere to these times?
Can you please provide example code on how to do this? The code I presented above ends the callback when I get the `AcknowledgeError`.
How should one go about handling network latency? If my lease extensions are 600 seconds, shouldn't this be sufficient for PubSub to retry connections? Thanks
Yes, you should be calling `ack_with_response` and catching the `AcknowledgeError`, as in the sample.
It's possible to have a problem with the background modacking process. In your case, the background leaser would send a modack every 600 seconds, up to 7200 seconds. If there's a modack error, we log the error. So when your code finishes working and calls `ack_with_response`, the ack can fail with `INVALID_ACK_ID` because an earlier modack failed and the message's lease had already expired.
There isn't any code after the `ack_with_response` call in the sample, so catching the `AcknowledgeError` and letting the callback end there is fine.
I think setting a long lease timeout is the best you can do. I can't think of other mitigations.
Hi,
@ramirek I never got it to work properly. What I'm currently doing is turning off "exactly-once", living with the duplicates and handling them by logging messages in a database. I think the best solution would be to have pubsub immediately return an ACK, and then take that worker offline until the process has completed.
Hi all, I think this may be an issue we're also dealing with. We take messages and run computations on batches of them before acking them. The batch calculation can be CPU-heavy, and in the exact example I've been testing with it takes about 9 minutes. We're getting tons of redeliveries. The exact test I'm running uses 127 clients with a subscription that has a backlog of 1M messages. The clients process messages at about 3 per second and run for an hour. With this setup, I'm consistently seeing acks get lost (the workers record the IDs of the messages they ack, yet the same messages are redelivered). This is adding a ton of required compute to get through a subscription, and even when the clients run until they can't pull any more messages, messages get left over anyway. Is this issue well understood at all? Unfortunately we really can't use PubSub with our application like this. Happy to give more debugging details; I've turned on the PubSub client library's logging along with timestamps, so there's quite a lot of info to sift through. The modacks seem to be going out at the right time and there's no info reported about the AckRequests, so I'm really not sure where the issue might lie.
@acocuzzo Our team upgraded to v2.13.6 this week, but we do not see any decrease in ack/modack failure rates. We have also forwarded more detail through the support ticket.
@acocuzzo 2.13.6 has not improved the duplicates issue or the ACK retries. Two questions: (1) Is it possible for your team to reproduce the error by running a long-running, CPU-intensive task on a Google Compute Engine instance? Can something like this be added to the release tests? (2) Is there a recommended way to turn a pub-sub worker on/off? In our case, we don't need all the leasing logic that is causing this issue. I'd rather receive a message, immediately ACK that message, turn OFF Pub-sub on the worker to prevent pulling any more messages, have the worker run the long-running task, and then turn ON Pub-sub for the next message. Thanks.
Hi @jtressle, thanks for checking on the new version. (1) I am currently working on a reproduction with processes of 1-2 hours; it would be helpful to know the specific memory/CPU usage for a proper reproduction. I'm not sure we could add this to the release tests, but I can investigate what our options are. (2) If you don't want the leasing logic, you can try using synchronous pull instead of streaming (asynchronous) pull; please see the docs for reference: https://cloud.google.com/pubsub/docs/pull#synchronous_pull One thing I would caution for synchronous pull is that it is necessary to "overpull", meaning you should send many more pull requests than you might expect, as they will often return an empty response or fewer messages than requested rather than waiting for all of the messages requested. This is done to lower delivery latency. For example: https://medium.com/javarevisited/gcp-how-to-achieve-high-performance-synchronous-pull-with-pub-sub-and-spring-boot-12cb220c4d65
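For illustration, a minimal synchronous-pull loop matching the ack-first workflow described above (the names and `run_long_task` are placeholders; acking before the work trades at-least-once safety for avoiding redeliveries):

```python
from google.api_core import exceptions as api_exceptions
from google.cloud import pubsub_v1

def run_long_task(data: bytes) -> None:
    ...  # placeholder for the long-running work

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    while True:
        try:
            # Pull at most one message. Responses may be empty or time out,
            # so keep polling ("overpulling", as described above).
            response = subscriber.pull(
                request={"subscription": subscription_path, "max_messages": 1},
                timeout=30.0,
            )
        except api_exceptions.DeadlineExceeded:
            continue
        if not response.received_messages:
            continue
        received = response.received_messages[0]
        # Ack immediately so no lease management runs during the long task.
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
        )
        run_long_task(received.message.data)
```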
Hi @acocuzzo, this is excellent news! I've noticed it on both our servers that run longer processes. The servers in question are c2-standard-4 and c2-standard-30. The larger one runs a longer process and has more frequent duplicates. For both servers, CPU usage is basically 100% on all cores. One thing to note is that both servers are running in a GKE Docker configuration, and I've set aside about 10% of the CPU for GKE-related tasks. Thanks for the pull documentation. This might be a good alternative. I'll test implementing it and will report back. Thanks.
@jtressle @YungChunLu @hjtran @ramirek
@acocuzzo
@hjtran Unfortunately, without EOD the `INVALID_ACK_ID` errors are not surfaced for us to check.
Thanks @acocuzzo! I appreciate the help. FWIW, I've also been talking with our GCP account manager about this for a few months in case you want to be linked to that; we've gone back and forth trying to debug this for a while. If that's helpful, I can include you on the email chain.
Hi,
I'm still facing the same issues, so no news, I don't think. There have been a couple of new releases of the python client, which I was advised to upgrade to and rerun my load tests, but I'm still seeing lots of expired acks.
Exactly the same problem. I never faced the problem before switching to the "exactly once" delivery system.
Ah, I see the issue without exactly-once. I'm about to run a load test with exactly-once to see if that fixes anything.
Oh, interesting. I was sure this error had appeared in our application because of that option.
@acocuzzo any developments?
@Gwojda I would highly recommend reaching out via Cloud support for help on your individual issue, as causes differ case to case.
Hi,
@acocuzzo I initially submitted a ticket with cloud support right after opening this ticket. Their recommendation was to seek support through this ticket, as Pub/Sub was outside their expertise. Regarding issues, I have two, both related to long-running tasks. With EOD, I get the `INVALID_ACK_ID` issue. With EOD turned OFF, I get duplicate messages being sent before the lease is done. Both of these issues prohibit me from running PubSub reliably. In the end, I'm using a database to track duplicate messages, which works except in the case of some race conditions. First, I think the right solution would be EOD combined with a simplified leasing mechanism; the current leasing logic fails and places optimization over reliability. If `INVALID_ACK_ID` is given only for expired messages, why are any messages expiring before my timeout of 7200 seconds? Alternatively, please just provide a mechanism where we can ignore ALL `INVALID_ACK_ID` messages. I've yet to implement your synchronous pull recommendation, but I believe this may be the best solution for my circumstances (long-running tasks, one message at a time). Thanks,
@Gwojda how are you ignoring the `INVALID_ACK_ID` errors? In my case, they disrupted the process I was running. Thanks
@Gwojda @jtressle @hjtran @YungChunLu In particular, this should reduce:
Currently, the leasing behavior in the library still creates new threads, but not new processes, as communication is needed between threads; we therefore get asynchronous behavior, but not true parallelism. In order to optimize the leasing further, we would need to change the behavior significantly to allow for parallelism, and this fix will require more significant design changes. In particular, we use the Python gRPC library, so we are limited in our ability to parallelize calls to the service within the same client. If your subscription requires high throughput, I recommend either (1) increasing the number of clients and reducing flow control settings, or (2) switching to a library that offers thread parallelism. Thanks, everyone, for your input and patience.
@jtressle
Thanks for the detailed explanation. Will give the new client a try.
Adding a blocked tag pending some exactly-once delivery server-side changes.
Closing this bug, as many of the related issues have been fixed, both client- and server-side:
Remaining issues:
Hi,
I'm having an issue where I have a long process (up to 80 minutes) running on a Kubernetes Docker instance. The instance is running Ubuntu 20.04, and I'm using Python 3.8.10. The Docker container runs a Python worker script, which runs a subprocess. The subprocess is multi-threaded and can use all threads during some CPU-intensive tasks.
I'm getting a lot of duplicates (about 5 to 10 per message). This is repeatable and probably due to the intense CPU usage. What is the correct way to handle this?
My pip versions are:
My worker code is similar to this:
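A minimal sketch of such a worker, with placeholder project/subscription names and a placeholder subprocess command (`./process_task`):

```python
import subprocess

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def worker(message: pubsub_v1.subscriber.message.Message) -> None:
    # CPU-intensive, multi-threaded subprocess; may run for up to ~80 minutes.
    subprocess.run(["./process_task", message.data.decode()], check=True)
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=worker)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```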
Thanks in advance,