Nats server pod gets stuck in KV stream catchup after restart #5205
Comments
Observing the same issue on the latest NATS version as well. Tested versions:
We have ~20 streams, each receiving multiple writes.
The workaround we found so far to get a node back to a healthy state without data loss is to temporarily change the replication factor for the affected stream (from 3 to 1), then restore the original replication factor.
This way the affected node no longer needs to host the stream, so it can finish initialisation.
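For reference, a rough sketch of that workaround with the nats CLI (the stream name is a placeholder, and flags may differ slightly between CLI versions):

```sh
# Temporarily drop the affected stream to a single replica so the stuck
# node no longer needs to host it, then restore the original replication.
nats stream edit MY_STREAM --replicas=1 --force
# ...wait for the node to finish initialising and report healthy...
nats stream edit MY_STREAM --replicas=3 --force
# For a KV bucket, edit its backing stream, which is named KV_<bucket>.
```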
Same issue on nats-2.10.12. I went from 1 replica to 2 replicas for my stream and it helped.
We have experienced the same issue in multiple environments for interest-based JetStream (not KV).

Environment information

Symptom

We noticed that the symptom seems to be caused by the

Reproduce

It's relatively easy to reproduce this issue in a 3-replica Kubernetes-hosted NATS cluster with JetStream enabled.

Actual behaviour

Expected behaviour

Mitigation

As @bondar-pavel suggested, editing the stream replicas to 1 and then back to 3 seems to be the only mitigation without losing data or causing downtime.

Notes

Stream configuration

{
"config": {
"name": "<redacted>",
"subjects": [
"<redacted>"
],
"retention": "interest",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 0,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 10000000000,
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": true,
"mirror_direct": false
},
"created": "2024-03-28T17:59:55.75752643Z",
"state": {
"messages": 1372,
"bytes": 127596,
"first_seq": 11761831,
"first_ts": "2024-03-28T19:27:57.692021989Z",
"last_seq": 11763599,
"last_ts": "2024-03-28T19:27:58.395525207Z",
"num_deleted": 397,
"num_subjects": 1,
"consumer_count": 1
},
"cluster": {
"name": "kubernetes-nats",
"leader": "kubernetes-nats-2",
"replicas": [
{
"name": "kubernetes-nats-0",
"current": true,
"active": 196170
},
{
"name": "kubernetes-nats-1",
"current": true,
"active": 220688
}
]
},
"ts": "2024-03-28T19:27:58.395990925Z"
}

Consumer configuration

{
"stream_name": "<redacted>",
"name": "<redacted>",
"config": {
"ack_policy": "explicit",
"ack_wait": 30000000000,
"deliver_policy": "new",
"durable_name": "<redacted>",
"name": "<redacted>",
"max_ack_pending": 20000,
"max_deliver": 5,
"max_waiting": 512,
"replay_policy": "instant",
"num_replicas": 0
},
"created": "2024-03-28T17:59:56.100249122Z",
"delivered": {
"consumer_seq": 38939132,
"stream_seq": 12980344,
"last_active": "2024-03-28T19:35:54.238507718Z"
},
"ack_floor": {
"consumer_seq": 38933971,
"stream_seq": 12978586,
"last_active": "2024-03-28T19:35:54.197600505Z"
},
"num_ack_pending": 1545,
"num_redelivered": 1121,
"num_waiting": 4,
"num_pending": 0,
"cluster": {
"name": "kubernetes-nats",
"leader": "kubernetes-nats-2",
"replicas": [
{
"name": "kubernetes-nats-0",
"current": true,
"active": 13341
},
{
"name": "kubernetes-nats-1",
"current": true,
"active": 146374
}
]
},
"ts": "2024-03-28T19:35:54.238705331Z"
}
We have some improvements coming in 2.10.14 around this which hopefully can help. If you are feeling adventuresome feel free to grab a binary from the
I've been playing with @derekcollison's dev build from
As shown in the graph, the
Below are some error messages from the first dev build pod during rollout.
We will start cleaning up and cherry-picking into main and then into 2.10.14. Thanks for checking though!
Note that 2.10.14 has now been released (and 2.10.15 should follow very soon).
(Copying the reply from NATS Slack.) I just repeated my chaos tests with 2.10.14: basically a fast-moving stream with a publisher (nats pub) and an interest-based consumer, while rolling-restarting my 3-replica StatefulSet of NATS. Overall it's been a LOT better. Previously, almost 100% of the time the first_seq number of the fast-moving stream would go out of sync; in particular the nats-2 instance that was restarted first would almost always have its

Note that the first rolling restart did reproduce the

Overall I'm really happy with 2.10.14. If it goes well it would address a major operational pain point of JetStream. I'll deploy it to prod soon and let you know if anything comes up. Thanks again for the hard work!
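For anyone wanting to repeat a similar chaos test, a minimal sketch (the subject, stream, and StatefulSet names are assumptions, not taken from the thread):

```sh
# Keep a fast-moving stream busy with a simple publisher loop...
while true; do nats pub orders.test "payload-$(date +%s%N)"; done &

# ...and rolling-restart the 3-replica NATS StatefulSet while it publishes.
kubectl rollout restart statefulset/nats
kubectl rollout status statefulset/nats

# Afterwards, check whether first_seq/last_seq and the replica states agree.
nats stream info ORDERS --json
```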
After restarting the cluster nodes one by one on 2.10.14 we lost 60% of KV storage. Reproduced on 1 of 10 clusters after restart.
What does stream info for the underlying KV show?
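For anyone following along, the backing stream of a KV bucket is named KV_<bucket>, so it can be inspected with something like (the bucket name is a placeholder):

```sh
nats kv info my-bucket
nats stream info KV_my-bucket --json
```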
01 node
02 node
You can look at the last 10k lines of the log.

01 STOP DELETE

Look at "num_deleted": 59009,
02
03
config
We fixed an issue with discard new that is in main and will be in 2.10.15; however, your KV may have already had inconsistencies, so when a new leader was elected it used its state. The way to sync to a known good state is to make sure that replica is the leader, then scale down to 1 and back up. Once upgraded to 2.10.15 the issue should not occur anymore.
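A rough sketch of that recovery sequence with the nats CLI (the stream name is a placeholder; for a KV bucket use its KV_<bucket> backing stream). Note that step-down elects a new leader rather than letting you pick one, so it may need to be repeated until the known-good replica is the leader:

```sh
# Move leadership until the replica with known-good data is the leader.
nats stream cluster step-down MY_STREAM

# Scale down to one replica (the leader's state wins)...
nats stream edit MY_STREAM --replicas=1 --force
# ...then scale back up so the other replicas resync from it.
nats stream edit MY_STREAM --replicas=3 --force
```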
We created a new, clean cluster.
Reproduced...
Could you retest from the top of the main branch or a nightly build?
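If it helps, a build from the top of main can be obtained with something like the following (the nightly image name is an assumption; check which nightly image your deployment uses):

```sh
# Build nats-server from the current main branch.
go install github.com/nats-io/nats-server/v2@main

# Or, for container deployments, pull a nightly build image.
docker pull synadia/nats-server:nightly
```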
What sync interval is the system configured with? Would you be open to a video call?
Hi! Sorry for the delayed answer. If possible, let's organize a call next week. We are continuing to test with the latest version and trying to find the root of this behaviour.
Looping in @neilalexander and @Jarema and @wallyqs who can coordinate.
Hey @grrrvahrrr!
I experienced this same issue last night after a node pool upgrade, and I'd be quite interested in a resolution here. Was anything gleaned from the aforementioned video call (assuming it happened)?
@Marquis42 what server version were you using to do the node pool upgrade? Any logs you could share? We did fix a stuck catchup situation, hence the server version is helpful.
We're running 2.10.14 at the moment.
I believe it was fixed in 2.10.16. We are also staging 2.10.17 as RC1 now if you want to test.
That's great news, thank you! I'm not sure if I can reproduce as it's only happened the one time, but I'll be sure to upgrade at the earliest possible time.
…On Thu, Jun 6, 2024 at 7:42 PM Derek Collison wrote:
#5454
Not fixed in 2.10.17 and 2.10.18.
@bfoxstudio Do you have clear instructions we could use to recreate what you are seeing on our side?
What we are doing is simple, tested on both Docker and k3s:
How big is each message?
How big is each message? 0.33 kb
How many different keys / subjects? 22.5k
Does the KV have history? No
What do you mean by drain the node? We switch node availability to Drain in Portainer to simulate a server shutdown; more info here: https://docs.docker.com/engine/swarm/swarm-tutorial/drain-node/

Stream info and kv info for the stream with desync:
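For reference, draining and reactivating a Swarm node as described above comes down to (the node name is a placeholder):

```sh
# Take the node out of service so its tasks are rescheduled elsewhere.
docker node update --availability drain worker-1

# Bring it back once the test is done.
docker node update --availability active worker-1
```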
@wallyqs could you look into this a bit and see what we can find?
@wallyqs Hi! Any news on the issue?
Thanks for the information about this issue; we have a test reproduction for it and are currently investigating how to address it.
We solved it for now following the steps here: #5205 (comment)
It works for us too, but we would like more stability.
If one server cannot handle the load, reducing the number of replicas can lead to complete data loss. This is not a solution to the problem, but it does help restore cluster functionality.
Sometimes during a stream catchup after a restart, when applying entries the clfs could have been bumped due to msg not found errors from likely deleted messages, causing a stream replica to remain out of sync. Fixes #5205

Signed-off-by: Waldemar Quevedo <wally@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Co-authored-by: Derek Collison <derek@nats.io>
@katrinwab we have some additional protections going into the 2.10.21 release that are landing on main, but these are for restoring streams on server start or stream restore. Is this after a restart or just during normal operations?
We had problems with the network between nodes.
@derekcollison Yesterday I updated to 2.10.21, but today I got the problem again.
@katrinwab we would need quite a bit more information. Are you a Synadia customer? Is this production-impacting?
We have the same problem.
Here is the stream info:
It's quite critical for us, as we hit the problem whenever we simply scale our cluster in production.
Observed behavior
I discovered this when exploring key value and testing resilience and performance, among other things.
I used a Helm install in Kubernetes.
While updating a key value bucket (in memory, with replication 3) at about 1000 updates/s, I restarted one pod. It never came back up and got stuck in a stream catchup state (I waited over 20 minutes). Subsequent restarts did not resolve the issue. It was only resolved after the stream was deleted (or, in subsequent tests, when all NATS pods were stopped).
See also this thread in slack: https://natsio.slack.com/archives/C06EN6HCWE4/p1710151450266159
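A replica stuck in catchup like this is usually visible in the stream's cluster info, for example with something like (the bucket name is a placeholder):

```sh
# The affected replica keeps reporting "current": false with a non-zero lag
# long after the restart instead of catching up.
nats stream info KV_my-bucket --json | jq '.cluster.replicas'
```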
Stream:
Log snippet:
Expected behavior
I expected the restarted pod to catch up and come back into the cluster.
Server and client version
Server version: 2.10.11
Client version (golang client): v1.32.0
Host environment
Both server and client were running in managed Kubernetes in Google Cloud, using Google Filestore as the persistence layer.
Steps to reproduce
It is a bit hard to reproduce; I only managed to reproduce it in about 1 of 5 tries.
What I did was:
I also tried to reproduce using a file store for the stream, but didn't manage to do it within 10 tries or so.
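A minimal sketch of the kind of load described above, using the nats CLI (the bucket name and key count are illustrative, flag names may vary by CLI version, and a shell loop will not actually reach 1000 updates/s, so a real client or nats bench would be needed for the full rate):

```sh
# In-memory KV bucket with 3 replicas, as in the report.
nats kv add testkv --replicas=3 --storage=memory

# Hammer the bucket with updates while one NATS pod is restarted.
i=0
while true; do
  nats kv put testkv "key-$((i % 1000))" "value-$i"
  i=$((i + 1))
done &

# In another terminal, restart one server pod mid-load.
kubectl delete pod nats-1
```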