Lost node due to raft issue #190
Comments
Yes, please, send me the data files. |
Oh I see you did already, ok then, I'll have a look. |
The error message that raft reports in this case is wrong. I fixed it in canonical/raft#96, so now you'll get:
which means that there's a mismatch between the number of raft log entries encoded in the segment filename and the number of raft log entries actually contained in the segment file. It's probably a bug in the raft library itself, although I thought version 0.9.12 had a fix for it. Just to double check:
|
What is puzzling is that the … There are also several other segments past … It's the first time I see something like this; I'm wondering if recent changes introduced a regression, although in my own tests I've not hit it. |
This environment has been very chaotic, so it's hard to say anything for certain, but I believe I was running libraft v0.9.9 during the time I experienced a lot of failures. While I attempt to call node.Close(), there's no real guarantee that it completes before the process terminates. So this node was often stuck in reboot loops where the dqlite server would start and then the dqlite client would encounter an error and crash, so there were a lot of improper shutdowns. Then I updated to libraft v0.9.12, plus libco, sqlite, and dqlite as of 12/12. The system seemed more stable after that. Then I was testing more failure scenarios with SIGTERM (but again, I don't know if node.Close() would really exit fully), and that's when I got this error. |
I've reproduced this error two other times now. I can't do it consistently, but it seems that if I just keep randomly restarting things I eventually hit it. |
Any chance you could provide me with the code and the steps to reproduce exactly what you are doing to trigger the issue? |
I'll see if I can narrow it down to something small and reproducible. It might help me narrow this down if you could explain what the different data files are and when and how they are written to. Are these numbered files the current log? Are they append-only, or do you rewrite them on compaction? When does compaction occur? I'm guessing I'm killing the server at exactly the right time during some initialization or data housekeeping task. |
The semantics of those files are basically the same as in LogCabin and etcd; see this docstring in the raft library: https://github.com/canonical/raft/blob/master/include/raft/uv.h#L24 To reply to your questions: yes, the numbered files are the current log; they are append-only and might be deleted upon compaction; compaction occurs approximately every 1k transactions (currently there's no API to tweak that, but it will come). Even if you don't find something small that's okay for me, I can find my way, as long as I can build from source and run the thing. |
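As a rough illustration of that layout (a sketch only, assuming closed segments are named `<first-index>-<last-index>` and open segments `open-<counter>`, per that docstring, and using a made-up data directory path), a small Go program could classify the files like this:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// Filename patterns assumed from the uv.h docstring: closed segments are named
// "<first-index>-<last-index>", open segments "open-<counter>". Treat these as
// an approximation of the real convention used by the C raft library.
var (
	closedSegment = regexp.MustCompile(`^(\d+)-(\d+)$`)
	openSegment   = regexp.MustCompile(`^open-(\d+)$`)
)

func main() {
	dir := "/var/lib/dqlite" // hypothetical data directory
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		name := e.Name()
		switch {
		case closedSegment.MatchString(name):
			m := closedSegment.FindStringSubmatch(name)
			// The name encodes which log indexes the segment should contain; the
			// error discussed in this issue means the file's contents disagree with it.
			fmt.Printf("closed segment %s: entries %s..%s\n", name, m[1], m[2])
		case openSegment.MatchString(name):
			fmt.Printf("open segment %s: still being appended to\n", name)
		default:
			fmt.Printf("other file: %s (e.g. metadata or snapshot)\n", name)
		}
	}
}
```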
At least according to the state that you sent me by email, the problem seems to have occurred some time before the process was asked to terminate, so it doesn't seem to be caused by the act of terminating it. |
I'll try to give you a reproducible setup. Right now I'm working off of a local dev branch but I'll commit it soon. |
I'm still working on this; honestly it might take a couple of days because I'm blocked on the k8s v1.17 upgrade. |
Any news on this one? This prevents multi-master from being a real possibility within k3s. |
Any hint about how to reproduce it? |
I'm able to reproduce this scenario pretty reliably while standing up 3 k3s nodes managed via k3os: the cluster will regularly get out of sync and become unstable almost on boot. I'm unsure what's happening there, but I get the same error. |
@Kampe that's very good, it sounds like I should be able to reproduce that too. Care to provide a detailed list of steps that are likely to lead to the issue? Being able to see it on my side is probably the only effective way we're going to nail it. |
That works, here's what I do: create/download the k3os 0.9.0 ISO and rip it to a USB device (you can probably use VirtualBox here, I'd assume, although the most interesting behavior I've seen is from running it on real devices). Using cloud-init files, stand up your 'primary' node; the first of your nodes needs to stand up and initialize the cluster, and to do so I use this in my cloud-init file:
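As an illustrative sketch only, a minimal k3os cloud-config for the first server might look like the following; the `k3os` keys and the `--cluster-init` flag are assumptions based on the k3s/k3os documentation of that era, not necessarily the exact file used here:

```yaml
# Hypothetical k3os cloud-config for the first ("primary") server node.
# The k3os keys and the --cluster-init flag are assumptions, not the original file.
ssh_authorized_keys:
  - ssh-rsa AAAA...example
k3os:
  token: example-cluster-token   # shared secret the other servers will join with
  k3s_args:
    - server
    - "--cluster-init"           # initialize the embedded dqlite cluster
```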
I then stand up two more 'secondary' nodes and join them to the cluster with this cloud init:
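Again purely hypothetically, the joining servers would point at the first node and reuse its token (keys assumed, not the poster's actual file):

```yaml
# Hypothetical k3os cloud-config for a joining ("secondary") server node.
ssh_authorized_keys:
  - ssh-rsa AAAA...example
k3os:
  server_url: https://10.0.0.10:6443   # address of the first server (made-up IP)
  token: example-cluster-token         # must match the token on the first node
  k3s_args:
    - server                           # run as a server so it joins the dqlite cluster
```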
Once all three are up and clustered, it won't take long for them to get out of sync. If you'd really like to see some interesting behavior, turn off the 'primary' node and then turn it back on and let it try to rejoin the cluster; however, I've also had success bringing this error up just by letting leader election take place and waiting for the control plane to swap nodes. I also lay down a handful of manifests with cloud-init to get things going when my cluster initializes. Let me know if those may be helpful for you as well. |
What version of go-dqlite/dqlite/raft is used exactly in the 0.9.0 release? @ibuildthecloud |
I assume they get "out of sync" only as a consequence of some sort of restart, or does it also happen when no service gets restarted?
I'll try that. @ibuildthecloud: if you are not doing this already, one thing that will massively improve things here is to use the "leadership transfer" API recently added to (go-)dqlite. Essentially, when you want to cleanly shut down the leader, you first try to transfer leadership to another node.
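A rough sketch of what that could look like with the go-dqlite client package (the function names below reflect my reading of the API around that time; check them against the release you're actually using):

```go
package shutdown

import (
	"context"
	"log"
	"time"

	"github.com/canonical/go-dqlite/client"
)

// transferAndStop sketches a clean leader shutdown: before closing the local
// dqlite node, ask the current leader to hand leadership to another node.
// Names like FindLeader, Cluster and Transfer are assumptions taken from the
// go-dqlite client API of that era; verify against your release.
func transferAndStop(store client.NodeStore, localID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	leader, err := client.FindLeader(ctx, store)
	if err != nil {
		return err
	}
	defer leader.Close()

	nodes, err := leader.Cluster(ctx)
	if err != nil {
		return err
	}
	for _, node := range nodes {
		if node.ID == localID {
			continue // don't transfer leadership to ourselves
		}
		if err := leader.Transfer(ctx, node.ID); err != nil {
			return err
		}
		log.Printf("leadership transferred to node %d", node.ID)
		break
	}
	// Only now proceed to close the local dqlite node and exit.
	return nil
}
```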
What are these cloud-inits for? Isn't k3os supposed to do everything that's needed? |
The inits are for my own configuration on cluster init: we lay down a handful of operators and set up some "site specific" secrets. I only stand these up as part of a custom ISO pointing to the cloud-inits, which also lays down a handful of other things, so I haven't tested HA k3s without having a handful of services and deployments laid down on the cluster. But as soon as a cluster is up and all three nodes have joined, within 10 minutes raft gets pretty angry and starts with the logging above. |
Any workaround for this issue? Any way to sync the data manually? |
For the specific error that was reported here, you can try to delete the last segment in the log (the file named in the error message). For syncing nodes manually, you can pick the node whose data directory has the longest log and copy it to the other nodes. |
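Purely as an illustrative sketch, with made-up paths, filenames, and service names, those manual steps might look roughly like this:

```sh
# Hypothetical recovery steps; adjust paths, filenames and services to your setup.
# 1. Stop the service on the affected nodes first.
sudo systemctl stop k3s

# 2. On the broken node, move the offending last closed segment aside rather
#    than deleting it outright, so it can be restored if needed.
cd /var/lib/dqlite                             # made-up data directory
mv 0000000000000001-0000000000000247 /tmp/     # made-up segment filename

# 3. To re-sync a lagging node instead, copy the data directory from the node
#    with the longest log (made-up host and path).
rsync -a --delete good-node:/var/lib/dqlite/ /var/lib/dqlite/

# 4. Restart the service.
sudo systemctl start k3s
```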
Thanks, that works. What is the root cause? |
I'm not entirely sure what happened; I'd expect some sort of crash or unclean shutdown. Any more clues from you guys about that are welcome :) |
My env is three VMs created by multipass; I shut down the host without stopping k3s. |
Things should improve quite a bit with canonical/raft#122, which landed against the C raft library. I tested a 3-node Kubernetes cluster backed by 3 kine instances clustered together, and observed crashes and issues that led to that PR and to some other PRs in go-dqlite. After these fixes I haven't noticed issues anymore and the system seems stable. I'm still not 100% sure that the bug reported in this issue is fixed by the above changes, but it might well be, if it was a side effect of the issues I've seen. |
Seems this is happening recurrently, so I will try with an external DB, as the current state is pretty unstable.
I've observed the same error when running k3s v1.18.15+k3s1 |
Yep ^^ k3s has removed any use of dqlite in the latest images and replaced it with etcd. |
I'll close this one due to inactivity. |
I've been testing failure scenarios and I have a node that can't start because it gets the following error on startup:
All I've been doing is randomly killing the server.
Let me know if you want the data files. I emailed the data files for this node.