
Lost node due to raft issue #190

Closed
ibuildthecloud opened this issue Dec 11, 2019 · 30 comments

@ibuildthecloud

ibuildthecloud commented Dec 11, 2019

I've been testing failure scenarios and I have a node that can't start because it gets the following error on startup:

raft_start(): io: load closed segment 0000000010000470-0000000010000655: 

All I've been doing is randomly killing the server. Let me know if you want the data files. I emailed the data files for this node.

@freeekanayaka
Contributor

Yes, please, send me the data files.

@freeekanayaka
Contributor

Oh I see you did already, ok then, I'll have a look.

@freeekanayaka
Contributor

The error message that raft reports in this case is wrong. I fixed it in canonical/raft#96, so now you'll get:

raft_start(): io: load closed segment 0000000010000470-0000000010000655: found 185 entries (expected 186)

which means that there's a mismatch between the number of raft log entries encoded in the segment filename and the number of raft log entries actually contained in the segment file.

It's probably a bug in the raft library itself, although I thought version 0.9.12 had a fix for it. Just to double check:

  • Was the server running version 0.9.12, even before you noticed this error? This is to rule out the possibility that you stopped a server running 0.9.11 or lower, which caused the bug, and then upgraded to 0.9.12 and only noticed the error after it had already occurred.

  • How are you "randomly killing" the server exactly? With kill -KILL <pid>? Or with some kind of graceful shutdown, e.g. catching SIGTERM/SIGINT and starting some shutdown sequence that eventually calls dqlite.Node.Close() and waits for it to return?
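
For what it's worth, a minimal Go sketch of that graceful path (not code from this issue; the dqlite.New arguments and data directory below are purely illustrative, and the exact constructor signature may vary between go-dqlite versions) could look like this:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"

    dqlite "github.com/canonical/go-dqlite"
)

func main() {
    // Illustrative node setup; a real application would use its own ID,
    // bind address, and data directory.
    node, err := dqlite.New(1, "127.0.0.1:9181", "/var/lib/myapp/dqlite")
    if err != nil {
        log.Fatal(err)
    }
    if err := node.Start(); err != nil {
        log.Fatal(err)
    }

    // Block until a termination signal arrives.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
    <-sigs

    // Graceful case: only exit once Close() has returned, giving raft a
    // chance to finish writing the currently open segment.
    if err := node.Close(); err != nil {
        log.Printf("dqlite node close: %v", err)
    }
}

A kill -KILL, by contrast, gives the process no opportunity to run this shutdown path at all.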

@freeekanayaka
Contributor

freeekanayaka commented Dec 12, 2019

What is puzzling is that the 0000000010000470-0000000010000655 segment is not the last one that was written, which is what I would expect if the bug here was the one supposedly fixed in 0.9.7.

There are also several other segments past 0000000010000470-0000000010000655, which suggests that the bug occurred somewhere during normal operation, not at shutdown, and that the issue only manifested itself when the server restarted.

It's the first time I've seen something like this. I'm wondering if recent changes introduced a regression, although I haven't hit it in my own tests.

@ibuildthecloud
Author

This environment has been very chaotic, so it's hard to say anything for certain, but I believe I was running libraft v0.9.9 during the period when I experienced a lot of failures. While I attempt to call node.Close(), there's no real guarantee that it completes before the process terminates. This node was often stuck in reboot loops where the dqlite server would start and then the dqlite client would hit an error and crash, so there were a lot of improper shutdowns. Then I updated to libraft v0.9.12, plus libco, sqlite, and dqlite as of 12/12, and the system seemed more stable after that. I was then testing more failure scenarios by sending SIGTERM (but again, I don't know whether node.Close() would really complete), and that's when I got this error.

@ibuildthecloud
Author

I've reproduced this error 2 other times now. I can't do it consistently, but it seems that if I just keep randomly restarting things I eventually hit it.

@freeekanayaka
Contributor

Any chance you could provide me with the code and the steps to reproduce exactly what you are doing to trigger the issue?

@ibuildthecloud
Author

I'll see if I can narrow it down to something small and reproducible. It might help me narrow this down if you could explain what the different data files are and when/how they are written to.

Are these numbered files the current log? Are they append only or do you rewrite them on compaction? When does compaction occur? I'm guessing I'm killing the server at the exact right time during some initialization or data housekeeping task.

@freeekanayaka
Contributor

The semantics of those files are basically the same as in LogCabin and etcd; see this docstring in the raft library:

https://github.com/canonical/raft/blob/master/include/raft/uv.h#L24

To reply to your questions: yes, the numbered files are the current log; they are append-only and may be deleted upon compaction, which occurs roughly every 1k transactions (currently there's no API to tweak that, but it will come).
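
As a purely illustrative aside (this is not dqlite code), the expected entry count in the error above follows directly from the closed-segment naming convention: a closed segment is named <firstIndex>-<lastIndex>, so the expected number of entries is lastIndex - firstIndex + 1. A small Go sketch:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// expectedEntries parses a closed segment filename of the form
// <firstIndex>-<lastIndex> and returns how many entries it should contain.
func expectedEntries(name string) (uint64, error) {
    parts := strings.Split(name, "-")
    if len(parts) != 2 {
        return 0, fmt.Errorf("not a closed segment name: %q", name)
    }
    first, err := strconv.ParseUint(parts[0], 10, 64)
    if err != nil {
        return 0, err
    }
    last, err := strconv.ParseUint(parts[1], 10, 64)
    if err != nil {
        return 0, err
    }
    return last - first + 1, nil
}

func main() {
    n, err := expectedEntries("0000000010000470-0000000010000655")
    if err != nil {
        panic(err)
    }
    fmt.Println(n) // 186, versus the 185 entries raft actually decoded from the file
}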

Even if you can't narrow it down to something small, that's okay for me; I can find my way, as long as I can build from source and run the thing.

@freeekanayaka
Contributor

At least according to the state that you sent me by email, the problem seems to have occurred some time before the process was asked to terminate, so it doesn't seem to be caused by the act of terminating it.

@ibuildthecloud
Author

I'll try to give you a reproducible setup. Right now I'm working off of a local dev branch but I'll commit it soon.

@ibuildthecloud
Author

I'm still working on this; honestly it might take a couple of days because I'm blocked on the k8s v1.17 upgrade.

@Kampe

Kampe commented Mar 1, 2020

Any news on this one here? This prevents multi-master from being a real possibility within k3s.

@freeekanayaka
Contributor

Any hint about how to reproduce it?

@Kampe

Kampe commented Mar 1, 2020

I'm able to reproduce this scenario pretty reliably while standing up 3 k3s nodes managed via k3os: the cluster will regularly get out of sync and become unstable almost on boot. I'm not sure what's happening there, but I constantly get the same raft_start(): io: load closed segment 0000000010000470-0000000010000655: found 185 entries (expected 186) log output from k3s on my master nodes.

@freeekanayaka
Contributor

@Kampe that's very good, it sounds like I should be able to reproduce that too. Care to provide a detailed list of steps that are likely to lead to the issue? Being able to see it on my side is probably the only effective way we're going to nail it.

@Kampe

Kampe commented Mar 2, 2020

That works, here's what I do:

Create/download the k3os 0.9.0 ISO and write it to a USB device (you could probably use VirtualBox here, I'd assume, although the most interesting behavior I've seen comes from running it on real devices).

Using cloud-init files, stand up your 'primary' node. The first of your nodes needs to stand up and initialize the cluster; to do so I use this in my cloud-init file:

k3os:
  token: secret
  k3s_args:
  - server
  - "--disable-cloud-controller"
  - "--disable-network-policy"
  - "--cluster-init"

I then stand up two more 'secondary' nodes and join them to the cluster with this cloud-init:

k3os:
  server_url: https://k3os-primary:6443
  token: secret
  k3s_args:
  - server
  - "--disable-cloud-controller"
  - "--disable-network-policy"

Once all three are up and clustered, it won't take long for them to get out of sync. If you'd really like to see some interesting behavior, turn off the "primary" node, turn it back on, and let it try to rejoin the cluster. However, I've also had success bringing this error up just by letting leader election take place and waiting for the control plane to swap nodes.

I also lay down a handful of manifests with cloud-init to get things going when my cluster initializes. Let me know if those would be helpful for you as well.

@freeekanayaka
Contributor

What version of go-dqlite/dqlite/raft is used exactly in the 0.9.0 release? @ibuildthecloud

@freeekanayaka
Contributor

Once all three are up and clustered, it won't take long for them to get out of sync

I assume they get "out of sync" only as a consequence of some sort of restart, or does it also happen when no service gets restarted?

If you'd really like to see some interesting behavior, turn off the "primary" node, turn it back on, and let it try to rejoin the cluster. However, I've also had success bringing this error up just by letting leader election take place and waiting for the control plane to swap nodes.

I'll try that.

@ibuildthecloud: if you are not doing this already, one thing that will massively improve things here is to use the "leadership transfer" API recently added to (go-)dqlite. Essentially, when you want to cleanly shut down the leader, you first try to transfer leadership to another node.
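
A rough sketch of that handover idea in Go (assuming the client.Transfer call that was added to go-dqlite around this time; method names and signatures may differ slightly across versions, and the address and node ID below are made up):

package main

import (
    "context"
    "log"
    "time"

    "github.com/canonical/go-dqlite/client"
)

// handover asks the current leader to pass leadership to another voter
// before this process stops its own dqlite node and calls Node.Close().
func handover(leaderAddr string, targetID uint64) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    cli, err := client.New(ctx, leaderAddr)
    if err != nil {
        return err
    }
    defer cli.Close()

    return cli.Transfer(ctx, targetID)
}

func main() {
    if err := handover("127.0.0.1:9181", 2); err != nil {
        log.Printf("leadership transfer failed: %v", err)
    }
}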

I also lay down a handful of manifests with cloud-init to get things going when my cluster initializes. Let me know if those would be helpful for you as well.

What are these cloud-inits for? Isn't k3os supposed to do everything that's needed?

@Kampe

Kampe commented Mar 2, 2020

The inits are for my own configuration on cluster init: we lay down a handful of operators and set up some "site specific" secrets in the /var/lib/rancher/k3s/server/manifests/ directory, which allows us to apply everything in the directory at runtime. (Previously we'd only been running with a single master and a handful of agent nodes.)

I only stand these up as part of a custom ISO pointing to cloud-inits, which also lay down a handful of other things, so I haven't tested HA k3s without a handful of services and deployments laid down on the cluster. But as soon as a cluster is up and all three nodes have joined, within 10 minutes raft gets pretty angry and starts emitting the logging above.

@thomasdba

Any workaround for this issue? Any way to sync the data manually?

@freeekanayaka
Contributor

For the specific error that was reported here, you can try to delete the last segment in the log (the file 0000000010000470-0000000010000655).

For syncing nodes manually, you can pick the node whose data directory has the longest log, and copy it to the other nodes.
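
As a hypothetical sketch of that first workaround (the data directory path here is a placeholder, not a real default; use whatever directory your application passes to dqlite, and take a backup of the directory first):

package main

import (
    "log"
    "os"
    "path/filepath"
)

func main() {
    dataDir := "/var/lib/myapp/dqlite" // placeholder path, not a real default
    segment := "0000000010000470-0000000010000655"

    // Remove the closed segment named in the startup error, then restart
    // the node so raft can load the log without it.
    if err := os.Remove(filepath.Join(dataDir, segment)); err != nil {
        log.Fatal(err)
    }
    log.Printf("removed segment %s; restart the node now", segment)
}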

@thomasdba

Thanks, that works. What is the root cause?

@freeekanayaka
Contributor

I'm not entirely sure what happened; I'd expect some sort of crash or unclean shutdown. Any more clues from you guys about that are welcome :)

@thomasdba

My env is three VMs created by multipass; I shut down the host without stopping k3s.

@freeekanayaka
Contributor

Things should improve quite a bit with canonical/raft#122 which landed against the C raft library.

I tested a 3-node Kubernetes cluster backed by 3 kine instances clustered together, and observed crashes and issues that led to the above PR and to some other PRs in go-dqlite. After these fixes I haven't noticed issues anymore and the system seems stable.

I'm still not 100% sure that the bug reported in this issue is fixed by the above changes, but it might well be, if it was some side effect of the issues that I've seen.

@aktaksh

aktaksh commented Nov 24, 2020

This seems to be happening recurrently, so I will try with an external DB, as the current state is pretty unstable.

@tceduard

tceduard commented Feb 4, 2021

I've observed the same error when running k3s v1.18.15+k3s1.
The error has disappeared in k3s v1.19.5+k3s1.

@Kampe

Kampe commented Feb 10, 2021

Yep ^^ k3s has removed any use of dqlite in the latest images and replaced it with etcd.

@MathieuBordere
Contributor

I'll close this one due to inactivity.
