
Lost node due to raft issue #190

Closed
ibuildthecloud opened this issue Dec 11, 2019 · 30 comments

@ibuildthecloud

ibuildthecloud commented Dec 11, 2019

I've been testing failure scenarios and I have a node that can't start because it gets the following error on startup:

raft_start(): io: load closed segment 0000000010000470-0000000010000655: 

All I've been doing is randomly killing the server. Let me know if you want the data files. I emailed the data files for this node.

@freeekanayaka
Contributor

Yes, please, send me the data files.

@freeekanayaka
Contributor

Oh I see you did already, ok then, I'll have a look.

@freeekanayaka
Contributor

The error message that raft reports in this case is wrong. I fixed it in canonical/raft#96, so now you'll get:

raft_start(): io: load closed segment 0000000010000470-0000000010000655: found 185 entries (expected 186)

which means that there's a mismatch between the number of raft log entries encoded in the segment filename and the number of raft log entries actually contained in the segment file.

It's probably a bug in the raft library itself, although I thought version 0.9.12 had a fix for it. Just to double check:

  • Was the server running version 0.9.12, even before you noticed this error? This is to rule out the possibility that you stopped a server running 0.9.11 or lower, which caused the bug, and then upgraded to 0.9.12 and only noticed the error after it had already occurred.

  • How are you "randomly killing" the server exactly? With kill -KILL <pid>? Or with some kind of graceful shutdown, e.g. catching SIGTERM/SIGINT and starting some shutdown sequence that eventually calls dqlite.Node.Close() and waits for it to return?
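
For what it's worth, a minimal Go sketch of that graceful path (not code from this issue; the dqlite.New arguments and data directory below are purely illustrative, and the exact constructor signature may vary between go-dqlite versions) could look like this:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"

    dqlite "github.com/canonical/go-dqlite"
)

func main() {
    // Illustrative node setup; a real application would use its own ID,
    // bind address, and data directory.
    node, err := dqlite.New(1, "127.0.0.1:9181", "/var/lib/myapp/dqlite")
    if err != nil {
        log.Fatal(err)
    }
    if err := node.Start(); err != nil {
        log.Fatal(err)
    }

    // Block until a termination signal arrives.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
    <-sigs

    // Graceful case: only exit once Close() has returned, giving raft a
    // chance to finish writing the currently open segment.
    if err := node.Close(); err != nil {
        log.Printf("dqlite node close: %v", err)
    }
}

A kill -KILL, by contrast, gives the process no opportunity to run this shutdown path at all.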

@freeekanayaka
Contributor

freeekanayaka commented Dec 12, 2019

What is puzzling is that the 0000000010000470-0000000010000655 segment is not the last one that was written, which is what I would expect if the bug here was the one supposedly fixed in 0.9.7.

There are also several other segments past 0000000010000470-0000000010000655, which suggests that the bug occurred somewhere during normal operation, not at shutdown, and that the issue only manifested itself when the server restarted.

It's the first time I've seen something like this. I'm wondering if recent changes introduced a regression, although I haven't hit it in my own tests.

@ibuildthecloud
Author

This environment has been very chaotic, so it's hard to say anything for certain, but I believe I was running libraft v0.9.9 during the period when I experienced a lot of failures. While I attempt to call node.Close(), there's no real guarantee that it completes before the process terminates. This node was often stuck in reboot loops where the dqlite server would start and then the dqlite client would hit an error and crash, so there were a lot of improper shutdowns. Then I updated to libraft v0.9.12, plus libco, sqlite, and dqlite as of 12/12, and the system seemed more stable after that. I was then testing more failure scenarios by sending SIGTERM (but again, I don't know whether node.Close() would really complete), and that's when I got this error.

@ibuildthecloud
Author

I've reproduced this error 2 other times now. I can't do it consistently, but it seems that if I just keep randomly restarting things I eventually hit it.

@freeekanayaka
Contributor

Any chance you could provide me with the code and the steps to reproduce exactly what you are doing to trigger the issue?

@ibuildthecloud
Author

I'll see if I can narrow it down to something small and reproducible. It might help me narrow this down if you could explain what the different data files are and when/how they are written to.

Are these numbered files the current log? Are they append only or do you rewrite them on compaction? When does compaction occur? I'm guessing I'm killing the server at the exact right time during some initialization or data housekeeping task.

@freeekanayaka
Contributor

The semantics of those files are basically the same as in LogCabin and etcd; see this docstring in the raft library:

https://github.com/canonical/raft/blob/master/include/raft/uv.h#L24

To reply to your questions: yes, the numbered files are the current log; they are append-only and may be deleted upon compaction, which occurs roughly every 1k transactions (currently there's no API to tweak that, but it will come).
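
As a purely illustrative aside (this is not dqlite code), the expected entry count in the error above follows directly from the closed-segment naming convention: a closed segment is named <firstIndex>-<lastIndex>, so the expected number of entries is lastIndex - firstIndex + 1. A small Go sketch:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// expectedEntries parses a closed segment filename of the form
// <firstIndex>-<lastIndex> and returns how many entries it should contain.
func expectedEntries(name string) (uint64, error) {
    parts := strings.Split(name, "-")
    if len(parts) != 2 {
        return 0, fmt.Errorf("not a closed segment name: %q", name)
    }
    first, err := strconv.ParseUint(parts[0], 10, 64)
    if err != nil {
        return 0, err
    }
    last, err := strconv.ParseUint(parts[1], 10, 64)
    if err != nil {
        return 0, err
    }
    return last - first + 1, nil
}

func main() {
    n, err := expectedEntries("0000000010000470-0000000010000655")
    if err != nil {
        panic(err)
    }
    fmt.Println(n) // 186, versus the 185 entries raft actually decoded from the file
}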

Even if you can't narrow it down to something small, that's okay for me; I can find my way, as long as I can build from source and run the thing.

@freeekanayaka
Contributor

At least according to the state that you sent me by email, the problem seems to have occurred some time before the process was asked to terminate, so it doesn't seem to be caused by the act of terminating it.

@ibuildthecloud
Author

I'll try to give you a reproducible setup. Right now I'm working off of a local dev branch but I'll commit it soon.

@ibuildthecloud
Author

I'm still working on this; honestly it might take a couple of days because I'm blocked on the k8s v1.17 upgrade.

@Kampe

Kampe commented Mar 1, 2020

Any news on this one here? This prevents multi-master from being a real possibility within k3s.

@freeekanayaka
Contributor

Any hint about how to reproduce it?

@Kampe

Kampe commented Mar 1, 2020

I'm able to reproduce this scenario pretty reliably while standing up 3 k3s nodes managed via k3os: the cluster will regularly get out of sync and become unstable almost on boot. I'm not sure what's happening there, but I constantly get the same raft_start(): io: load closed segment 0000000010000470-0000000010000655: found 185 entries (expected 186) log output from k3s on my master nodes.

@freeekanayaka
Contributor

@Kampe that's very good, it sounds like I should be able to reproduce that too. Care to provide a detailed list of steps that are likely to lead to the issue? Being able to see it on my side is probably the only effective way we're going to nail it.

@Kampe

Kampe commented Mar 2, 2020

That works, here's what I do:

Create/download the k3os 0.9.0 ISO and write it to a USB device (you could probably use VirtualBox here, I'd assume, although the most interesting behavior I've seen comes from running it on real devices).

Using cloud-init files, stand up your 'primary' node. The first of your nodes needs to stand up and initialize the cluster; to do so I use this in my cloud-init file:

k3os:
  token: secret
  k3s_args:
  - server
  - "--disable-cloud-controller"
  - "--disable-network-policy"
  - "--cluster-init"

I then stand up two more 'secondary' nodes and join them to the cluster with this cloud-init:

k3os:
  server_url: https://k3os-primary:6443
  token: secret
  k3s_args:
  - server
  - "--disable-cloud-controller"
  - "--disable-network-policy"

Once all three are up and clustered, it won't take long for them to get out of sync. If you'd really like to see some interesting behavior, turn off the "primary" node, turn it back on, and let it try to rejoin the cluster. However, I've also had success bringing this error up just by letting leader election take place and waiting for the control plane to swap nodes.

I also lay down a handful of manifests with cloud-init to get things going when my cluster initializes. Let me know if those would be helpful for you as well.

@freeekanayaka
Contributor

What version of go-dqlite/dqlite/raft is used exactly in the 0.9.0 release? @ibuildthecloud

@freeekanayaka
Contributor

Once all three are up and clustered, it won't take long for them to get out of sync

I assume they get "out of sync" only as a consequence of some sort of restart, or does it also happen when no service gets restarted?

If you'd really like to see some interesting behavior, turn off the "primary" node, turn it back on, and let it try to rejoin the cluster. However, I've also had success bringing this error up just by letting leader election take place and waiting for the control plane to swap nodes.

I'll try that.

@ibuildthecloud: if you are not doing this already, one thing that will massively improve things here is to use the "leadership transfer" API recently added to (go-)dqlite. Essentially, when you want to cleanly shut down the leader, you first try to transfer leadership to another node.
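
A rough sketch of that handover idea in Go (assuming the client.Transfer call that was added to go-dqlite around this time; method names and signatures may differ slightly across versions, and the address and node ID below are made up):

package main

import (
    "context"
    "log"
    "time"

    "github.com/canonical/go-dqlite/client"
)

// handover asks the current leader to pass leadership to another voter
// before this process stops its own dqlite node and calls Node.Close().
func handover(leaderAddr string, targetID uint64) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    cli, err := client.New(ctx, leaderAddr)
    if err != nil {
        return err
    }
    defer cli.Close()

    return cli.Transfer(ctx, targetID)
}

func main() {
    if err := handover("127.0.0.1:9181", 2); err != nil {
        log.Printf("leadership transfer failed: %v", err)
    }
}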

I also lay down a handful of manifests with cloud-init to get things going when my cluster initializes. Let me know if those would be helpful for you as well.

What are these cloud-inits for? Isn't k3os supposed to do everything that's needed?

@Kampe

Kampe commented Mar 2, 2020

The inits are for my own configuration on cluster init: we lay down a handful of operators and set up some "site specific" secrets in the /var/lib/rancher/k3s/server/manifests/ directory, which allows us to apply everything in the directory at runtime. (Previously we'd only been running with a single master and a handful of agent nodes.)

I only stand these up as part of a custom ISO pointing to cloud-inits, which also lay down a handful of other things, so I haven't tested HA k3s without a handful of services and deployments laid down on the cluster. But as soon as a cluster is up and all three nodes have joined, within 10 minutes raft gets pretty angry and starts emitting the logging above.

@thomasdba

Any workaround for this issue? Any way to sync the data manually?

@freeekanayaka
Contributor

For the specific error that was reported here, you can try to delete the last segment in the log (the file 0000000010000470-0000000010000655).

For syncing nodes manually, you can pick the node whose data directory has the longest log, and copy it to the other nodes.
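
As a hypothetical sketch of that first workaround (the data directory path here is a placeholder, not a real default; use whatever directory your application passes to dqlite, and take a backup of the directory first):

package main

import (
    "log"
    "os"
    "path/filepath"
)

func main() {
    dataDir := "/var/lib/myapp/dqlite" // placeholder path, not a real default
    segment := "0000000010000470-0000000010000655"

    // Remove the closed segment named in the startup error, then restart
    // the node so raft can load the log without it.
    if err := os.Remove(filepath.Join(dataDir, segment)); err != nil {
        log.Fatal(err)
    }
    log.Printf("removed segment %s; restart the node now", segment)
}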

@thomasdba

Thanks, that works. What is the root cause?

@freeekanayaka
Contributor

I'm not entirely sure what happened; I'd expect some sort of crash or unclean shutdown. Any more clues from you guys about that are welcome :)

@thomasdba

My env is three VMs created by multipass; I shut down the host without stopping k3s.

@freeekanayaka
Contributor

Things should improve quite a bit with canonical/raft#122 which landed against the C raft library.

I tested a 3-node Kubernetes cluster backed by 3 kine instances clustered together, and observed crashes and issues that led to the above PR and to some other PRs in go-dqlite. After these fixes I haven't noticed issues anymore and the system seems stable.

I'm still not 100% sure that the bug reported in this issue is fixed by the above changes, but it might well be, if it was some side effect of the issues that I've seen.

@aktaksh

aktaksh commented Nov 24, 2020

This seems to be happening recurrently, so I will try with an external DB, as the current state is pretty unstable.

@tceduard

tceduard commented Feb 4, 2021

I've observed the same error when running k3s v1.18.15+k3s1.
The error has disappeared in k3s v1.19.5+k3s1.

@Kampe

Kampe commented Feb 10, 2021

Yep ^^ k3s has removed any use of dqlite in the latest images and replaced it with etcd.

@MathieuBordere
Contributor

I'll close this one due to inactivity.
