During rollout restart the operator doesn't wait until the previously restarted pod becomes part of the Scylla cluster #1077
We wait for readiness; I wonder if this is a case where it becomes ready without having a quorum.
@vponomaryov do we have a run of this with the latest scylla-core master?
I didn't run it with master. Only with the specified versions in the bug report.
Can we run it at least once with the master version? And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.
The error is expected because a Scylla node always becomes
We can slow down traffic between two nodes.
The 'test_nodetool_flush_and_reshard' test fails due to the following bug: scylladb/scylla-operator#1077. So, skip it.
The probability of hitting this bug is very high.
The 'test_nodetool_flush_and_reshard' test is affected by the following bug: scylladb/scylla-operator#1077 when using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'. So, skip it on the slow backends.
@tnozicka - I wonder whether our logic should be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is' ...
The operator's racy bug [1] gets reproduced on slow backends and on nodes with a small number of cores. So, skip it for nodes that are not running on EKS and have fewer than 14 cores. [1] scylladb/scylla-operator#1077
Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE and we are talking about migrating it to project CONSISTENCY_QUORUM. @rzetelskik can you summarize the update on your discoveries that we talked about today?
@rzetelskik @tnozicka can one of you give an update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context? I don't understand what it means...
Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that'd just send read queries periodically with quorum consistency and it caused the suite to fail quite consistently, so I haven't yet followed up on why this only kept happening on GKE for you guys, since the issue seems more general anyway. Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:
At this point you'll get errors from queries with quorum consistency regardless of which node you hit. So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change. The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.
I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.
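To make the proposal concrete, below is a minimal sketch of such a peer-consensus check. It assumes a hypothetical per-node HTTP status endpoint (`/status/peers` on port 8080) that returns which peers the queried node currently sees as UN; the endpoint, port, and JSON shape are illustrative, not an actual Scylla or scylla-operator API.

```go
// Package readinesssketch illustrates the proposed peer-consensus readiness
// heuristic: a node is ready only if every peer it asks sees at least N-1
// nodes as UN, where N is the total number of nodes in the cluster.
package readinesssketch

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// peerView is the assumed JSON shape of a hypothetical per-node status
// endpoint; it is not an actual Scylla or scylla-operator API.
type peerView struct {
	UpNormal []string `json:"upNormal"` // addresses this peer currently sees as UN
}

func peerConsensusReady(peers []string) (bool, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	n := len(peers)

	for _, p := range peers {
		resp, err := client.Get(fmt.Sprintf("http://%s:8080/status/peers", p))
		if err != nil {
			return false, fmt.Errorf("querying %s: %w", p, err)
		}

		var view peerView
		err = json.NewDecoder(resp.Body).Decode(&view)
		resp.Body.Close()
		if err != nil {
			return false, fmt.Errorf("decoding response from %s: %w", p, err)
		}

		// If any peer sees fewer than N-1 nodes as UN, gossip may have
		// settled with a split brain, so quorum queries could still fail.
		if len(view.UpNormal) < n-1 {
			return false, nil
		}
	}
	return true, nil
}
```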
Issue description
It happens not only when changing the number of cores. The same situation happened during a typical rolling restart on GKE.
Just before the CDC errors, node-1 restarted:
Impact
Cluster is not operable for CL=QUORUM when RF=3.

How frequently does it reproduce?
During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.

Installation details
Kernel Version: 5.15.0-1023-gke
Operator Image: scylladb/scylla-operator:latest
Scylla Nodes used in this run:
OS / Image:
Test:

Logs and commands
Logs:
@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. 😛
Just to understand it, could you explain the calculation here?
Remember that such a workaround would require running with additional Linux capabilities. Looking at the conntrack cleaner daemonset (9a63e01), which we used in our CI to resolve a similar issue in the past (#971), it would require NET_ADMIN. I also wonder if that wouldn't affect the stability of otherwise healthy clusters.
I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?
It's worth explicitly stating that the issue comes from kube-proxy depending on netfilter's conntrack. From my understanding, at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if it also occurs in ipvs mode. @zimnx have you tried it? Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Our nodesetup pods are already running as root, so no extra permissions are required.
Removing only SYN_SENT entries towards ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause additional reconnection attempts.
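As a rough illustration of that idea, the sketch below scans /proc/net/nf_conntrack (assuming the nf_conntrack module exposes it) for entries stuck in SYN_SENT towards a given destination port; port 7000 for inter-node RPC is only an example, and actually deleting the matched entries would still need a netlink conntrack client or the conntrack CLI with NET_ADMIN, as noted above.

```go
// Package conntrackscan sketches how stuck SYN_SENT conntrack entries
// towards a given destination port could be identified by reading
// /proc/net/nf_conntrack. This only lists candidates; deleting them would
// still require a netlink conntrack client or the conntrack CLI (NET_ADMIN).
package conntrackscan

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func listStaleSynSent(dstPort int) ([]string, error) {
	f, err := os.Open("/proc/net/nf_conntrack")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	want := fmt.Sprintf("dport=%d", dstPort)
	var stale []string

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// TCP entries carry the connection state as a plain token, e.g.
		// "ipv4 2 tcp 6 117 SYN_SENT src=... dst=... sport=... dport=7000 ...".
		if strings.Contains(line, "SYN_SENT") && strings.Contains(line, want) {
			stale = append(stale, line)
		}
	}
	return stale, scanner.Err()
}
```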
That's what I plan to look into next, maybe there's something else causing disruption.
Nope
I haven't tried with different node teardown logic. It's something I want to try out later as well.
Doesn't work, this setting needs to be changed on the node, not in the container network namespace.
I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues in both ClusterIP and PodIP setups without minReadySeconds.
We no longer see any availability errors, meaning Scylla rolls out without losing quorum. Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.
Versions >=5.1 cause request failures while a node is shutting down. Changing
Looks like setting
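For illustration only (not the actual node setup implementation), lowering the SYN_SENT conntrack timeout from a pod running as root with the host's /proc/sys writable could look roughly like this; the concrete timeout value would be configurable.

```go
// Package nodesetupsketch illustrates lowering the netfilter conntrack
// timeout for TCP connections stuck in SYN_SENT, so that entries pointing at
// a stale ClusterIP backend expire quickly and the next connection attempt
// traverses iptables again and picks up the updated mapping.
package nodesetupsketch

import (
	"fmt"
	"os"
)

const synSentTimeoutPath = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_syn_sent"

// setSynSentTimeout must run as root on the node (e.g. from a privileged
// node setup DaemonSet with the host's /proc/sys writable), since this is a
// node-level setting rather than a per-network-namespace one.
func setSynSentTimeout(seconds int) error {
	value := []byte(fmt.Sprintf("%d\n", seconds))
	if err := os.WriteFile(synSentTimeoutPath, value, 0o644); err != nil {
		return fmt.Errorf("setting nf_conntrack_tcp_timeout_syn_sent: %w", err)
	}
	return nil
}
```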
@zimnx could we make it an optional feature in node setup? The users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like reasonably many choices.
Issues in Scylla regarding loss of availability during node shutdown on newer versions and rpc_client connection timeout
Great job drilling down into this! I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DS. It seems fragile, potentially affecting other workloads, and quite platform dependent. The Scylla timeout seems like a more fitting and stable place, even if we have to wait a bit.
Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?
@xemul looked at multiple RPC issues recently, maybe he has some idea.
Rollout proceeds; the process dies some time after we drain and kill it, without the need for an ungraceful shutdown.
After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue. The reason a timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port, so the session would have a new ip/port tuple, which requires iptables traversal and effectively uses the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters.
We can force a connection timeout with namespaced sysctls. Looks like even with a configurable timeout, we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough. Seems like the earlier idea of asking multiple nodes about our own status would be a good way to solve these availability issues. Together with the recommendation to set additional sysctls, it should make readiness pass quickly enough.
We won't.
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/lifecycle rotten
/remove-lifecycle rotten
Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), then we hit the following race condition:
The first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod.
In a cluster of 3 nodes this leaves the cluster inoperable, without quorum, for several minutes.
It happens on GKE and doesn't happen on EKS, because in the latter case a restarted pod rejoins the cluster much faster.
Below are the logs from a run on a GKE cluster.

pod-2 (the first restarted) logs:

pod-1 (the second restarted) logs:

So, from the logs above we can see that when pod-1 (the second restarted) got the shutdown message, the first one was not ready yet. The quorum error that appeared from CDC is the proof of it.

To Reproduce
Steps to reproduce the behavior:
Change the number of cores in the scyllacluster spec.

Expected behavior
The operator should wait for the previous pod to return to the Scylla cluster before restarting the next one.
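A minimal sketch of that gating logic, with a placeholder probe function injected as a parameter (not the operator's actual API), could look like this:

```go
// Package rolloutsketch shows the expected gating logic: block the rollout
// until every other node reports the restarted node as UN (or the context
// expires) before restarting the next pod.
package rolloutsketch

import (
	"context"
	"fmt"
	"time"
)

// waitForNodeRejoin polls the injected probe for every peer; seesNodeUN is a
// placeholder for however the operator would ask a peer whether it sees the
// restarted node as UN.
func waitForNodeRejoin(
	ctx context.Context,
	restarted string,
	peers []string,
	seesNodeUN func(ctx context.Context, peer, node string) (bool, error),
) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		allSeeUN := true
		for _, peer := range peers {
			ok, err := seesNodeUN(ctx, peer, restarted)
			if err != nil || !ok {
				allSeeUN = false
				break
			}
		}
		if allSeeUN {
			// Every peer agrees the node is back; safe to restart the next pod.
			return nil
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("node %s did not rejoin in time: %w", restarted, ctx.Err())
		case <-ticker.C:
		}
	}
}
```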
Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz
Environment: