
During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077

Open
Tracked by #1577
vponomaryov opened this issue Oct 28, 2022 · 40 comments · May be fixed by #1614

@vponomaryov
Contributor

Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), we start hitting the following race condition:
the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod.
In a cluster of 3 nodes this leaves the cluster inoperable, without quorum, for several minutes.
It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

Below are the logs from a run on GKE cluster.
pod-2 (the first restarted) logs:

2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO  2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO  2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO  2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900       1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766       1 operator/sidecar.go:360] "Starting scylla"
...
INFO  2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO  2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO  2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO  2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO  2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO  2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO  2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO  2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO  2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO  2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN  2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN  2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO  2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO  2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO  2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO  2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO  2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN  2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO  2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL

pod-1 logs (second restarted):

...
INFO  2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO  2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO  2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO  2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...

So, from the logs above we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one
was not ready yet. The quorum error from CDC confirms it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy 3-pod Scylla cluster on GKE
  3. Run some constant load
  4. Change the number of CPUs in the scyllacluster spec
  5. Wait for the restart of the second pod
  6. See quorum errors on the loader side.

Expected behavior
The operator should wait for the previously restarted pod to rejoin the Scylla cluster before restarting the next one.

Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz

Environment:

  • Platform: GKE
  • Kubernetes version: 1.22
  • Scylla version: 5.0.5 , 5.1.0-rc3
  • Scylla-operator version: v1.8.0-alpha.0-133-g97c831e | v1.7.4
@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2022
@tnozicka
Member

We wait for readiness; I wonder if this is the case where it becomes ready without having a quorum.

@tnozicka tnozicka added this to the v1.9 milestone Oct 31, 2022
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 31, 2022
@fruch

fruch commented Nov 2, 2022

@vponomaryov do we have a run of this with latest scylla-core master?

@vponomaryov
Contributor Author

  • 5.1.0-rc3

I didn't run it with master. Only with the specified versions in the bug report.

@fruch

fruch commented Nov 2, 2022

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

@vponomaryov
Contributor Author

Can we run it at least once with the master version?

And then we should try having a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.

Done here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/17/consoleFull

The error is expected because a Scylla node always becomes UN for itself first and only later for the others.
So, what we really need to preserve is the slowness we get on the GKE platform.
For example, this bug does not reproduce using all-in-one K8S (kind).
So a dtest reproducer will need some tricks to cause slow status updates.

@fruch

fruch commented Nov 2, 2022

We can slow down traffic between two nodes
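
For illustration only, one way a reproducer could inject that slowness is to add artificial latency with tc/netem on one of the nodes. This is a sketch, not part of any existing test; the interface name and the delay value are placeholders, and it needs root on the node (or a privileged pod in the host network namespace):

```go
// Hypothetical helper, not existing test code: adds 2s of latency to all
// egress traffic on eth0 so that gossip/status propagation becomes slow.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Revert later with: tc qdisc del dev eth0 root netem
	out, err := exec.Command("tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "2000ms").CombinedOutput()
	if err != nil {
		log.Fatalf("tc failed: %v: %s", err, out)
	}
}
```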

@tnozicka tnozicka assigned rzetelskik and unassigned zimnx Nov 14, 2022
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 15, 2022
The 'test_nodetool_flush_and_reshard' test fails due to the following
bug: scylladb/scylla-operator#1077
So, skip it.
@vponomaryov
Contributor Author

vponomaryov commented Nov 15, 2022

The probability of hitting this bug is very high.
Our nodetool flush and reshard K8S test, which triggers a rolling restart, constantly suffers from it.
The latest failure is here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-k8s-local-kind-aws/25/consoleFull
The corresponding nemesis is already being skipped.

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 16, 2022
The 'test_nodetool_flush_and_reshard' test is affected by the following
bug:

    scylladb/scylla-operator#1077

using non-fast K8S backends such as 'k8s-gke' and 'k8s-local-kind'.
So, skip it on the slow backends.
@mykaul
Contributor

mykaul commented Nov 22, 2022

@tnozicka - I wonder whether our logic should be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is' ...

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Nov 23, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@tnozicka
Member

tnozicka commented Nov 23, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Nov 27, 2022
Operator's racy bug [1] gets reproduced on slow backends and
nodes with small number of cores.
So, skip it for the nodes that are not running on the EKS and have
less than 14 cores.

[1] scylladb/scylla-operator#1077
@fruch

fruch commented Nov 28, 2022

Yeah, currently readiness projects the ability to talk CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

@rzetelskik @tnozicka can one of you give an update on this one and shed some light on what "project CONSISTENCY_QUORUM" means in this context? I don't understand what it means...

@rzetelskik
Member

rzetelskik commented Nov 28, 2022

@rzetelskik can you summarize the update of your discoveries we talked about today?

Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that just sends read queries periodically with quorum consistency, and it caused the suite to fail quite consistently. So I haven't yet followed up on why this only kept happening on GKE for you, since the issue seems more general anyway.
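
A minimal sketch of such a periodic quorum read, for illustration only (this is not the actual E2E code; gocql and the keyspace1.standard1 table seen in the logs above are used purely as an example):

```go
// Hypothetical loader loop: with RF=3 and two nodes restarted back to back,
// these reads start failing with "Cannot achieve consistency level for cl QUORUM".
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.80.11.2") // any cluster member / service address
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	for {
		var key []byte
		if err := session.Query(`SELECT key FROM keyspace1.standard1 LIMIT 1`).Scan(&key); err != nil {
			fmt.Println("quorum read failed:", err)
		}
		time.Sleep(time.Second)
	}
}
```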

Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:

  1. Entire 3 node cluster up: {1: {1: UN, 2: UN, 3:UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
  2. Restart node 3 and wait for it to come back up and declare readiness.
  3. Node 3 is up but not a part of the cluster: {1: {1: UN, 2: UN, 3:DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
  4. Restart node 2: {1: {1: UN, 2: DN, 3:DN}, 3: {1: DN, 2: DN, 3: UN}}

At this point you'll get errors from queries with quorum consistency regardless of which node you hit.

So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change.
Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.

The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.
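
A minimal sketch of that heuristic, assuming each peer can be asked which endpoints it currently sees as UN (for example via its REST API); the data structures below are illustrative, not the operator's actual probe code:

```go
// Illustrative only. views[p] is the set of endpoints that peer p reports as UN.
package main

import "fmt"

// readyForTopologyChange implements the heuristic described above: readiness
// is declared only if every peer sees at least N-1 of the N members as UN.
func readyForTopologyChange(members []string, views map[string]map[string]bool) bool {
	n := len(members)
	for _, peer := range members {
		seenUN, ok := views[peer]
		if !ok {
			return false // peer could not be queried at all
		}
		upCount := 0
		for _, m := range members {
			if seenUN[m] {
				upCount++
			}
		}
		if upCount < n-1 {
			return false
		}
	}
	return true
}

func main() {
	members := []string{"10.80.11.2", "10.80.4.107", "10.80.14.183"}
	views := map[string]map[string]bool{
		"10.80.11.2":   {"10.80.11.2": true, "10.80.4.107": true, "10.80.14.183": true},
		"10.80.4.107":  {"10.80.4.107": true, "10.80.14.183": true}, // does not see 10.80.11.2 yet
		"10.80.14.183": {"10.80.11.2": true, "10.80.4.107": true, "10.80.14.183": true},
	}
	// Every peer sees at least 2 of the 3 members UN, so a restart may proceed.
	fmt.Println(readyForTopologyChange(members, views)) // true
}
```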

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what's "project CONSISTENCY_QUORUM" in this context, cause I don't understand what does it mean...

I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.

@soyacz

soyacz commented Jan 31, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

It happens not only when changing the number of cores. The same situation happened during a typical rolling restart on GKE.
node-2:

INFO  2023-01-29 02:30:46,965 [shard 0] init - Scylla version 5.3.0~dev-0.20230127.5eadea301ecc initialization completed.
WARN  2023-01-29 02:31:34,956 [shard 0] cdc - Could not update CDC description table with generation (2023/01/29 01:40:35, 970df7c8-82ce-403b-b4a9-4804eaa2cee8): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.

Just before the CDC errors, node-1 restarted:

2023-01-29 02:30:57,175 INFO waiting for scylla to stop
INFO  2023-01-29 02:30:57,180 [shard 0] compaction_manager - Asked to stop

Impact

Cluster is not operable for CL=QUORUM when RF=3

How frequently does it reproduce?

During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.

Installation details

Kernel Version: 5.15.0-1023-gke
Scylla version (or git commit hash): 5.3.0~dev-20230127.5eadea301ecc with build-id 057196c711d6c52ca6781558c576c858a3cf049b

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.0-41-g04c20d3
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-standard-8)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1)

Test: longevity-scylla-operator-3h-gke
Test id: c3397295-e68c-4ce5-bc39-3bd16e034133
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor c3397295-e68c-4ce5-bc39-3bd16e034133
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c3397295-e68c-4ce5-bc39-3bd16e034133

Logs:

Jenkins job URL

@rzetelskik
Member

rzetelskik commented Oct 27, 2023

@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. 😛

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

  • remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until a timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Remember that such a workaround would require running with additional Linux capabilities. Looking at the conntrack cleaner daemonset (9a63e01), which we used in our CI to resolve a similar issue in the past (#971), it would require NET_ADMIN.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

  • Set minReadySeconds to a value high enough to cover the maximum SYN retransmission time. This has a big downside: cluster bootstrap time is increased.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

Maybe something else, suggestions are welcome.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running on GKE with Dataplane V2 (kube-proxy-less Cilium).

@zimnx
Collaborator

zimnx commented Oct 27, 2023

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
is 6, which corresponds to 63seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for an active TCP connection attempt will happen after 127seconds.
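
In other words, with the default initial RTO of 1 s and exponential backoff between attempts, the retransmissions go out 1, 2, 4, 8, 16 and 32 seconds apart, and the last one is given one more doubled interval before the attempt is abandoned:

$$
\underbrace{1 + 2 + 4 + 8 + 16 + 32}_{6\ \text{retransmissions}} = 63\ \text{s}, \qquad 63\ \text{s} + \underbrace{64\ \text{s}}_{\text{final wait}} = 127\ \text{s}
$$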


Remember that such a workaround would require running with additional Linux capabilities

Our nodesetup pods are already running as root, so no extra permissions are required.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

Removing only SYN_SENT entries for ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause more reconnection attempts.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

That's what I plan to look into next; maybe there's something else causing the disruption.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Nope

Like you said above, you haven't hit the issue while running on GKE with Dataplane V2 (kube-proxy-less Cilium).

I haven't tried it with different node teardown logic. It's something I want to try out later as well.

@zimnx
Collaborator

zimnx commented Oct 27, 2023

remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until a timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.

Another option would be to lower net.netfilter.nf_conntrack_tcp_timeout_syn_sent from the default 120s to a lower value from our sysctls-buddy container. I haven't tested it, but with this the conntrack entry translating the ClusterIP into a stale Pod IP would vanish earlier, and the next TCP retransmission would fail, triggering a reconnection. It is a global setting, though.

Leaving it as a workaround we might consider.

Doesn't work; this setting needs to be changed on the node, not in the container network namespace.

@zimnx
Collaborator

zimnx commented Oct 30, 2023

I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues in both ClusterIP and PodIP setups without minReadySeconds.

| setup | requests | EOF/request timeout | availability failures | success ratio |
| --- | --- | --- | --- | --- |
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999 |

We no longer see any availability errors, meaning Scylla rolls out without losing quorum.
The EOFs might be a Scylla or gocql bug, not related to the rollout.
Timeouts may happen, as the setups run with low, non-guaranteed resources.
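
For context, applying that timeout from the privileged node setup pod essentially comes down to writing the value under /proc/sys. A minimal sketch (not the actual DaemonSet code), assuming it runs in the host network namespace like the nodesetup pods, using the 20s value from the table above:

```go
// Hypothetical sketch of what the node setup step boils down to.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	const sysctl = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_syn_sent"
	// 20 seconds: the value that made both ClusterIP and PodIP setups pass above.
	if err := os.WriteFile(sysctl, []byte("20"), 0o644); err != nil {
		log.Fatalf("failed to set conntrack SYN_SENT timeout: %v", err)
	}
	fmt.Println("net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 20")
}
```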

Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.

| setup | requests | EOF/request timeout | availability failures | success ratio |
| --- | --- | --- | --- | --- |
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999 |

Versions >= 5.1 cause request failures while a node is shutting down.

Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios:

| setup | requests | EOF/request timeout | availability failures | success ratio |
| --- | --- | --- | --- | --- |
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91 |

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts.
Adding a configuration option to Scylla for controlling this timeout would allow us to get rid of this workaround.
We also need to fix Scylla, as the supported versions have a bug causing availability issues when a node is shutting down.

@rzetelskik
Member

rzetelskik commented Oct 30, 2023

@zimnx could we make it an optional feature in node setup? Users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like a reasonable set of choices.

@zimnx
Collaborator

zimnx commented Oct 31, 2023

Issues filed against Scylla regarding the loss of availability during node shutdown on newer versions and the rpc_client connection timeout:
scylladb/scylladb#15899
scylladb/scylladb#15901

@tnozicka
Member

Great job drilling down into this!

I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DaemonSet. It seems fragile, potentially affects other workloads, and is quite platform dependent. A timeout in Scylla seems like a more fitting and stable place, even if we have to wait a bit.

@mykaul
Contributor

mykaul commented Nov 2, 2023

Are we sure the shutdown is not hung, which would cause RPC connections not to close properly on both sides?

@mykaul
Contributor

mykaul commented Nov 2, 2023

@xemul looked at multiple RPC issues recently, maybe he has some ideas.

@zimnx
Collaborator

zimnx commented Nov 3, 2023

Are we sure the shutdown is not hung, which would cause RPC connections not to close properly on both sides?

The rollout proceeds; the process dies some time after we drain it and kill it, with no need for an ungraceful shutdown.
Kostia suggested it might be related to a too short shutdown-announce-in-ms, but setting it to 10 minutes didn't help. Details and logs from the run are available under scylladb/scylladb#15899

@zimnx
Collaborator

zimnx commented Nov 7, 2023

After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue.

The reason a timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port, so the session gets a new ip/port tuple, which requires iptables traversal and thus effectively uses the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters.
The problem is that it's not reliable:

  1. If, for some reason, the new attempt is not successful or takes longer to connect, a timeout on the other nodes wouldn't prevent the node being restarted from passing the readiness check.
  2. If Scylla is able to boot within seconds, a timeout set to a value higher than the boot time wouldn't prevent the restarted node from passing the readiness check.
  3. A new source port is not guaranteed. If it isn't new, nodes would be stuck with the same stale conntrack entry, preventing them from using the updated ClusterIP mapping.
  4. Setting the timeout to too low a value would cause trouble in environments with a slow connection between nodes.

We can force a connection timeout with the namespaced net.ipv4.tcp_syn_retries, which would mimic the proposed timeout only in the Scylla Pod. It wouldn't help in all these cases by itself. Additionally, we could set the non-namespaced net.netfilter.nf_conntrack_tcp_timeout_syn_sent, but that would only help with case 3. Moreover, on setups where the ClusterIP is NATed not by netfilter but by something else, a similar situation with a stale entry in the NAT map may happen. We won't be able to solve it once and for all.

Looks like even with a configurable timeout we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough.
We need to be sure every required connection is up both ways. Gossip is exchanged over one connection. Depending on the timing, some connections may be established and some may be stuck. If one of the established ones happens to be the gossip connection, we could have UN state agreement across all nodes. Whether a working gossip exchange alone is enough to call a node UN, I don't know; it's worth asking core.

Seems like one of the first ideas, asking multiple nodes about the status of ourselves, would be a good way to solve these availability issues. Together with a recommendation to set the additional sysctls, this should make readiness pass quickly enough.
I wonder if we will be able to ask for the status of a given node using raft.

@zimnx
Collaborator

zimnx commented Nov 7, 2023

I wonder if we will be able to ask for the status of a given node using raft.

We won't.
The status of a node consists of two states: the node state (normal, bootstrapping, joining, leaving) and the failure detector state (up/down). The node state will be consistent across the cluster with raft, but the failure detector state will still be per node.

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2024
@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 31, 2024
@zimnx
Collaborator

zimnx commented Jul 31, 2024

/remove-lifecycle rotten
/triage accepted

@scylla-operator-bot scylla-operator-bot bot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 31, 2024