Multiple E2E tests time out waiting for Scylla Cluster to roll out when running on arm64 machines #1997
Hit a different set of tests (only one repeated) on a second run.
All of the above runs had rotated scylla-operator logs, so I had to trigger another one and collect the operator logs at runtime: arm-scylla-operator-logs.tar.gz. The same for the corresponding periodic on amd, for comparison: amd-scylla-operator-logs.tar.gz. Looking at the operator logs from the arm run, we can see multiple sync loops taking up to a couple of minutes (!):
Looking in more detail at the logs related to a selected one, e.g.
During the ~10 minutes after the initial burst, key generation completes 237 times:
With the following average duration:
Meanwhile, for the amd run:
Although the difference doesn't seem that drastic, the wait is enough to push the tests over the timeouts. Both periodics are running Scylla Operator with the following settings:
I see a few ways to continue from this spot:
Thanks for looking into the runs. From your findings, key generation seems about 3.2 times slower on arm; if that's attributed as the major factor, the amd sync loop peak would be around 1m 33s - is that so? Both seem quite long to me.
pretty much, but we just use stdlib
e2e tests need marginally more certificates than a min, so I wouldn't hold your breath for that one
I think adjusting the parallelism per CI job seems like a reasonable action to take. Have you tried adjusting the operator cpu requests? We are pretty low.
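For illustration only - a minimal Go sketch of the kind of stdlib key generation mentioned above. The 4096-bit RSA key size and the use of crypto/rsa are assumptions for the sketch, not necessarily what the operator does:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"time"
)

func main() {
	// Generating an RSA key with the Go stdlib is purely CPU-bound, so on
	// slower cores (e.g. arm64 Tau T2A) each call takes proportionally longer.
	// The 4096-bit size here is an assumption for illustration only.
	start := time.Now()
	if _, err := rsa.GenerateKey(rand.Reader, 4096); err != nil {
		panic(err)
	}
	fmt.Printf("generated one RSA key in %v\n", time.Since(start))
}
```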
Roughly, yes.
$ cat scylla-operator.current | grep -Pe '"Finished syncing ScyllaCluster"' | grep -Pe 'duration="\d+m\d+.?\d+s"'
I0725 13:35:27.995150 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lfp8q-0-jhcrz/basic-t2lqx" duration="1m6.499951198s"
I0725 13:35:30.434984 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-dc7kg-0-pw2ls/basic-7w299" duration="1m9.010835583s"
I0725 13:35:35.864480 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5z6v6-0-7xn4l/basic-4fx9m" duration="1m14.333902012s"
I0725 13:35:36.680612 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-89fp2-0-f4pxt/basic-btbfh" duration="1m15.499290731s"
I0725 13:35:38.914919 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gf9wc-0-8ghhc/basic-tlx6t" duration="1m18.294550206s"
I0725 13:35:49.796967 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zqtj8-0-cmr8b/basic-75kw7" duration="1m27.345351206s"
I0725 13:35:51.311263 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-xssdd-0-n7qb2/basic-kblqm" duration="1m29.596688644s"
I0725 13:35:51.658069 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-r4gwr-0-mx8sz/basic-l9tc5" duration="1m30.248506202s"
I0725 13:35:52.631809 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-95f6r-0-zxk5d/basic-98f4w" duration="1m31.964820087s"
I0725 13:35:53.067595 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zb9hm-0-lwz5r/basic-d4tm4" duration="1m31.620670181s"
I0725 13:35:54.280823 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kmgxc-0-qvdmf/basic-jzh9m" duration="1m30.207411474s"
I0725 13:35:55.333719 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gqfln-0-qpr2c/basic-cluster" duration="1m30.863168538s"
I0725 13:35:55.443259 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-mkq9g-0-zfp8b/basic-jzg6b" duration="1m35.037964324s"
I0725 13:35:57.718797 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cms6b-0-8k7wb/basic-8wkgn" duration="1m36.104571344s"
I0725 13:35:58.830183 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5jtwt-0-nk4cr/basic-28pcd" duration="1m37.795002155s"
I0725 13:36:01.446831 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-pkd4f-0-kl2w8/basic-6qq8b" duration="1m37.014963923s"
I0725 13:36:01.929815 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-6cftn-0-s9hqt/basic-t46q9" duration="1m40.521120484s"
I0725 13:36:02.119672 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gn5v8-0-2pvjq/basic-lwdxf" duration="1m37.984944679s"
I0725 13:36:03.372593 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-rfr42-0-7ws9d/basic-x4mzq" duration="1m42.321278883s"
I0725 13:36:04.436859 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scylladbmonitoring-h7z9l-0-r5848/basic-f8ksq" duration="1m43.855739567s"
I0725 13:36:05.512803 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zrkbh-0-jm9ql/basic-r6rcs" duration="1m41.439237718s"
I0725 13:36:07.164390 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lj4f9-0-f95sd/basic-9w2cb" duration="1m44.523659635s"
I0725 13:36:08.255956 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zld8s-0-vz27s/basic-lt5qz" duration="1m42.435573268s"
I0725 13:36:08.747910 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-7wk58-0-ggwf5/basic-b6pmn" duration="1m44.016653585s"
I0725 13:36:09.785480 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cc9wv-0-bspk9/basic-pqxhl" duration="1m45.342717379s"
I0725 13:36:10.670517 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hhgpp-0-v9ggw/basic-zjxh8" duration="1m45.374712765s"
I0725 13:36:11.728104 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hcm9t-0-ggkhd/basic-bgf5k" duration="1m46.906860991s"
I0725 13:36:13.690345 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kc5n5-0-khqdd/basic-srzxf" duration="1m49.616922117s"
I0725 13:36:13.903032 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-btnpz-0-mv2xj/basic-4djtw" duration="1m50.014991322s"
I0725 13:36:14.815469 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-sm8px-0-f9txh/basic-rzhmc" duration="1m50.149596829s"
I0725 13:36:15.860553 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-w62bx-0-dlvld/basic-8twh5" duration="1m48.275981026s"
I0725 13:36:40.958250 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5567k-0-vsgx5/basic-4rbsg" duration="2m19.535899999s"
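As an aside, a minimal Go sketch of how the average of such durations could be computed from the log above; the file name and regexp are assumptions tied to the grep command, not project tooling:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

func main() {
	// Extracts the duration="..." field from "Finished syncing ScyllaCluster"
	// lines and prints the average. The file name is a placeholder.
	f, err := os.Open("scylla-operator.current")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	re := regexp.MustCompile(`Finished syncing ScyllaCluster.*duration="([^"]+)"`)
	var total time.Duration
	var n int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		m := re.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		total += d
		n++
	}
	if n > 0 {
		fmt.Printf("%d syncs, average duration %v\n", n, total/time.Duration(n))
	}
}
```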
How so? As per the report above, the number of generated keys in the ~10 minutes after the e2e workers started was ~250, while the min was set to 10. The workers all wait on a single queue to get their certificates generated, so imo this disproportion looks concerning.
I haven't; testing different setups either requires pushing changes to the CI or a lengthy manual process, hence I wanted to consult on the approach first. Is there a reason we're running the operator on that small a cpu share in CI now?
The min is also the number of certs you'll generate as waste before the end of your run. It serves to distribute the needed compute time across the curve, but from the 5m sync time I assume the operator is constantly under load.
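To make the min/queue discussion concrete, here is a simplified sketch of the pre-generated key pool pattern being described - not the operator's actual implementation; the names, key size, and buffer size are illustrative only:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
)

// keyPool keeps at least `min` pre-generated keys buffered in a channel;
// all consumers wait on that single channel, so if generation falls behind,
// they queue up. This mirrors the pattern discussed above, with made-up names.
type keyPool struct {
	keys chan *rsa.PrivateKey
}

func newKeyPool(min int) *keyPool {
	p := &keyPool{keys: make(chan *rsa.PrivateKey, min)}
	go func() {
		for {
			// Key generation is CPU-bound. Once the buffer holds `min` keys,
			// the send below blocks until a consumer takes one - which is why
			// up to `min` keys end up generated as waste at the end of a run.
			k, err := rsa.GenerateKey(rand.Reader, 4096)
			if err != nil {
				continue
			}
			p.keys <- k
		}
	}()
	return p
}

// Get blocks until a pre-generated key is available; with many parallel
// consumers they all wait on this one channel.
func (p *keyPool) Get() *rsa.PrivateKey {
	return <-p.keys
}

func main() {
	pool := newKeyPool(10)
	_ = pool.Get()
}
```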
Historically, to deploy to very small platforms, and it didn't generate certs back then.
Raising the operator's cpu requests to 100m didn't help (in fact, the peak reconciliation times were even higher, but I wouldn't take that as a direct consequence, or as a significant result given it was just one run). We discussed it internally and agreed on a couple of actionable items:
Further increasing the operator's cpu requests to 500m didn't seem to decrease reconciliation times at all; it seems that in the setups we're using in CI, the operator already sits at a higher resource consumption. Decreasing parallelisation by roughly a half seems like a reasonable next step.
I'm not sure increasing what we request would help anyhow. The operator lives on nodes separate from the Scylla ones, and those nodes are not busy. Hence it may still get all the available cores, regardless of the cpu requests settings.
We create the dedicated node pool only for the Scylla nodes.
Then indeed it fights with the others. We should fix that, as we have nodes (not huge ones) that are billable while nothing is being executed there.
In the best effort class you get cpu quota proportional to your requests, but given the sheer number of ScyllaCluster pods around, that's still a small fraction.
yeah, the operator goes into the workload pool
I don't have it, but it's something I've bumped as an issue a few times lately. We should define well-named taints and tolerations prefixed with our domain and apply that pattern throughout our deploy files, examples and the docs. It has its own issues, as explained in #2049. Even with dedicated nodes, the operator may steal cpu cycles from, say, the admission webhook and make it time out... Another option is to use a Guaranteed QoS class. That's less resource sharing but more predictable and reproducible results. It's also orthogonal to the taints and tolerations.
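A hedged sketch of what the two options might look like in Go; the domain-prefixed taint key and the 500m/512Mi values are hypothetical placeholders, not an existing convention in the project:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Hypothetical, domain-prefixed toleration for a dedicated node pool.
// The key and value are illustrative only.
var operatorToleration = corev1.Toleration{
	Key:      "scylla-operator.scylladb.com/dedicated",
	Operator: corev1.TolerationOpEqual,
	Value:    "operator",
	Effect:   corev1.TaintEffectNoSchedule,
}

// Guaranteed QoS requires requests == limits for every container resource.
// The 500m/512Mi values are placeholders.
var operatorResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
}

func main() {
	_ = operatorToleration
	_ = operatorResources
}
```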
https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1820339182455754752
I suggest we close this issue and take care of #2049 as a follow-up. @tnozicka should I take this one?
sounds good, thx
https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1807655512678862848#1:test-build-log.txt%3A2566
When running on arm64 machines, multiple tests in our e2e parallel suite fail waiting for Scylla Cluster to roll out.
We should determine whether it's caused by slower disks (Tau T2A machines do not support local SSDs) or whether it's infrastructure-specific.
/kind failing-test