
Multiple E2E tests time out waiting for Scylla Cluster to roll out when running on arm64 machines #1997

Closed
rzetelskik opened this issue Jul 1, 2024 · 15 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@rzetelskik (Member)

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1807655512678862848#1:test-build-log.txt%3A2566

Summarizing 8 Failures:
  [FAIL] ScyllaCluster [It] should set up Alternator API when enabled
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_alternator.go:106
  [FAIL] ScyllaCluster webhook [It] should forbid invalid requests
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_webhooks.go:91
  [FAIL] ScyllaCluster should be exposed [It] using ClusterIP Service, and broadcast it to both client and nodes
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:184
  [FAIL] ScyllaCluster [It] should connect to cluster via Ingresses
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:81
  [FAIL] ScyllaCluster [It] should setup and maintain up to date TLS certificates
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_tls.go:68
  [FAIL] ScyllaCluster should be exposed [It] using LoadBalancer Service, and broadcast external IP to client and PodIP to nodes
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:184
  [FAIL] ScyllaCluster [It] should rotate TLS certificates before they expire
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_tls.go:283
  [FAIL] ScyllaCluster [It] should create ingress objects when ingress exposeOptions are provided
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:425

When running on arm64 machines, multiple tests in our e2e parallel suite fail while waiting for the ScyllaCluster to roll out.
We should determine whether this is caused by slower disks (Tau T2A machines do not support local SSDs) or whether it's infrastructure-specific.

/kind failing-test

@scylla-operator-bot scylla-operator-bot bot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 1, 2024
@tnozicka tnozicka added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 1, 2024
@rzetelskik (Member Author)

Summarizing 4 Failures:
  [FAIL] ScyllaCluster [It] should set up Alternator API when enabled
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_alternator.go:106
  [FAIL] ScyllaCluster [It] should reconcile resource changes
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:312
  [FAIL] Scylla Manager integration should register cluster, sync backup tasks and support manual restore procedure [It] using workaround for consistent_cluster_management for ScyllaDB Enterprise 2024.1.X [RequiresObjectStorage]
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllamanager_object_storage.go:77
  [FAIL] ScyllaCluster authentication [It] agent requires authentication
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:312

A second run hit a different set of tests (only one of them repeated).

@rzetelskik (Member Author)

  [FAIL] ScyllaCluster [It] should set up Alternator API when enabled
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_alternator.go:106
  [FAIL] ScyllaCluster should be exposed [It] using LoadBalancer Service, and broadcast external IP to client and PodIP to nodes
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:184
  [FAIL] ScyllaCluster webhook [It] should forbid invalid requests
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:312
  [FAIL] ScyllaCluster [It] should connect to cluster via Ingresses
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:81
Ran 35 of 42 Specs in 2613.937 seconds
FAIL! -- 31 Passed | 4 Failed | 0 Pending | 7 Skipped

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1812728893174452224#1:test-build-log.txt%3A2526

@rzetelskik (Member Author)

Summarizing 7 Failures:
  [FAIL] ScyllaCluster [It] should set up Alternator API when enabled
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_alternator.go:106
  [FAIL] ScyllaCluster [It] should connect to cluster via Ingresses
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:81
  [FAIL] ScyllaCluster [It] should rotate TLS certificates before they expire
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_tls.go:283
  [FAIL] ScyllaCluster should be exposed [It] using LoadBalancer Service, and broadcast external IP to client and PodIP to nodes
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:184
  [FAIL] ScyllaCluster webhook [It] should forbid invalid requests
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_webhooks.go:91
  [FAIL] ScyllaCluster [It] should create ingress objects when ingress exposeOptions are provided
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_expose.go:425
  [FAIL] Scylla Manager integration [It] should register cluster and sync repair tasks
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:312
Ran 35 of 42 Specs in 2648.979 seconds
FAIL! -- 28 Passed | 7 Failed | 0 Pending | 7 Skipped

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1815265747585208320#1:test-build-log.txt%3A2625

@rzetelskik (Member Author) commented Jul 25, 2024

All of the above runs had their scylla-operator logs rotated already, so I had to trigger another run and collect the operator logs at runtime.

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1816472949608157184

arm-scylla-operator-logs.tar.gz

Same for the corresponding periodic on amd, for comparison:

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke/1816464102701142016

amd-scylla-operator-logs.tar.gz

Looking at the operator logs from the arm run, we can see multiple sync loops taking up to several minutes (!):

$ cat scylla-operator.current | grep -Pe '"Finished syncing ScyllaCluster"' | grep -Pe 'duration="\d+m\d+.?\d+s"'
I0725 14:11:10.756479       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-22284-0-8ckjq/basic-mdgcm" duration="2m7.728104396s"
I0725 14:13:14.374062       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-42ngm-0-fb6ng/basic-8lq6q" duration="4m11.345519112s"
I0725 14:13:17.246423       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-dxhdg-0-fzfw5/basic-ht4p2" duration="4m14.217749341s"
I0725 14:13:30.349335       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-vxv5r-0-4lnkm/basic-2qmxc" duration="4m27.318979934s"
I0725 14:13:43.533906       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-895ks-0-d4xkr/basic-xgws2" duration="4m40.505290246s"
I0725 14:13:50.272530       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-jw9gc-0-vg8wx/basic-km6mm" duration="4m47.243098724s"
I0725 14:13:53.858934       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-n8rjv-0-bnrwj/basic-l7hkf" duration="4m50.829226727s"
I0725 14:13:55.360155       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-chg8n-0-5cgl9/basic-kg57q" duration="4m52.331499482s"
I0725 14:13:55.697009       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-tzdvt-0-9c2bh/basic-rnqw2" duration="4m52.66683305s"
I0725 14:13:59.000582       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-s6mrn-0-fshxv/basic-k48nn" duration="4m55.970668727s"
I0725 14:14:01.170888       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-x5cff-0-z9srt/basic-p7gnl" duration="4m58.140463418s"
I0725 14:14:04.734596       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scylladbmonitoring-m4lh9-0-4fkmk/basic-ldz7m" duration="5m1.70386554s"
I0725 14:14:05.621209       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-rrc4k-0-6bxdg/basic-vxkh6" duration="5m2.591409241s"
I0725 14:14:07.119746       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hrnhf-0-jgqkc/basic-d4c52" duration="5m4.090724515s"
I0725 14:14:11.243630       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-p7szv-0-9m26f/basic-b9rjk" duration="5m8.213871571s"
I0725 14:14:12.106736       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zxbqd-0-tqsd9/basic-n88cp" duration="5m9.07602655s"
I0725 14:14:14.830402       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-jfn68-0-2jfk7/basic-cluster" duration="5m11.801252693s"
I0725 14:14:17.101075       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gwn8d-0-nz848/basic-gml5h" duration="5m14.072196305s"
I0725 14:14:19.261568       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-jgtk4-0-qfmv8/basic-xq4bc" duration="5m16.232216795s"
I0725 14:14:20.861825       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6" duration="5m17.833448151s"
...
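As an aside, the peak sync duration can be pulled out of the same log with a similar pipeline. This is just a sketch, assuming GNU grep/sed/awk and the duration="XmY.Zs" format visible above (like the original filter, it only catches minute-scale durations):

```shell
# Report the single longest "Finished syncing ScyllaCluster" duration.
grep -e '"Finished syncing ScyllaCluster"' scylla-operator.current \
  | grep -oE 'duration="[0-9]+m[0-9.]+s"' \
  | sed -E 's/duration="([0-9]+)m([0-9.]+)s"/\1 \2/' \
  | awk '{ s = $1 * 60 + $2; if (s > max) max = s }
         END { printf "max: %dm%.0fs\n", int(max / 60), max % 60 }'
```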

Looking in detail at the logs related to a selected one, e.g. e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6, what stands out are the exceptionally long certificate creation times:

$ cat scylla-operator.current | grep -Pe "basic-7xzt6" | grep -Pe "Creating certificate|Certificate created"
I0725 14:09:06.446762       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-client-ca" Reason="needs new cert"
I0725 14:09:48.664868       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-client-ca"
I0725 14:09:48.708975       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-user-admin" Reason="needs new cert"
I0725 14:11:25.862102       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-user-admin"
I0725 14:11:25.887837       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-ca" Reason="needs new cert"
I0725 14:12:53.069222       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-ca"
I0725 14:12:53.104701       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="needs new cert"
I0725 14:14:20.866875       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="needs new cert"
I0725 14:16:07.585097       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:16:17.713971       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:17:57.205012       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:18:41.995833       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:19:30.792616       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:19:45.431125       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:20:40.414144       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:20:50.462781       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:21:13.963127       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:21:41.906365       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:21:51.758825       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:23:52.682616       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:23:52.712412       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:24:42.351605       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:24:42.399630       1 kubecrypto/certs.go:206] "Certificate created" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs"
I0725 14:25:52.819280       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:25:52.854002       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:25:52.906009       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:25:53.007226       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:25:53.061783       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
I0725 14:25:53.088583       1 kubecrypto/certs.go:201] "Creating certificate" Secret="e2e-test-scyllacluster-2fws8-0-cs67t/basic-7xzt6-local-serving-certs" Reason="certificate needs an update"
...

During the ~10 minutes after the initial burst, key generation completes 237 times:

$ cat scylla-operator.current | grep -Pe "Generating item finished" | grep -Pe "I0725 14:(09|1\d):\d+.?\d* " | wc -l
237

With the following average duration:

$ cat scylla-operator.current | grep -Pe "Generating item finished" | grep -Pe "I0725 14:(09|1\d):\d+.?\d* " | awk 'match($0, /Duration="([0-9.]+)(ms|s)"/, a) { d = a[1]; if (a[2] == "ms") { d = d / 1000; }; s += d; c++; } END { print s / c, "s" ; }'
5.39327 s
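Since a long tail can matter more than the mean for timeout-bound tests, the slowest single generation in the same window may also be worth a look. A sketch of the same filter, written in POSIX awk (so it doesn't need gawk's 3-argument match() like the command above):

```shell
# Slowest single key generation in the same time window.
grep -e 'Generating item finished' scylla-operator.current \
  | grep -E 'I0725 14:(09|1[0-9]):' \
  | awk '{
      if (match($0, /Duration="[0-9.]+(ms|s)"/)) {
        v = substr($0, RSTART + 10, RLENGTH - 11);   # e.g. 5.5s or 800ms
        if (v ~ /ms$/) d = substr(v, 1, length(v) - 2) / 1000;
        else           d = substr(v, 1, length(v) - 1) + 0;
        if (d > max) max = d;
      }
    } END { print max, "s" }'
```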

Meanwhile, for the amd run:

$ cat scylla-operator.current | grep -Pe "Generating item finished" | grep -Pe "I0725 13:(3[4-9]|4[0-3]):\d+.?\d* " | wc -l
559
$ cat scylla-operator.current | grep -Pe "Generating item finished" | grep -Pe "I0725 13:(3[4-9]|4[0-5]):\d+.?\d* " | awk 'match($0, /Duration="([0-9.]+)(ms|s)"/, a) { d = a[1]; if (a[2] == "ms") { d = d / 1000; }; s += d; c++; } END { print s / c, "s" ; }'
1.69285 s

Although the difference doesn't seem that drastic, the wait is enough to push the tests over the timeouts.
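For scale, the two averages can be put side by side; a quick back-of-the-envelope sketch (the 300 s figure is my rounding of the ~5-minute arm sync peaks from the log excerpt above, so treat the projection as approximate):

```shell
# Ratio of the mean generation times (arm vs amd), and the arm sync peak
# scaled down by that ratio as a rough projection for amd.
awk 'BEGIN {
  ratio = 5.39327 / 1.69285;          # ~3.2x slower on arm
  amd_peak = 300 / ratio;             # ~5m arm peak scaled to amd, in seconds
  printf "ratio: %.1fx, projected amd peak: %dm %ds\n",
         ratio, int(amd_peak / 60), amd_peak % 60
}'
```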

Both periodics are running Scylla Operator with the following settings:

I0725 14:08:48.702405       1 flag/flags.go:64] FLAG: --crypto-key-buffer-delay="200ms"
I0725 14:08:48.702409       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-max="30"
I0725 14:08:48.702412       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-min="10"

I see a few ways to continue from this spot:

  • assessing whether the difference in key generation time is expected with the selected setups
  • given the difference is expected, increasing crypto-key-buffer-size-min (I haven't tested this or thought about the implications yet)
  • given the difference is expected, limiting the parallelism

@tnozicka (Member)

Thanks for looking into the runs. From your findings, key generation seems about 3.2 times slower on arm, which, if attributed as the major factor, would put the amd sync loop peak at around 1m 33s - is that so? Both seem quite long to me.

assessing whether the difference in key generation time is expected with the selected setups

pretty much, but we just use stdlib

given the difference is expected, increasing crypto-key-buffer-size-min (I haven't tested this or thought about the implications yet)

e2e tests need marginally more certificates than a min, so I wouldn't hold your breath for that one

given the difference is expected, limiting the parallelism

I think adjusting the parallelism per CI job seems like a reasonable action to take.

Have you tried adjusting operator cpu requests? We are pretty low (10m) and proportionally to other workloads that can suck.
Also, for the periodics we should set up the nodes closer to production, with dedicated pools; that would also help if the cpu is being "stolen" by another workload.

@rzetelskik (Member Author) commented Jul 26, 2024

Thanks for looking into the runs. From your findings, key generation seems about 3.2 times slower on arm, which, if attributed as the major factor, would put the amd sync loop peak at around 1m 33s - is that so? Both seem quite long to me.

Roughly, yes.

$ cat scylla-operator.current | grep -Pe '"Finished syncing ScyllaCluster"' | grep -Pe 'duration="\d+m\d+.?\d+s"'
I0725 13:35:27.995150       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lfp8q-0-jhcrz/basic-t2lqx" duration="1m6.499951198s"
I0725 13:35:30.434984       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-dc7kg-0-pw2ls/basic-7w299" duration="1m9.010835583s"
I0725 13:35:35.864480       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5z6v6-0-7xn4l/basic-4fx9m" duration="1m14.333902012s"
I0725 13:35:36.680612       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-89fp2-0-f4pxt/basic-btbfh" duration="1m15.499290731s"
I0725 13:35:38.914919       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gf9wc-0-8ghhc/basic-tlx6t" duration="1m18.294550206s"
I0725 13:35:49.796967       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zqtj8-0-cmr8b/basic-75kw7" duration="1m27.345351206s"
I0725 13:35:51.311263       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-xssdd-0-n7qb2/basic-kblqm" duration="1m29.596688644s"
I0725 13:35:51.658069       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-r4gwr-0-mx8sz/basic-l9tc5" duration="1m30.248506202s"
I0725 13:35:52.631809       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-95f6r-0-zxk5d/basic-98f4w" duration="1m31.964820087s"
I0725 13:35:53.067595       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zb9hm-0-lwz5r/basic-d4tm4" duration="1m31.620670181s"
I0725 13:35:54.280823       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kmgxc-0-qvdmf/basic-jzh9m" duration="1m30.207411474s"
I0725 13:35:55.333719       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gqfln-0-qpr2c/basic-cluster" duration="1m30.863168538s"
I0725 13:35:55.443259       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-mkq9g-0-zfp8b/basic-jzg6b" duration="1m35.037964324s"
I0725 13:35:57.718797       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cms6b-0-8k7wb/basic-8wkgn" duration="1m36.104571344s"
I0725 13:35:58.830183       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5jtwt-0-nk4cr/basic-28pcd" duration="1m37.795002155s"
I0725 13:36:01.446831       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-pkd4f-0-kl2w8/basic-6qq8b" duration="1m37.014963923s"
I0725 13:36:01.929815       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-6cftn-0-s9hqt/basic-t46q9" duration="1m40.521120484s"
I0725 13:36:02.119672       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gn5v8-0-2pvjq/basic-lwdxf" duration="1m37.984944679s"
I0725 13:36:03.372593       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-rfr42-0-7ws9d/basic-x4mzq" duration="1m42.321278883s"
I0725 13:36:04.436859       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scylladbmonitoring-h7z9l-0-r5848/basic-f8ksq" duration="1m43.855739567s"
I0725 13:36:05.512803       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zrkbh-0-jm9ql/basic-r6rcs" duration="1m41.439237718s"
I0725 13:36:07.164390       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lj4f9-0-f95sd/basic-9w2cb" duration="1m44.523659635s"
I0725 13:36:08.255956       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zld8s-0-vz27s/basic-lt5qz" duration="1m42.435573268s"
I0725 13:36:08.747910       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-7wk58-0-ggwf5/basic-b6pmn" duration="1m44.016653585s"
I0725 13:36:09.785480       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cc9wv-0-bspk9/basic-pqxhl" duration="1m45.342717379s"
I0725 13:36:10.670517       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hhgpp-0-v9ggw/basic-zjxh8" duration="1m45.374712765s"
I0725 13:36:11.728104       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hcm9t-0-ggkhd/basic-bgf5k" duration="1m46.906860991s"
I0725 13:36:13.690345       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kc5n5-0-khqdd/basic-srzxf" duration="1m49.616922117s"
I0725 13:36:13.903032       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-btnpz-0-mv2xj/basic-4djtw" duration="1m50.014991322s"
I0725 13:36:14.815469       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-sm8px-0-f9txh/basic-rzhmc" duration="1m50.149596829s"
I0725 13:36:15.860553       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-w62bx-0-dlvld/basic-8twh5" duration="1m48.275981026s"
I0725 13:36:40.958250       1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5567k-0-vsgx5/basic-4rbsg" duration="2m19.535899999s"

e2e tests need marginally more certificates than a min, so I wouldn't hold your breath for that one

How so? As per the report above, the number of generated keys in the ~10 minutes after the e2e workers started was ~250, while the min was set to 10. The workers all wait on a single queue to get their certificates generated, so imo this disproportion looks concerning.
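The disproportion is easy to quantify from the counts reported earlier (237 generations in the ~10-minute burst, against a buffer minimum of 10):

```shell
# Key generation demand per minute during the burst, vs the buffer minimum.
awk 'BEGIN {
  demand = 237 / 10;                  # generations per minute
  printf "demand: %.1f keys/min, buffer min: 10\n", demand
}'
```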

Have you tried adjusting operator cpu requests? We are pretty low (10m) and proportionally to other workloads that can suck.

I haven't; testing different setups requires either pushing changes to the CI or a lengthy manual process, hence I wanted to consult on the approach first. Is there a reason we're running the operator on that small of a cpu share in CI now?

@tnozicka (Member)

How so? As per the report above, the number of generated keys in the ~10 minutes after the e2e workers started was ~250, while the min was set to 10.

The min is also the number of certs you'll generate as waste before the end of your run.

It serves to distribute the needed compute time across the curve, but judging from the 5m sync times I assume the operator is constantly under load.

I haven't; testing different setups requires either pushing changes to the CI or a lengthy manual process, hence I wanted to consult on the approach first. Is there a reason we're running the operator on that small of a cpu share in CI now?

Historically to deploy to very small platforms and it didn't generate certs back then. 100m or so still lgtm, if it helps.

@rzetelskik (Member Author)

Raising the operator's cpu requests to 100m didn't help (in fact the peak reconciliation times were even higher, but I wouldn't take that as a direct consequence, nor as a significant result from a single run).

We discussed it internally and agreed on a couple of actionable items:

  • further increase the operator's cpu requests (not in the general-purpose deploy manifests this time, but through environment variables passed to the CI scripts)
  • limit parallelism (again via environment variables in the CI scripts)
  • observe the min/max values for key generation and raise them if the buffer sits empty or hits the rate limiting too often
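For the last item, a rough proxy (assuming the same "Generating item finished" log line used above) is to bucket key generations per minute and watch for demand sustained well above the buffer minimum:

```shell
# Count key generations per minute; sustained high counts suggest the buffer
# never gets a chance to refill between bursts.
grep -e 'Generating item finished' scylla-operator.current \
  | awk '{ split($2, t, ":"); m = t[1] ":" t[2]; c[m]++ }
         END { for (k in c) print k, c[k] }' \
  | sort
```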

@rzetelskik (Member Author)

Further increasing the operator's cpu requests to 500m didn't seem to decrease reconciliation times at all; it seems that in the setups we're using in CI the operator already sits at higher resource consumption. Decreasing parallelism by roughly a half seems like a reasonable next step.

@zimnx (Collaborator) commented Jul 31, 2024

Further increasing the operator's cpu requests to 500m didn't seem to decrease reconciliation times at all; it seems that in the setups we're using in CI the operator already sits at higher resource consumption. Decreasing parallelism by roughly a half seems like a reasonable next step.

I'm not sure increasing what we request would help anyhow. The operator lives on nodes separate from the Scylla nodes, and those are not busy. Hence it may still get all available cores, regardless of the cpu requests settings.

@rzetelskik (Member Author)

The operator lives on nodes separate from the Scylla nodes, and those are not busy. Hence it may still get all available cores, regardless of the cpu requests settings.

We create the dedicated infra node pools, but none of the workloads actually have the toleration for the dedicated-pool taint.
Perhaps this is something worth fixing before adjusting the config.
Iirc we had an issue for it in one of the repositories but I can't find anything right now, @tnozicka do you recall if we have it?

@zimnx (Collaborator) commented Jul 31, 2024

We create the dedicated infra node pools, but none of the workloads actually have the toleration for the dedicated-pool taint.

Then indeed it fights with the others. We should fix that, as we have nodes (not huge ones) that are billable but have nothing executing on them.

@tnozicka (Member) commented Aug 1, 2024

I'm not sure increasing what we request would help anyhow.

In the best-effort class you get cpu quota proportional to your requests, but given the sheer number of scyllacluster pods around, that's still a small fraction.

We create the dedicated infra node pools, but none of the workloads actually have the toleration for the dedicated-pool taint.

yeah, the operator goes into the workload pool
nodeName: gke-so-d8edd813-89ae-4968-95d-workers-73850519-jnss

Iirc we had an issue for it in one of the repositories but I can't find anything right now, @tnozicka do you recall if we have it?

I don't have it, but it's something I've bumped as an issue a few times lately. We should define well-named taints and tolerations prefixed with our domain and apply that pattern through our deploy files, examples, and the docs. It has its own issues, as explained in #2049.

Even with dedicated nodes, the operator may steal cpu cycles from, say, the admission webhook and make it time out...

Another option is to use the Guaranteed QoS class. That means less resource sharing but more predictable and reproducible results. It's also orthogonal to the taints and tolerations.

@rzetelskik (Member Author)

Passed with lower parallelism: https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1820339182455754752

I suggest we close this issue and take care of #2049 as a follow-up. @tnozicka should I take this one?

@tnozicka (Member) commented Aug 5, 2024

sounds good, thx

@tnozicka tnozicka closed this as completed Aug 5, 2024