Multiple E2E tests time out waiting for Scylla Cluster to roll out when running on arm64 machines #1997
Hit a different set of tests (only one repeated) on a second run.
All of the above runs had rotated scylla-operator logs, so I had to trigger another one and collect the operator logs at runtime: arm-scylla-operator-logs.tar.gz. The same for the corresponding periodic on amd, for comparison: amd-scylla-operator-logs.tar.gz. Looking at the operator logs from the arm run, we can see multiple sync loops taking up to a couple of minutes (!):
Looking in more detail at the logs related to a selected one, e.g.
During the ~10 minutes after the initial burst, key generation completes 237 times:
With the following average duration:
Meanwhile, for the amd run:
Although the difference doesn't seem that drastic, the wait is enough to push the tests over the timeouts. Both periodics are running Scylla Operator with the following settings:
I see a few ways to continue from this spot:
Thanks for looking into the runs. From your findings, key generation seems about 3.2 times slower on arm; if that's attributed as the major factor, the amd sync loop peak would be around 1m 33s - is that so? Both seem quite long to me.
pretty much, but we just use stdlib
e2e tests need marginally more certificates than a min, so I wouldn't hold your breath for that one
I think adjusting the parallelism per CI job seems like a reasonable action to take. Have you tried adjusting the operator cpu requests? We are pretty low.
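For illustration only - a minimal Go sketch of the kind of stdlib key generation mentioned above. The 4096-bit RSA key size and the use of crypto/rsa are assumptions for the sketch, not necessarily what the operator does:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"time"
)

func main() {
	// Generating an RSA key with the Go stdlib is purely CPU-bound, so on
	// slower cores (e.g. arm64 Tau T2A) each call takes proportionally longer.
	// The 4096-bit size here is an assumption for illustration only.
	start := time.Now()
	if _, err := rsa.GenerateKey(rand.Reader, 4096); err != nil {
		panic(err)
	}
	fmt.Printf("generated one RSA key in %v\n", time.Since(start))
}
```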
Roughly, yes.
$ cat scylla-operator.current | grep -Pe '"Finished syncing ScyllaCluster"' | grep -Pe 'duration="\d+m\d+.?\d+s"'
I0725 13:35:27.995150 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lfp8q-0-jhcrz/basic-t2lqx" duration="1m6.499951198s"
I0725 13:35:30.434984 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-dc7kg-0-pw2ls/basic-7w299" duration="1m9.010835583s"
I0725 13:35:35.864480 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5z6v6-0-7xn4l/basic-4fx9m" duration="1m14.333902012s"
I0725 13:35:36.680612 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-89fp2-0-f4pxt/basic-btbfh" duration="1m15.499290731s"
I0725 13:35:38.914919 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gf9wc-0-8ghhc/basic-tlx6t" duration="1m18.294550206s"
I0725 13:35:49.796967 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zqtj8-0-cmr8b/basic-75kw7" duration="1m27.345351206s"
I0725 13:35:51.311263 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-xssdd-0-n7qb2/basic-kblqm" duration="1m29.596688644s"
I0725 13:35:51.658069 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-r4gwr-0-mx8sz/basic-l9tc5" duration="1m30.248506202s"
I0725 13:35:52.631809 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-95f6r-0-zxk5d/basic-98f4w" duration="1m31.964820087s"
I0725 13:35:53.067595 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zb9hm-0-lwz5r/basic-d4tm4" duration="1m31.620670181s"
I0725 13:35:54.280823 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kmgxc-0-qvdmf/basic-jzh9m" duration="1m30.207411474s"
I0725 13:35:55.333719 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gqfln-0-qpr2c/basic-cluster" duration="1m30.863168538s"
I0725 13:35:55.443259 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-mkq9g-0-zfp8b/basic-jzg6b" duration="1m35.037964324s"
I0725 13:35:57.718797 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cms6b-0-8k7wb/basic-8wkgn" duration="1m36.104571344s"
I0725 13:35:58.830183 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5jtwt-0-nk4cr/basic-28pcd" duration="1m37.795002155s"
I0725 13:36:01.446831 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-pkd4f-0-kl2w8/basic-6qq8b" duration="1m37.014963923s"
I0725 13:36:01.929815 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-6cftn-0-s9hqt/basic-t46q9" duration="1m40.521120484s"
I0725 13:36:02.119672 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-gn5v8-0-2pvjq/basic-lwdxf" duration="1m37.984944679s"
I0725 13:36:03.372593 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-rfr42-0-7ws9d/basic-x4mzq" duration="1m42.321278883s"
I0725 13:36:04.436859 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scylladbmonitoring-h7z9l-0-r5848/basic-f8ksq" duration="1m43.855739567s"
I0725 13:36:05.512803 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zrkbh-0-jm9ql/basic-r6rcs" duration="1m41.439237718s"
I0725 13:36:07.164390 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-lj4f9-0-f95sd/basic-9w2cb" duration="1m44.523659635s"
I0725 13:36:08.255956 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-zld8s-0-vz27s/basic-lt5qz" duration="1m42.435573268s"
I0725 13:36:08.747910 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-7wk58-0-ggwf5/basic-b6pmn" duration="1m44.016653585s"
I0725 13:36:09.785480 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-cc9wv-0-bspk9/basic-pqxhl" duration="1m45.342717379s"
I0725 13:36:10.670517 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hhgpp-0-v9ggw/basic-zjxh8" duration="1m45.374712765s"
I0725 13:36:11.728104 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-hcm9t-0-ggkhd/basic-bgf5k" duration="1m46.906860991s"
I0725 13:36:13.690345 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-kc5n5-0-khqdd/basic-srzxf" duration="1m49.616922117s"
I0725 13:36:13.903032 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-btnpz-0-mv2xj/basic-4djtw" duration="1m50.014991322s"
I0725 13:36:14.815469 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-sm8px-0-f9txh/basic-rzhmc" duration="1m50.149596829s"
I0725 13:36:15.860553 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-w62bx-0-dlvld/basic-8twh5" duration="1m48.275981026s"
I0725 13:36:40.958250 1 scyllacluster/sync.go:35] "Finished syncing ScyllaCluster" ScyllaCluster="e2e-test-scyllacluster-5567k-0-vsgx5/basic-4rbsg" duration="2m19.535899999s"
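As an aside, a minimal Go sketch of how the average of such durations could be computed from the log above; the file name and regexp are assumptions tied to the grep command, not project tooling:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

func main() {
	// Extracts the duration="..." field from "Finished syncing ScyllaCluster"
	// lines and prints the average. The file name is a placeholder.
	f, err := os.Open("scylla-operator.current")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	re := regexp.MustCompile(`Finished syncing ScyllaCluster.*duration="([^"]+)"`)
	var total time.Duration
	var n int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		m := re.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		total += d
		n++
	}
	if n > 0 {
		fmt.Printf("%d syncs, average duration %v\n", n, total/time.Duration(n))
	}
}
```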
How so? As per the report above, the number of generated keys in the ~10 minutes after the e2e workers started was ~250, while the min was set to 10. The workers all wait on a single queue to get their certificates generated, so imo this disproportion looks concerning.
I haven't; testing different setups either requires pushing changes to the CI or a lengthy manual process, hence I wanted to consult on the approach first. Is there a reason we're running the operator on that small a cpu share in CI now?
The min is also the number of certs you'll generate as waste before the end of your run. It serves to distribute the needed compute time across the curve, but from the 5m sync time I assume the operator is constantly under load.
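To make the min/queue discussion concrete, here is a simplified sketch of the pre-generated key pool pattern being described - not the operator's actual implementation; the names, key size, and buffer size are illustrative only:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
)

// keyPool keeps at least `min` pre-generated keys buffered in a channel;
// all consumers wait on that single channel, so if generation falls behind,
// they queue up. This mirrors the pattern discussed above, with made-up names.
type keyPool struct {
	keys chan *rsa.PrivateKey
}

func newKeyPool(min int) *keyPool {
	p := &keyPool{keys: make(chan *rsa.PrivateKey, min)}
	go func() {
		for {
			// Key generation is CPU-bound. Once the buffer holds `min` keys,
			// the send below blocks until a consumer takes one - which is why
			// up to `min` keys end up generated as waste at the end of a run.
			k, err := rsa.GenerateKey(rand.Reader, 4096)
			if err != nil {
				continue
			}
			p.keys <- k
		}
	}()
	return p
}

// Get blocks until a pre-generated key is available; with many parallel
// consumers they all wait on this one channel.
func (p *keyPool) Get() *rsa.PrivateKey {
	return <-p.keys
}

func main() {
	pool := newKeyPool(10)
	_ = pool.Get()
}
```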
Historically, to deploy to very small platforms, and it didn't generate certs back then.
Raising the operator's cpu requests to 100m didn't help (in fact, the peak reconciliation times were even higher, but I wouldn't take that as a direct consequence, or as a significant result given it was just one run). We discussed it internally and agreed on a couple of actionable items:
Further increasing the operator's cpu requests to 500m didn't seem to decrease reconciliation times at all; it seems that in the setups we're using in CI, the operator already sits at a higher resource consumption. Decreasing parallelisation by roughly a half seems like a reasonable next step.
I'm not sure increasing what we request would help anyhow. The operator lives on nodes separate from the Scylla ones, and those nodes are not busy. Hence it may still get all the available cores, regardless of the cpu requests settings.
We create the dedicated node pool only for the Scylla nodes.
Then indeed it fights with the others. We should fix that, as we have nodes (not huge ones) that are billable while nothing is being executed there.
In the best effort class you get cpu quota proportional to your requests, but given the sheer number of ScyllaCluster pods around, that's still a small fraction.
yeah, the operator goes into the workload pool
I don't have it, but it's something I've bumped as an issue a few times lately. We should define well-named taints and tolerations prefixed with our domain and apply that pattern throughout our deploy files, examples and the docs. It has its own issues, as explained in #2049. Even with dedicated nodes, the operator may steal cpu cycles from, say, the admission webhook and make it time out... Another option is to use a Guaranteed QoS class. That's less resource sharing but more predictable and reproducible results. It's also orthogonal to the taints and tolerations.
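A hedged sketch of what the two options might look like in Go; the domain-prefixed taint key and the 500m/512Mi values are hypothetical placeholders, not an existing convention in the project:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Hypothetical, domain-prefixed toleration for a dedicated node pool.
// The key and value are illustrative only.
var operatorToleration = corev1.Toleration{
	Key:      "scylla-operator.scylladb.com/dedicated",
	Operator: corev1.TolerationOpEqual,
	Value:    "operator",
	Effect:   corev1.TaintEffectNoSchedule,
}

// Guaranteed QoS requires requests == limits for every container resource.
// The 500m/512Mi values are placeholders.
var operatorResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
}

func main() {
	_ = operatorToleration
	_ = operatorResources
}
```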
https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1820339182455754752
I suggest we close this issue and take care of #2049 as a follow-up. @tnozicka should I take this one?
sounds good, thx
https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/logs/ci-scylla-operator-latest-e2e-gke-arm64-parallel/1807655512678862848#1:test-build-log.txt%3A2566
When running on arm64 machines, multiple tests in our e2e parallel suite fail waiting for Scylla Cluster to roll out.
We should determine whether it's caused by slower disks (Tau T2A machines do not support local SSDs) or whether it's infrastructure-specific.
/kind failing-test