Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flake: ScyllaCluster should allow to build connection pool using shard aware ports #1028

Open
tnozicka opened this issue Sep 15, 2022 · 39 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@tnozicka
Copy link
Member

https://github.com/scylladb/scylla-operator/actions/runs/3056629503/jobs/4931011388#step:12:664

↺ [FLAKEY TEST - TOOK 2 ATTEMPTS TO PASS] [SLOW TEST] [716.582 seconds]
ScyllaCluster
github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_shardawareness.go:25
  should allow to build connection pool using shard aware ports
  github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_shardawareness.go:30

  Begin Captured StdOut/StdErr Output >>
    I0914 23:21:03.448313      47 framework/framework.go:249] "Namespace removed." Namespace="e2e-test-scyllacluster-jmx8m-nqlnz"
    I0914 23:26:03.711932      47 framework/framework.go:249] "Namespace removed." Namespace="e2e-test-scyllacluster-jmx8m-psjsw"
  << End Captured StdOut/StdErr Output

  Begin Captured GinkgoWriter Output >>
    [BeforeEach] ScyllaCluster
      github.com/scylladb/scylla-operator/test/e2e/framework/framework.go:57
    STEP: Creating a new namespace 09/14/22 23:14:07.13
    Sep 14 23:14:07.185: INFO: Created namespace "e2e-test-scyllacluster-jmx8m-nqlnz".
    STEP: Waiting for ServiceAccount "e2e-user" in namespace "e2e-test-scyllacluster-jmx8m-nqlnz". 09/14/22 23:14:07.215
    STEP: Waiting for default ServiceAccount in namespace "e2e-test-scyllacluster-jmx8m-nqlnz". 09/14/22 23:14:08.878
    [It] should allow to build connection pool using shard aware ports
      github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_shardawareness.go:30
    STEP: Creating a ScyllaCluster 09/14/22 23:14:08.883
    STEP: Waiting for the ScyllaCluster to deploy 09/14/22 23:14:08.9
    Sep 14 23:17:38.950: INFO: ScyllaCluster e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6 (RV=4358) is rolled out
    STEP: Waiting for the driver to establish connection to shards 09/14/22 23:17:43.979
    Sep 14 23:17:43.981: INFO: Connecting to 10.103.26.49:9042 using 0 source port
    Sep 14 23:17:44.019: INFO: Connecting to 10.103.26.49:9042 using 0 source port
    Sep 14 23:17:44.063: INFO: Connecting to 10.103.26.49:9042 using 0 source port
    Sep 14 23:17:44.072: INFO: Connecting to 10.103.26.49:19042 using 32769 source port
    Sep 14 23:17:44.072: INFO: Connecting to 10.103.26.49:19042 using 32771 source port
    [AfterEach] ScyllaCluster
      github.com/scylladb/scylla-operator/test/e2e/framework/framework.go:58
    STEP: Collecting events from namespace "e2e-test-scyllacluster-jmx8m-nqlnz". 09/14/22 23:17:49.086
    STEP: Found 32 events. 09/14/22 23:17:49.189
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-auth-token: {scyllacluster-controller } SecretCreated: Secret e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-auth-token created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-member: {scyllacluster-controller } ServiceAccountCreated: ServiceAccount e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-member created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-member: {scyllacluster-controller } RoleBindingCreated: RoleBinding e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-member created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a: {statefulset-controller } SuccessfulCreate: create Pod basic-x5qk6-us-east-1-us-east-1a-0 in StatefulSet basic-x5qk6-us-east-1-us-east-1a successful
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a: {scyllacluster-controller } StatefulSetCreated: StatefulSet e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-us-east-1-us-east-1a created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a: {statefulset-controller } SuccessfulCreate: create Claim data-basic-x5qk6-us-east-1-us-east-1a-0 Pod basic-x5qk6-us-east-1-us-east-1a-0 in StatefulSet basic-x5qk6-us-east-1-us-east-1a success
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {default-scheduler } FailedScheduling: 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for data-basic-x5qk6-us-east-1-us-east-1a-0: {k8s.io/minikube-hostpath_fv-az626-979_558fc76d-00a8-498b-80bc-2fa6555824a2 } ProvisioningSucceeded: Successfully provisioned volume pvc-ccd7b6da-acdc-4e51-9754-cb823e77e831
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for data-basic-x5qk6-us-east-1-us-east-1a-0: {k8s.io/minikube-hostpath_fv-az626-979_558fc76d-00a8-498b-80bc-2fa6555824a2 } Provisioning: External provisioner is provisioning volume for claim "e2e-test-scyllacluster-jmx8m-nqlnz/data-basic-x5qk6-us-east-1-us-east-1a-0"
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:27 +0000 UTC - event for data-basic-x5qk6-us-east-1-us-east-1a-0: {persistentvolume-controller } ExternalProvisioning: waiting for a volume to be created, either by external provisioner "k8s.io/minikube-hostpath" or manually created by system administrator
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:28 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {default-scheduler } Scheduled: Successfully assigned e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-us-east-1-us-east-1a-0 to fv-az626-979
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:37 +0000 UTC - event for basic-x5qk6-client: {scyllacluster-controller } ServiceCreated: Service e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-client created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:38 +0000 UTC - event for basic-x5qk6: {scyllacluster-controller } PodDisruptionBudgetCreated: PodDisruptionBudget e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6 created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:38 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {scyllacluster-controller } ServiceCreated: Service e2e-test-scyllacluster-jmx8m-nqlnz/basic-x5qk6-us-east-1-us-east-1a-0 created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:43 +0000 UTC - event for nodeconfig-podinfo-5f0cab8e-19f4-4e22-8749-d77b7b7d48ce: {NodeConfigCM-controller } ConfigMapCreated: ConfigMap e2e-test-scyllacluster-jmx8m-nqlnz/nodeconfig-podinfo-5f0cab8e-19f4-4e22-8749-d77b7b7d48ce created
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:47 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} FailedMount: MountVolume.SetUp failed for volume "scylla-agent-config-volume" : failed to sync secret cache: timed out waiting for the condition
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:47 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} FailedMount: MountVolume.SetUp failed for volume "scylla-agent-auth-token-volume" : failed to sync secret cache: timed out waiting for the condition
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:48 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} FailedMount: MountVolume.SetUp failed for volume "kube-api-access-2h4nx" : failed to sync configmap cache: timed out waiting for the condition
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:49 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Pulled: Container image "docker.io/scylladb/scylla-operator:ci" already present on machine
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:50 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Started: Started container sidecar-injection
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:50 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Created: Created container sidecar-injection
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:59 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Created: Created container scylla
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:59 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Started: Started container scylla
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:59 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Pulling: Pulling image "docker.io/scylladb/scylla-manager-agent:2.6.3"
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:14:59 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Pulled: Container image "docker.io/scylladb/scylla:4.6.3" already present on machine
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:25 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Started: Started container scylla-manager-agent
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:25 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Created: Created container scylla-manager-agent
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:25 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Pulled: Successfully pulled image "docker.io/scylladb/scylla-manager-agent:2.6.3" in 1m25.824883618s
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:25 +0000 UTC - event for nodeconfig-podinfo-5f0cab8e-19f4-4e22-8749-d77b7b7d48ce: {NodeConfigCM-controller } ConfigMapUpdated: ConfigMap e2e-test-scyllacluster-jmx8m-nqlnz/nodeconfig-podinfo-5f0cab8e-19f4-4e22-8749-d77b7b7d48ce updated
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:28 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Unhealthy: Startup probe failed: Get "http://172.17.0.28:8080/healthz": dial tcp 172.17.0.28:8080: connect: connection refused
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:38 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Unhealthy: Readiness probe failed: HTTP probe failed with statuscode: 500
    Sep 14 23:17:49.189: INFO: At 2022-09-14 23:16:49 +0000 UTC - event for basic-x5qk6-us-east-1-us-east-1a-0: {kubelet fv-az626-979} Unhealthy: Readiness probe failed: HTTP probe failed with statuscode: 503
    STEP: Collecting dumps from namespace "e2e-test-scyllacluster-jmx8m-nqlnz". 09/14/22 23:17:49.189
    STEP: Destroying namespace "e2e-test-scyllacluster-jmx8m-nqlnz". 09/14/22 23:17:49.747
    STEP: Waiting for namespace "e2e-test-scyllacluster-jmx8m-nqlnz" to be removed. 09/14/22 23:17:49.769

    Ginkgo: Attempt #1 Failed.  Retrying...
    [BeforeEach] ScyllaCluster
      github.com/scylladb/scylla-operator/test/e2e/framework/framework.go:57
    STEP: Creating a new namespace 09/14/22 23:21:03.449
    Sep 14 23:21:03.482: INFO: Created namespace "e2e-test-scyllacluster-jmx8m-psjsw".
    STEP: Waiting for ServiceAccount "e2e-user" in namespace "e2e-test-scyllacluster-jmx8m-psjsw". 09/14/22 23:21:03.54
    STEP: Waiting for default ServiceAccount in namespace "e2e-test-scyllacluster-jmx8m-psjsw". 09/14/22 23:21:03.58
    [It] should allow to build connection pool using shard aware ports
      github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_shardawareness.go:30
    STEP: Creating a ScyllaCluster 09/14/22 23:21:03.595
    STEP: Waiting for the ScyllaCluster to deploy 09/14/22 23:21:03.617
    Sep 14 23:22:05.660: INFO: ScyllaCluster e2e-test-scyllacluster-jmx8m-psjsw/basic-lxr8t (RV=6963) is rolled out
    STEP: Waiting for the driver to establish connection to shards 09/14/22 23:22:10.[669](https://github.com/scylladb/scylla-operator/actions/runs/3056629503/jobs/4931011388#step:12:670)
    Sep 14 23:22:10.669: INFO: Connecting to 10.101.178.97:9042 using 0 source port
    Sep 14 23:22:10.[671](https://github.com/scylladb/scylla-operator/actions/runs/3056629503/jobs/4931011388#step:12:672): INFO: Connecting to 10.101.178.97:9042 using 0 source port
    Sep 14 23:22:10.[674](https://github.com/scylladb/scylla-operator/actions/runs/3056629503/jobs/4931011388#step:12:675): INFO: Connecting to 10.101.178.97:9042 using 0 source port
    Sep 14 23:22:10.[676](https://github.com/scylladb/scylla-operator/actions/runs/3056629503/jobs/4931011388#step:12:677): INFO: Connecting to 10.101.178.97:19042 using 32768 source port
    [AfterEach] ScyllaCluster
      github.com/scylladb/scylla-operator/test/e2e/framework/framework.go:58
    STEP: Collecting dumps from namespace "e2e-test-scyllacluster-jmx8m-psjsw". 09/14/22 23:22:10.777
    STEP: Destroying namespace "e2e-test-scyllacluster-jmx8m-psjsw". 09/14/22 23:22:11.016
    STEP: Waiting for namespace "e2e-test-scyllacluster-jmx8m-psjsw" to be removed. 09/14/22 23:22:11.026
  << End Captured GinkgoWriter Output
------------------------------
@tnozicka tnozicka added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/flake Categorizes issue or PR as related to a flaky test. labels Sep 15, 2022
@tnozicka tnozicka added this to the v1.8 milestone Sep 15, 2022
@zimnx zimnx self-assigned this Sep 15, 2022
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Previous implementation gave false positives because both on CI and
local environments traffic between test binary and Scylla Pods is NATed.
Because we are low on resources (we use just 2 shards) probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was
established.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 20, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
zimnx added a commit to zimnx/scylla-operator that referenced this issue Sep 26, 2022
Both on CI and on local environments traffic between test binary and
Scylla Pods is NATed. Previous implementation gave false positives
because we are low on resources (we use just 2 shards) so probability of
hitting expected shard was high even for random assignment.
It wasn't spotted because message about wrong shard assignment is
printed out by the driver to stdout and no error is reported anywhere.

Gocql doesn't expose any way of checking if shardaware ports are used
successfully, hence I implemented simple driver within a test, able to
send initial packet and reading to which shard connection was established.

It also check in-cluster connectivity which is how clients are
connecting from, compared to from outside in previous appraoch.

Fixes scylladb#1028
@tnozicka tnozicka modified the milestones: v1.8, v1.9 Oct 6, 2022
@tnozicka tnozicka removed this from the v1.9 milestone Aug 16, 2023
@zimnx
Copy link
Collaborator

zimnx commented Oct 18, 2023

4 shards on 2 core machine might not be good idea, maybe it's better to skip it on not-big-enough nodes

@tnozicka
Copy link
Member Author

skips usually mean bugs slipping :(

@zimnx
Copy link
Collaborator

zimnx commented Oct 19, 2023

Prow CI is big enough, it fails only on GH Actions where we have limited resources.

@tnozicka
Copy link
Member Author

ok, then I'd defer this one to when we finish the migration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants