schema disagreement error attempting to insert data after the Scylla upgrade #1150

vponomaryov · 2023-01-13T17:44:18Z

Issue description

This issue is a regression.
It is unknown if this issue is a regression.


Impact

User cannot perform some queries.

How frequently does it reproduce?

It reproduced in 2 out of 2 runs.

Installation details

Kernel Version: 5.15.0-1020-gke
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.2.0.20221225.4c0f7ea09893.tar.gz
Operator Image: scylladb/scylla-operator:1.8.0-rc.0
Operator Helm Version: 1.8.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 nodes (n1-standard-8)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1-b)

Test: upgrade-major-scylla-k8s-gke
Test id: 207bdbdc-673c-4c52-ac37-44faddabe464
Test name: scylla-operator/operator-1.8/upgrade/upgrade-major-scylla-k8s-gke
Test config file(s):

kubernetes-scylla-upgrade.yaml

While running a Scylla upgrade from 5.0.5-0.20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee to 5.1.2-0.20221225.4c0f7ea09893 with build-id 4817fe236d57eca203f35b1dbb4bfe43cab72590 on the K8S backend (GKE), we hit the following problem:

Logs with error:

> Executing CQL 'INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')' ... 
> Retrying request after UE. Attempt #0                                                                
> [control connection] Schemas mismatched, trying again                                                
                                                                                                       
... 48 more attempts each 200ms ...                                                                        
                                                                                                       
> [control connection] Schemas mismatched, trying again                                                
> Node 10.108.5.7:9042 is reporting a schema disagreement: {UUID('8af28221-bae0-35a1-bd3c-7bb3a7caf720'): [<DefaultEndPoint: 10.112.2.191:9042>, <DefaultEndPoint: 10.108.5.7:9042>], UUID('6e637294-2c1e-3fc9-a573-a83a5fc50e8f'): [<DefaultEndPoint: 10.112.9.194:9042>]}
> Skipping schema refresh due to lack of schema agreement                                              
> [control connection] Waiting for schema agreement                                                    
> Retrying request after UE. Attempt #1
> [control connection] Schemas mismatched, trying again
> Retrying request after UE. Attempt #2
> Retrying request after UE. Attempt #3
> Retrying request after UE. Attempt #4
> INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a') < t:2023-01-04 17:37:51,821 f:fill_db_data.py l:3255 c:sdcm.fill_db_data    p:ERROR > INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')
> Traceback (most recent call last):                                                                   
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3252, in _run_db_queries       
>     res = session.execute(item['queries'][i])                                                        
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1552, in execute_verbose       
>     return execute_orig(*args, **kwargs)                                                             
>   File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute                       
>   File "cassandra/cluster.py", line 5006, in cassandra.cluster.ResponseFuture.result                 
> cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl QUORUM. Requires 1, alive 0" info={'consistency': 'QUORUM', 'required_replicas': 1, 'alive_replicas': 0}

We run lots of commands, but the same one failed in the same place in 2 different test runs.

The second test run used Scylla Enterprise, upgrading from 2021.1.17-0.20221221.5318a7fec with build-id d4378bd13d179b4bbcde7bdc82b92d8cc71c52d8 to 2022.1.3-0.20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d.
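
For readers following the driver output above: the disagreement report is a map from schema version to the endpoints reporting it. A minimal sketch of the same check, assuming the versions have already been collected per host (for example from the `schema_version` column of `system.local` and `system.peers`), could look like this. The function names are illustrative and are not part of SCT or the driver.

```python
from collections import defaultdict


def group_by_schema_version(versions_by_host):
    """Group hosts by the schema version each one reports."""
    groups = defaultdict(list)
    for host, version in versions_by_host.items():
        groups[version].append(host)
    return dict(groups)


def schema_agreed(versions_by_host):
    """True only when every host reports the same schema version."""
    return len(set(versions_by_host.values())) == 1


# Values copied from the driver's disagreement report in the log above.
reported = {
    "10.112.2.191": "8af28221-bae0-35a1-bd3c-7bb3a7caf720",
    "10.108.5.7": "8af28221-bae0-35a1-bd3c-7bb3a7caf720",
    "10.112.9.194": "6e637294-2c1e-3fc9-a573-a83a5fc50e8f",
}

print("agreed:", schema_agreed(reported))
for version, hosts in group_by_schema_version(reported).items():
    print(version, sorted(hosts))
```

With the values from this run, one endpoint is alone on the older version, which matches what the driver logged.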


Restore Monitor Stack command: $ hydra investigate show-monitor 207bdbdc-673c-4c52-ac37-44faddabe464
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 207bdbdc-673c-4c52-ac37-44faddabe464

Logs:

db-cluster-207bdbdc.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/207bdbdc-673c-4c52-ac37-44faddabe464/20230104_175036/db-cluster-207bdbdc.tar.gz
sct-runner-207bdbdc.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/207bdbdc-673c-4c52-ac37-44faddabe464/20230104_175036/sct-runner-207bdbdc.tar.gz
monitor-set-207bdbdc.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/207bdbdc-673c-4c52-ac37-44faddabe464/20230104_175036/monitor-set-207bdbdc.tar.gz
loader-set-207bdbdc.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/207bdbdc-673c-4c52-ac37-44faddabe464/20230104_175036/loader-set-207bdbdc.tar.gz
kubernetes-207bdbdc.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/207bdbdc-673c-4c52-ac37-44faddabe464/20230104_175036/kubernetes-207bdbdc.tar.gz

Jenkins job URL


fruch · 2023-01-18T09:01:30Z

node-1 (the one being upgraded)

INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)

node-2 updates the schema:

INFO  2023-01-04 17:37:31,453 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

node-1 notices the other nodes only later and pulls the new schema from them about 2 minutes after it started listening for CQL:

INFO  2023-01-04 17:38:26,725 [shard 0] gossip - InetAddress 10.112.2.191 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,726 [shard 0] gossip - InetAddress 10.112.8.121 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,727 [shard 0] storage_service - Node 10.112.2.191 state jump to normal
INFO  2023-01-04 17:38:26,731 [shard 0] storage_service - Node 10.112.8.121 state jump to normal
...
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.8.121:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.8.121:0
INFO  2023-01-04 17:39:26,833 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test id=6e04e400-8c50-11ed-8fbc-394aebb27b6e version=9373d136-8b14-33a9-9d8b-191e567e7e6b
INFO  2023-01-04 17:39:26,834 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test_scylla_cdc_log id=6e04e402-8c50-11ed-8fbc-394aebb27b6e version=e1098738-72f1-347f-805c-454472f91653
...
INFO  2023-01-04 17:39:26,862 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720
INFO  2023-01-04 17:39:26,863 [shard 0] migration_manager - Schema merge with 10.112.2.191:0 completed
INFO  2023-01-04 17:39:27,078 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

@vponomaryov, I think this might be a k8s-related issue, and we'll need @scylladb/team-operator to take a closer look here.
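
To put numbers on the timeline in this comment, here is a small sketch that computes the gap between node-2's schema change and node-1 reaching the same version, and checks whether the failed INSERT from the original report falls inside it. All three timestamps are copied from the logs in this thread. It assumes the node logs and the test runner log share the same clock and timezone, which the thread does not confirm.

```python
from datetime import datetime

# Timestamp layout used by both the Scylla and the SCT log lines quoted here.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS,mmm' log timestamp."""
    return datetime.strptime(text, LOG_TS_FORMAT)


node2_schema_changed = parse_ts("2023-01-04 17:37:31,453")  # node-2 log
node1_schema_changed = parse_ts("2023-01-04 17:39:26,862")  # node-1 log
insert_failed = parse_ts("2023-01-04 17:37:51,821")         # fill_db_data.py error

window_seconds = (node1_schema_changed - node2_schema_changed).total_seconds()
insert_in_window = node2_schema_changed <= insert_failed <= node1_schema_changed

print(f"disagreement window: {window_seconds:.3f} s")
print(f"INSERT failed inside the window: {insert_in_window}")
```

Under that assumption, the failure lands roughly 20 s into a window of about 115 s.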

vponomaryov · 2023-01-18T10:32:31Z

@fruch
Since https://github.com/orgs/scylladb/teams/team-operator doesn't have members yet, we need to mention people explicitly:
@tnozicka , @zimnx , @rzetelskik
Please take a look.

DoronArazii · 2023-01-18T12:12:07Z

@fruch why is it marked as master/triage?

fruch · 2023-01-18T15:22:57Z

> @fruch why is it marked as master/triage?

It was a suspected core issue, seems like it's not the case.

zimnx · 2023-01-19T10:39:56Z

> It was a suspected core issue, seems like it's not the case.

Why do you think it's k8s related?

What's the condition you wait for before you issue an insert?

fruch · 2023-01-19T12:04:12Z

> > It was a suspected core issue, seems like it's not the case.

> Why do you think it's k8s related?

> What's the condition you wait for before you issue an insert?

We are waiting like this:

    def wait_till_scylla_is_upgraded_on_all_nodes(self, target_version: str) -> None:
        def _is_cluster_upgraded() -> bool:
            for node in self.db_cluster.nodes:
                node.forget_scylla_version()
                if node.scylla_version != target_version or not node.db_up:
                    return False
            return True
        wait.wait_for(
            func=_is_cluster_upgraded,
            step=30,
            text="Waiting until all nodes in the cluster are upgraded",
            timeout=900,
            throw_exc=True,
        )

That is, we check that the version is what we expect and that the CQL port is open.

What else should we wait for before using the cluster?

zimnx · 2023-01-19T12:17:15Z

In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

Not keeping quorum throughout rollouts is a known issue on k8s - #1077
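
The condition check suggested above can be sketched as a small predicate over the ScyllaCluster object. This is a minimal sketch, not SCT or operator code. It assumes the object has already been fetched as a plain dict (for example with the Kubernetes Python client's `CustomObjectsApi.get_namespaced_custom_object`) and that `status.conditions` follows the usual Kubernetes layout of `type`/`status` string pairs. The function name and the sample objects are hypothetical.

```python
# Condition values named in the comment above.
EXPECTED_CONDITIONS = {
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
}


def is_rollout_finished(scylla_cluster):
    """Return True when every expected status condition has the expected value.

    A missing condition counts as "not finished", so a freshly created or
    not-yet-reconciled object never passes by accident.
    """
    conditions = scylla_cluster.get("status", {}).get("conditions", [])
    actual = {c.get("type"): c.get("status") for c in conditions}
    return all(actual.get(ctype) == value for ctype, value in EXPECTED_CONDITIONS.items())


# Hypothetical objects, trimmed to the fields the check reads.
mid_rollout = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "True"},
            {"type": "Degraded", "status": "False"},
        ]
    }
}
settled = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "False"},
            {"type": "Degraded", "status": "False"},
        ]
    }
}

print(is_rollout_finished(mid_rollout))
print(is_rollout_finished(settled))
```

A predicate like this could be polled alongside the existing version and `db_up` checks before the test issues its first statement.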

fruch · 2023-01-22T09:14:15Z

> In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

We will look at checking this status as well.

> Not keeping quorum throughout rollouts is a known issue on k8s - scylladb/scylla-operator#1077

@mykaul if it's agreed that it's an operator issue, can you help us move it there?

@zimnx it seems there are some strong arguments around the suggested solution for #1077; is it still moving forward?

rzetelskik · 2023-01-23T12:51:19Z

@fruch #1077 is waiting for input and reviews from the rest of the team in #1108.

scylla-operator-bot · 2024-06-19T13:16:21Z

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out

/lifecycle stale

scylla-operator-bot · 2024-07-23T10:45:00Z

/lifecycle stale

scylla-operator-bot · 2024-08-23T10:41:38Z

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out

/lifecycle rotten

scylla-operator-bot · 2024-09-22T10:47:16Z

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out

/close not-planned

scylla-operator-bot · 2024-09-22T10:47:19Z

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".


mykaul transferred this issue from scylladb/scylladb Jan 22, 2023

scylla-operator-bot added the lifecycle/stale label Jul 23, 2024

scylla-operator-bot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 23, 2024

scylla-operator-bot closed this as not planned Sep 22, 2024
