schema disagreement error attempting to insert data after the Scylla upgrade #1150

Closed
1 of 2 tasks
vponomaryov opened this issue Jan 13, 2023 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vponomaryov
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

User cannot perform some queries.

How frequently does it reproduce?

It was reproduced 2 times out of 2.

Installation details

Kernel Version: 5.15.0-1020-gke
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.2.0.20221225.4c0f7ea09893.tar.gz
Operator Image: scylladb/scylla-operator:1.8.0-rc.0
Operator Helm Version: 1.8.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 nodes (n1-standard-8)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1-b)

Test: upgrade-major-scylla-k8s-gke
Test id: 207bdbdc-673c-4c52-ac37-44faddabe464
Test name: scylla-operator/operator-1.8/upgrade/upgrade-major-scylla-k8s-gke
Test config file(s):

<details>
<summary>

While running a Scylla upgrade from 5.0.5-0.20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee to 5.1.2-0.20221225.4c0f7ea09893 with build-id 4817fe236d57eca203f35b1dbb4bfe43cab72590 on the K8S backend (GKE), we hit the following problem:

Logs with error:

> Executing CQL 'INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')' ... 
> Retrying request after UE. Attempt #0                                                                
> [control connection] Schemas mismatched, trying again                                                
                                                                                                       
... 48 more attempts each 200ms ...                                                                        
                                                                                                       
> [control connection] Schemas mismatched, trying again                                                
> Node 10.108.5.7:9042 is reporting a schema disagreement: {UUID('8af28221-bae0-35a1-bd3c-7bb3a7caf720'): [<DefaultEndPoint: 10.112.2.191:9042>, <DefaultEndPoint: 10.108.5.7:9042>], UUID('6e637294-2c1e-3fc9-a573-a83a5fc50e8f'): [<DefaultEndPoint: 10.112.9.194:9042>]}
> Skipping schema refresh due to lack of schema agreement                                              
> [control connection] Waiting for schema agreement                                                    
> Retrying request after UE. Attempt #1
> [control connection] Schemas mismatched, trying again
> Retrying request after UE. Attempt #2
> Retrying request after UE. Attempt #3
> Retrying request after UE. Attempt #4
> INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a') < t:2023-01-04 17:37:51,821 f:fill_db_data.py l:3255 c:sdcm.fill_db_data    p:ERROR > INSERT INTO ks_no_range_ghost_test.users (KEY, password) VALUES ('user1', 'ch@ngem3a')
> Traceback (most recent call last):                                                                   
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3252, in _run_db_queries       
>     res = session.execute(item['queries'][i])                                                        
>   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1552, in execute_verbose       
>     return execute_orig(*args, **kwargs)                                                             
>   File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute                       
>   File "cassandra/cluster.py", line 5006, in cassandra.cluster.ResponseFuture.result                 
> cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl QUORUM. Requires 1, alive 0" info={'consistency': 'QUORUM', 'required_replicas': 1, 'alive_replicas': 0}

We run many commands, but the same one failed at the same place in 2 different test runs.

The second test run used enterprise Scylla, upgrading from 2021.1.17-0.20221221.5318a7fec with build-id d4378bd13d179b4bbcde7bdc82b92d8cc71c52d8 to 2022.1.3-0.20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d.
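
The `Requires 1, alive 0` detail in the Unavailable error can be read against the usual quorum formula, floor(RF / 2) + 1. The sketch below is only an illustration: the issue does not state the replication factor of `ks_no_range_ghost_test`, so RF=1 is an assumption (it is the only RF for which QUORUM needs a single replica), and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for CL=QUORUM: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def can_achieve_quorum(replication_factor: int, alive_replicas: int) -> bool:
    """True if enough replicas are seen as alive by the coordinator."""
    return alive_replicas >= quorum(replication_factor)


# Assumed RF=1 (not stated in the issue): QUORUM needs exactly one replica,
# so a single replica seen as down is enough to fail the write.
for rf in (1, 2, 3):
    print(f"RF={rf}: QUORUM requires {quorum(rf)}")
print("RF=1, alive=0 ->", can_achieve_quorum(1, 0))
```

Under that assumption, the coordinator only has to consider one replica down for the INSERT to fail, which would match `required_replicas: 1, alive_replicas: 0`.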

</summary>

  • Restore Monitor Stack command: $ hydra investigate show-monitor 207bdbdc-673c-4c52-ac37-44faddabe464
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 207bdbdc-673c-4c52-ac37-44faddabe464

Logs:

Jenkins job URL
</details>
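
The schema disagreement report in the log above groups nodes by the schema version they advertise. A minimal sketch of that grouping follows, using the addresses and UUIDs from the log as sample input. The helper names are hypothetical; in a real check the versions would have to be collected first, for example from the `schema_version` column of `system.local` and `system.peers`, which is not shown here.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_schema_version(versions_by_node: Dict[str, str]) -> Dict[str, List[str]]:
    """Group node addresses by the schema version each node reports."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, version in versions_by_node.items():
        groups[version].append(node)
    return {version: sorted(nodes) for version, nodes in groups.items()}


def in_schema_agreement(versions_by_node: Dict[str, str]) -> bool:
    """The cluster agrees when all nodes report exactly one common version."""
    return len(set(versions_by_node.values())) == 1


# Sample input copied from the disagreement message in the log.
reported = {
    "10.112.2.191": "8af28221-bae0-35a1-bd3c-7bb3a7caf720",
    "10.108.5.7": "8af28221-bae0-35a1-bd3c-7bb3a7caf720",
    "10.112.9.194": "6e637294-2c1e-3fc9-a573-a83a5fc50e8f",
}

for version, nodes in group_by_schema_version(reported).items():
    print(version, nodes)
print("agreement:", in_schema_agreement(reported))
```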

@fruch

fruch commented Jan 18, 2023

node-1 (the one being upgraded)

INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2023-01-04 17:37:22,424 [shard 0] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)

node-2 updates the schema:

INFO  2023-01-04 17:37:31,453 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

node-1 notices the other nodes about 1 minute after it starts listening for CQL, and pulls the new schema from them about 2 minutes after:

INFO  2023-01-04 17:38:26,725 [shard 0] gossip - InetAddress 10.112.2.191 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,726 [shard 0] gossip - InetAddress 10.112.8.121 is now UP, status = NORMAL
INFO  2023-01-04 17:38:26,727 [shard 0] storage_service - Node 10.112.2.191 state jump to normal
INFO  2023-01-04 17:38:26,731 [shard 0] storage_service - Node 10.112.8.121 state jump to normal
...
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.2.191:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Requesting schema pull from 10.112.8.121:0
INFO  2023-01-04 17:39:26,726 [shard 0] migration_manager - Pulling schema from 10.112.8.121:0
INFO  2023-01-04 17:39:26,833 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test id=6e04e400-8c50-11ed-8fbc-394aebb27b6e version=9373d136-8b14-33a9-9d8b-191e567e7e6b
INFO  2023-01-04 17:39:26,834 [shard 0] schema_tables - Altering keyspace_fill_db_data.table_options_test_scylla_cdc_log id=6e04e402-8c50-11ed-8fbc-394aebb27b6e version=e1098738-72f1-347f-805c-454472f91653
...
INFO  2023-01-04 17:39:26,862 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720
INFO  2023-01-04 17:39:26,863 [shard 0] migration_manager - Schema merge with 10.112.2.191:0 completed
INFO  2023-01-04 17:39:27,078 [shard 0] schema_tables - Schema version changed to 8af28221-bae0-35a1-bd3c-7bb3a7caf720

@vponomaryov, I think this might be a k8s-related issue, and we'll need @scylladb/team-operator to take a closer look here.
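
The timestamps quoted in this thread can be lined up to see how long node-1 served CQL before it saw its peers. This is a rough sketch that assumes the test-runner log and the node logs share the same clock and time zone, which the issue does not confirm. The event labels and helper name are illustrative only.

```python
from datetime import datetime

# Timestamp layout used in the quoted logs, e.g. "2023-01-04 17:37:22,424".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TS_FORMAT) - datetime.strptime(start, LOG_TS_FORMAT)
    return delta.total_seconds()


cql_listening = "2023-01-04 17:37:22,424"  # node-1 starts listening for CQL
events = [
    ("INSERT fails with Unavailable", "2023-01-04 17:37:51,821"),
    ("node-1 marks peers UP", "2023-01-04 17:38:26,725"),
    ("node-1 requests schema pull", "2023-01-04 17:39:26,726"),
]

for label, timestamp in events:
    print(f"{label}: +{seconds_between(cql_listening, timestamp):.3f}s")
```

If the clocks are comparable, the failed INSERT falls inside the window in which node-1 already accepted CQL connections but had not yet marked the other nodes as UP.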

@vponomaryov
Contributor Author

@fruch
Since https://github.com/orgs/scylladb/teams/team-operator doesn't have members yet, I need to mention people explicitly:
@tnozicka , @zimnx , @rzetelskik
Please take a look.

@DoronArazii

@fruch why is it marked as master/triage?

@fruch

fruch commented Jan 18, 2023

> @fruch why is it marked as master/triage?

It was a suspected core issue, seems like it's not the case.

@zimnx
Collaborator

zimnx commented Jan 19, 2023

> It was a suspected core issue, seems like it's not the case.

Why do you think it's k8s related?

What's the condition you wait for before you issue an insert?

@fruch

fruch commented Jan 19, 2023

> > It was a suspected core issue, seems like it's not the case.
>
> Why do you think it's k8s related?
>
> What's the condition you wait for before you issue an insert?

We are waiting like this:

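    def wait_till_scylla_is_upgraded_on_all_nodes(self, target_version: str) -> None:
        """Block until every node reports target_version and its db_up check passes."""

        def _is_cluster_upgraded() -> bool:
            for node in self.db_cluster.nodes:
                # Drop the cached version so it is read from the node again.
                node.forget_scylla_version()
                if node.scylla_version != target_version or not node.db_up:
                    return False
            return True

        # Poll every 30 seconds for up to 900 seconds; raise if the timeout expires.
        wait.wait_for(
            func=_is_cluster_upgraded,
            step=30,
            text="Waiting until all nodes in the cluster are upgraded",
            timeout=900,
            throw_exc=True,
        )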

That is, we wait until the version is what we expect and the CQL port is open.

What else do we need to wait for before using the cluster?

@zimnx
Collaborator

zimnx commented Jan 19, 2023

In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

Not keeping quorum throughout rollouts is a known issue on k8s - #1077
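
A test could check the suggested conditions along these lines. This is a sketch under stated assumptions, not the operator's own API: it assumes the ScyllaCluster object has already been fetched and parsed into a dict (for example from `kubectl get scyllacluster <name> -o json`) and that its conditions follow the standard Kubernetes layout, with `type` and `status` fields under `status.conditions`. The function name is hypothetical.

```python
from typing import Any, Mapping

# Condition values suggested in the comment above.
EXPECTED_CONDITIONS = {
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
}


def is_rollout_finished(scylla_cluster: Mapping[str, Any]) -> bool:
    """True when all expected conditions are present with the expected status."""
    conditions = scylla_cluster.get("status", {}).get("conditions", [])
    actual = {c.get("type"): c.get("status") for c in conditions}
    return all(
        actual.get(cond_type) == status
        for cond_type, status in EXPECTED_CONDITIONS.items()
    )


# Hand-written sample object, not output captured from a real cluster.
sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "True"},
            {"type": "Degraded", "status": "False"},
        ]
    }
}
print(is_rollout_finished(sample))  # Progressing is still "True"
```

A missing condition is treated as "not finished", so a cluster whose status has not been populated yet does not pass the check by accident.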

@fruch

fruch commented Jan 22, 2023

> In my view, you should look at ScyllaCluster.Status.Conditions - Available=True,Progressing=False,Degraded=False.

We will look at checking this status as well.

> Not keeping quorum throughout rollouts is a known issue on k8s - scylladb/scylla-operator#1077

@mykaul, if it's agreed that it's an operator issue, can you help us move it there?

@zimnx, it seems there are some strong arguments about the suggested solution for #1077; is it still moving forward?

@mykaul transferred this issue from scylladb/scylladb Jan 22, 2023
@rzetelskik
Member

rzetelskik commented Jan 23, 2023

@fruch #1077 is waiting for the input and reviews from the rest of the team in #1108

@scylla-operator-bot
Collaborator

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale


@scylla-operator-bot added the lifecycle/stale label Jul 23, 2024
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 23, 2024
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

Contributor

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".


@scylla-operator-bot closed this as not planned Sep 22, 2024