Leader load balancing stuck on cluster #2366

ndeodhar · 2019-09-18T18:51:50Z

We saw this on a producer cluster during scale-out while testing 2DC. It's unclear whether this can happen on any cluster or only on a 2DC-enabled cluster.

From master logs:

I0918 18:44:27.375653 24181 async_rpc_tasks.cc:873] Prep Leader step down 1, leader_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, change_ts_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:44:27.375659 24181 async_rpc_tasks.cc:902] Stepping down leader 7fbbb21dc0a7410e8f1c4fd27ca06556 for tablet 9db51856aaa740b3a7a15f081e293148
I0918 18:44:27.376495 17304 async_rpc_tasks.cc:922] Leader step down done attempt=1, leader_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, change_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, error=code: NOT_THE_LEADER status { code: ILLEGAL_STATE message: "Not currently leader" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 634 errors: "\000" }, failed=1, should_remove=0 for tablet 9db51856aaa740b3a7a15f081e293148.
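
To see how often the master hits this, the "Leader step down done" lines can be pulled out of the master log and grouped by tablet. The following is a minimal sketch, not an existing tool: the regular expression is inferred only from the single line quoted above, so other versions may format the message differently, and the name `failed_step_downs` is hypothetical.

```python
import re

# Pattern inferred from the one master log line quoted in this issue (assumption).
STEP_DOWN_RE = re.compile(
    r"Leader step down done attempt=(?P<attempt>\d+), "
    r"leader_uuid=(?P<leader>[0-9a-f]+), "
    r"change_uuid=(?P<change>[0-9a-f]+), "
    r"error=code: (?P<code>[A-Z_]+).*?"
    r"failed=(?P<failed>\d+), should_remove=\d+ "
    r"for tablet (?P<tablet>[0-9a-f]+)"
)


def failed_step_downs(lines):
    """Return (tablet, leader_uuid, error_code) for each failed step-down line."""
    results = []
    for line in lines:
        m = STEP_DOWN_RE.search(line)
        # failed=1 marks an attempt the master recorded as unsuccessful.
        if m and m.group("failed") == "1":
            results.append((m.group("tablet"), m.group("leader"), m.group("code")))
    return results
```

Feeding it the master log line by line (for example `failed_step_downs(open(path))`, where `path` is wherever the master INFO log lives) gives a list that can be counted per tablet to check whether the same step-down keeps failing.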

On tablet server:

I0918 18:45:33.423285 24662 raft_consensus.cc:1980] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Pre-election. Granting vote for candidate a0909fefd2c4480d850441eb521cbca5 in term 6687
I0918 18:45:36.310081 17503 raft_consensus.cc:813] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0918 18:45:36.310144 17503 raft_consensus.cc:492] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Fail of leader 7fbbb21dc0a7410e8f1c4fd27ca06556 detected. Triggering leader pre-election, mode=NORMAL_ELECTION
I0918 18:45:36.310163 17503 raft_consensus.cc:2856] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Snoozing failure detection for 3.109s
I0918 18:45:36.310214 17503 raft_consensus.cc:535] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Starting pre-election with config: opid_index: -1 peers { permanent_uuid: "1371e236277b4f43a7d1891a3f834b00" member_type: VOTER last_known_private_addr { host: "172.152.53.203" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "a0909fefd2c4480d850441eb521cbca5" member_type: VOTER last_known_private_addr { host: "172.152.39.249" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "7fbbb21dc0a7410e8f1c4fd27ca06556" member_type: VOTER last_known_private_addr { host: "172.152.21.168" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } }
I0918 18:45:36.310267 17503 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [CANDIDATE]: Term 6687 pre-election: Requesting vote from peer a0909fefd2c4480d850441eb521cbca5
I0918 18:45:36.310305 17503 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [CANDIDATE]: Term 6687 pre-election: Requesting vote from peer 7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:45:36.524571 27706 raft_consensus.cc:1980] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Pre-election. Granting vote for candidate a0909fefd2c4480d850441eb521cbca5 in term 6687
I0918 18:45:36.699296 25468 raft_consensus.cc:2446] T 9db51856aaa740b3a7a15f081e293148 P 1371e236277b4f43a7d1891a3f834b00 [term 13185 LEADER]:  Leader pre-election vote request: Denying vote to candidate a0909fefd2c4480d850441eb521cbca5 for term 13186 because replica is either leader or believes a valid leader to be alive. Time left: 9223198496.233s

and on another tserver:

I0918 18:47:16.137734 25080 raft_consensus.cc:535] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [term 6688 FOLLOWER]: Starting pre-election with config: opid_index: -1 peers { permanent_uuid: "1371e236277b4f43a7d1891a3f834b00" member_type: VOTER last_known_private_addr { host: "172.152.53.203" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "a0909fefd2c4480d850441eb521cbca5" member_type: VOTER last_known_private_addr { host: "172.152.39.249" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "7fbbb21dc0a7410e8f1c4fd27ca06556" member_type: VOTER last_known_private_addr { host: "172.152.21.168" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } }
I0918 18:47:16.137768 25080 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [CANDIDATE]: Term 6689 pre-election: Requesting vote from peer 1371e236277b4f43a7d1891a3f834b00
I0918 18:47:16.137881 25080 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [CANDIDATE]: Term 6689 pre-election: Requesting vote from peer 7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:47:16.482267 25080 raft_consensus.cc:813] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0918 18:47:16.482322 25080 raft_consensus.cc:492] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: Fail of leader 7fbbb21dc0a7410e8f1c4fd27ca06556 detected. Triggering leader pre-election, mode=NORMAL_ELECTION
I0918 18:47:16.482337 25080 raft_consensus.cc:2856] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: Snoozing failure detection for 3.178s
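
In the excerpts above, the master tries to step down 7fbbb21dc0a7410e8f1c4fd27ca06556 for tablet 9db51856aaa740b3a7a15f081e293148, while peer 1371e236277b4f43a7d1891a3f834b00 logs itself as LEADER for that tablet at term 13185. A quick way to cross-check this across tservers is to collect which peers report themselves as LEADER per tablet. This is only a sketch: the pattern is based on the `T <tablet> P <peer> [term N ROLE]` prefix visible in the lines above, and `self_reported_leaders` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Prefix format taken from the raft_consensus.cc lines quoted above (assumption).
RAFT_PREFIX_RE = re.compile(
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<peer>[0-9a-f]{32}) "
    r"\[term (?P<term>\d+) (?P<role>[A-Z]+)\]"
)


def self_reported_leaders(lines):
    """Map tablet id -> {peer uuid: highest term at which that peer logged as LEADER}."""
    leaders = defaultdict(dict)
    for line in lines:
        m = RAFT_PREFIX_RE.search(line)
        if m and m.group("role") == "LEADER":
            tablet, peer, term = m.group("tablet"), m.group("peer"), int(m.group("term"))
            # Keep only the highest term seen for each (tablet, peer) pair.
            leaders[tablet][peer] = max(leaders[tablet].get(peer, -1), term)
    return dict(leaders)
```

Running this over the logs of all tservers and comparing the result with the `leader_uuid` the master uses in its step-down attempts would show whether the two views disagree for a given tablet.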

Attached screenshot showing stuck load balancing state.


rahuldesirazu · 2019-11-20T00:16:51Z

Has this issue been resolved in another commit? @ndeodhar

bmatican · 2020-06-08T23:16:12Z

Closing as we are not able to repro this anymore, as per @rahuldesirazu

ndeodhar added the priority/high (High Priority) and area/cdc (Change Data Capture) labels Sep 18, 2019

ndeodhar added this to the v2.1 milestone Sep 18, 2019

ndeodhar assigned rahuldesirazu Sep 18, 2019

ndeodhar removed the priority/high (High Priority) label Oct 7, 2019

bmatican removed this from the v2.1 milestone Jun 8, 2020

bmatican closed this as completed Jun 8, 2020

