Leader load balancing stuck on cluster #2366

Closed
ndeodhar opened this issue Sep 18, 2019 · 2 comments
Labels
area/cdc Change Data Capture

Comments

@ndeodhar
Contributor

We saw this on a producer cluster during scale-out, while testing 2DC. It's unclear whether this can happen on any cluster or is specific to 2DC-enabled clusters.

From master logs:

I0918 18:44:27.375653 24181 async_rpc_tasks.cc:873] Prep Leader step down 1, leader_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, change_ts_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:44:27.375659 24181 async_rpc_tasks.cc:902] Stepping down leader 7fbbb21dc0a7410e8f1c4fd27ca06556 for tablet 9db51856aaa740b3a7a15f081e293148
I0918 18:44:27.376495 17304 async_rpc_tasks.cc:922] Leader step down done attempt=1, leader_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, change_uuid=7fbbb21dc0a7410e8f1c4fd27ca06556, error=code: NOT_THE_LEADER status { code: ILLEGAL_STATE message: "Not currently leader" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 634 errors: "\000" }, failed=1, should_remove=0 for tablet 9db51856aaa740b3a7a15f081e293148.

On tablet server:

I0918 18:45:33.423285 24662 raft_consensus.cc:1980] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Pre-election. Granting vote for candidate a0909fefd2c4480d850441eb521cbca5 in term 6687
I0918 18:45:36.310081 17503 raft_consensus.cc:813] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0918 18:45:36.310144 17503 raft_consensus.cc:492] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Fail of leader 7fbbb21dc0a7410e8f1c4fd27ca06556 detected. Triggering leader pre-election, mode=NORMAL_ELECTION
I0918 18:45:36.310163 17503 raft_consensus.cc:2856] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Snoozing failure detection for 3.109s
I0918 18:45:36.310214 17503 raft_consensus.cc:535] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Starting pre-election with config: opid_index: -1 peers { permanent_uuid: "1371e236277b4f43a7d1891a3f834b00" member_type: VOTER last_known_private_addr { host: "172.152.53.203" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "a0909fefd2c4480d850441eb521cbca5" member_type: VOTER last_known_private_addr { host: "172.152.39.249" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "7fbbb21dc0a7410e8f1c4fd27ca06556" member_type: VOTER last_known_private_addr { host: "172.152.21.168" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } }
I0918 18:45:36.310267 17503 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [CANDIDATE]: Term 6687 pre-election: Requesting vote from peer a0909fefd2c4480d850441eb521cbca5
I0918 18:45:36.310305 17503 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [CANDIDATE]: Term 6687 pre-election: Requesting vote from peer 7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:45:36.524571 27706 raft_consensus.cc:1980] T b22abd1644034a6ebfef0b9099026743 P 1371e236277b4f43a7d1891a3f834b00 [term 6686 FOLLOWER]: Pre-election. Granting vote for candidate a0909fefd2c4480d850441eb521cbca5 in term 6687
I0918 18:45:36.699296 25468 raft_consensus.cc:2446] T 9db51856aaa740b3a7a15f081e293148 P 1371e236277b4f43a7d1891a3f834b00 [term 13185 LEADER]:  Leader pre-election vote request: Denying vote to candidate a0909fefd2c4480d850441eb521cbca5 for term 13186 because replica is either leader or believes a valid leader to be alive. Time left: 9223198496.233s
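
A side note on the `Time left: 9223198496.233s` value in the last log line above: it is very close to the largest number of seconds a signed 64-bit nanosecond counter can represent. The sketch below only does the arithmetic. That the value comes from a sentinel "maximum" deadline rather than a real expiry time is an assumption, not something this issue confirms, and the variable names are illustrative.

```python
# Compare the logged "Time left" with the int64 nanosecond maximum.
# Assumption (not confirmed by the issue): the server tracks this deadline
# as signed 64-bit nanoseconds, so a "never expires" sentinel would be 2**63 - 1.

INT64_MAX_NS = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600

time_left_s = 9223198496.233          # copied from the log line
max_s = INT64_MAX_NS / 1e9            # about 9223372036.855 seconds

gap_s = max_s - time_left_s           # distance from the int64 maximum
gap_hours = gap_s / 3600
time_left_years = time_left_s / SECONDS_PER_YEAR

print(f"int64 max in seconds: {max_s:.3f}")
print(f"gap to int64 max:     {gap_s:.3f} s (~{gap_hours:.1f} h)")
print(f"time left in years:   {time_left_years:.1f}")
```

If that assumption holds, the gap of roughly two days would correspond to a clock reading subtracted from the sentinel, which would make "Time left" effectively infinite for this replica.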

and on another tserver:

I0918 18:47:16.137734 25080 raft_consensus.cc:535] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [term 6688 FOLLOWER]: Starting pre-election with config: opid_index: -1 peers { permanent_uuid: "1371e236277b4f43a7d1891a3f834b00" member_type: VOTER last_known_private_addr { host: "172.152.53.203" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "a0909fefd2c4480d850441eb521cbca5" member_type: VOTER last_known_private_addr { host: "172.152.39.249" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "7fbbb21dc0a7410e8f1c4fd27ca06556" member_type: VOTER last_known_private_addr { host: "172.152.21.168" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } }
I0918 18:47:16.137768 25080 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [CANDIDATE]: Term 6689 pre-election: Requesting vote from peer 1371e236277b4f43a7d1891a3f834b00
I0918 18:47:16.137881 25080 leader_election.cc:215] T b22abd1644034a6ebfef0b9099026743 P a0909fefd2c4480d850441eb521cbca5 [CANDIDATE]: Term 6689 pre-election: Requesting vote from peer 7fbbb21dc0a7410e8f1c4fd27ca06556
I0918 18:47:16.482267 25080 raft_consensus.cc:813] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: ReportFailDetected: Starting NORMAL_ELECTION...
I0918 18:47:16.482322 25080 raft_consensus.cc:492] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: Fail of leader 7fbbb21dc0a7410e8f1c4fd27ca06556 detected. Triggering leader pre-election, mode=NORMAL_ELECTION
I0918 18:47:16.482337 25080 raft_consensus.cc:2856] T 9db51856aaa740b3a7a15f081e293148 P a0909fefd2c4480d850441eb521cbca5 [term 13193 FOLLOWER]: Snoozing failure detection for 3.178s
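
To compare what each peer believes about each tablet, the tserver lines above can be reduced to (tablet, peer, role, term) tuples. The following is a minimal sketch that assumes the line layout seen in these excerpts (glog prefix, then `T <tablet> P <peer> [term N ROLE]:`). The function names `parse_raft_line` and `summarize` are hypothetical helpers and not part of YugabyteDB.

```python
import re

# Layout assumed from the excerpts in this issue:
#   I0918 18:45:36.699296 25468 raft_consensus.cc:2446] T <tablet> P <peer> [term N ROLE]: msg
# Some lines (e.g. from leader_election.cc) carry only "[CANDIDATE]" with no term.
LINE_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<src>[\w.]+):(?P<line>\d+)\] "
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<peer>[0-9a-f]{32}) "
    r"\[(?:term (?P<term>\d+) )?(?P<role>[A-Z]+)\]:\s*(?P<msg>.*)$"
)


def parse_raft_line(line):
    """Return a dict of fields for a Raft log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["term"] = int(rec["term"]) if rec["term"] is not None else None
    return rec


def summarize(lines):
    """Map (tablet, peer) -> (role, term), keeping the last line that has a term."""
    state = {}
    for line in lines:
        rec = parse_raft_line(line)
        if rec is not None and rec["term"] is not None:
            state[(rec["tablet"], rec["peer"])] = (rec["role"], rec["term"])
    return state
```

Feeding the lines from both tservers into `summarize` gives one entry per (tablet, peer) pair, which makes it easier to compare the role and term each peer reports for tablet 9db51856aaa740b3a7a15f081e293148 against the leader the master tried to step down.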

Attached screenshot showing stuck load balancing state.
[Screenshot: Screen Shot 2019-09-18 at 11 48 59 AM]

@ndeodhar ndeodhar added priority/high High Priority area/cdc Change Data Capture labels Sep 18, 2019
@ndeodhar ndeodhar added this to the v2.1 milestone Sep 18, 2019
@ndeodhar ndeodhar removed the priority/high High Priority label Oct 7, 2019
@rahuldesirazu
Contributor

Has this issue been resolved in another commit? @ndeodhar

@bmatican bmatican removed this from the v2.1 milestone Jun 8, 2020
@bmatican
Contributor

bmatican commented Jun 8, 2020

Closing, as we are no longer able to repro this, per @rahuldesirazu.
