roachtest: replicate/wide failed #86107

Closed
cockroach-teamcity opened this issue Aug 14, 2022 · 14 comments · Fixed by #86844
Labels
branch-master: Failures and bugs on the master branch.
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-kv: KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Aug 14, 2022

roachtest.replicate/wide failed with artifacts on master @ d25cb57ccd9bc643ce9058ebd2057cab36b69ad5:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 34.73.248.131:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-18577

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Aug 14, 2022
@cockroach-teamcity added this to the 22.2 milestone Aug 14, 2022
@blathers-crl bot added the T-kv (KV Team) label Aug 14, 2022
@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ 41db784cb97d2749b162020c2c821979094f87b1:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 34.138.92.252:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ f4042d47fa8062a612c38d4696eb6bee9cee7c21:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: read tcp 172.17.0.3:35282->34.148.230.118:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ b173a16715e71e94115820374da1eb350b3b459d:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 35.227.104.176:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ 5c2c62ecf1bea60c807edc6b4da22d900ad4ae03:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 34.148.82.172:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@irfansharif
Contributor

Looks like a real bug/panic.

I220818 13:50:15.015061 157448 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r36/11:‹/Table/3{5-6}›] 7700  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.033440 157477 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r34/1:‹/Table/3{3-4}›] 7701  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.055561 157513 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r32/1:‹/{NamespaceTab…-Table/32}›] 7702  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.078175 157549 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r45/1:‹/Table/4{4-5}›] 7703  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.091773 155745 kv/kvserver/replicate_queue.go:779 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7704  next replica action: ‹remove dead voter›
I220818 13:50:15.091805 155745 kv/kvserver/replicate_queue.go:1482 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7705  removing dead ‹voter› (n9,s9):7 from store
I220818 13:50:15.092063 155745 kv/kvserver/replica_command.go:2284 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7706  change replicas (add [] remove [(n9,s9):7VOTER_DEMOTING_LEARNER]): existing descriptor r11:‹/Table/{8-11}› [(n1,s1):1, (n5,s5):10, (n3,s3):3, (n6,s6):4, (n2,s2):11, (n4,s4):6, (n9,s9):7, next=12, gen=30]
I220818 13:50:15.105050 157601 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r45/1:‹/Table/4{4-5}›] 7707  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.133465 155745 kv/kvserver/replica_raft.go:347 ⋮ [n1,s1,r11/1:‹/Table/{8-11}›] 7708  proposing ENTER_JOINT(r7 l7) [(n9,s9):7VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1 (n5,s5):10 (n3,s3):3 (n6,s6):4 (n2,s2):11 (n4,s4):6 (n9,s9):7VOTER_DEMOTING_LEARNER] next=12
I220818 13:50:15.134778 155745 kv/kvserver/replica_command.go:2284 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7709  change replicas (add [] remove []): existing descriptor r11:‹/Table/{8-11}› [(n1,s1):1, (n5,s5):10, (n3,s3):3, (n6,s6):4, (n2,s2):11, (n4,s4):6, (n9,s9):7VOTER_DEMOTING_LEARNER, next=12, gen=31]
I220818 13:50:15.138297 157660 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r15/1:‹/Table/1{4-5}›] 7710  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.144192 155745 kv/kvserver/replica_raft.go:347 ⋮ [n1,s1,r11/1:‹/Table/{8-11}›] 7711  proposing LEAVE_JOINT: after=[(n1,s1):1 (n5,s5):10 (n3,s3):3 (n6,s6):4 (n2,s2):11 (n4,s4):6 (n9,s9):7LEARNER] next=12
I220818 13:50:15.145541 155745 kv/kvserver/replica_command.go:2284 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7712  change replicas (add [] remove [(n9,s9):7LEARNER]): existing descriptor r11:‹/Table/{8-11}› [(n1,s1):1, (n5,s5):10, (n3,s3):3, (n6,s6):4, (n2,s2):11, (n4,s4):6, (n9,s9):7LEARNER, next=12, gen=32]
I220818 13:50:15.153647 157697 kv/kvserver/replicate_queue.go:587 ⋮ [n1,replicate,s1,r29/13:‹/Table/2{8-9}›] 7713  repair needed (‹remove dead voter›), enqueuing
I220818 13:50:15.173611 155745 kv/kvserver/replica_raft.go:347 ⋮ [n1,s1,r11/1:‹/Table/{8-11}›] 7714  proposing SIMPLE(r7) [(n9,s9):7LEARNER]: after=[(n1,s1):1 (n5,s5):10 (n3,s3):3 (n6,s6):4 (n2,s2):11 (n4,s4):6] next=12
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715  a panic has occurred!
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +panic: ‹unsupported AllocatorAction: remove dead voter›
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +(1) attached stack trace
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  -- stack trace:
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | runtime.gopanic
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	GOROOT/src/runtime/panic.go:838
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*ReplicateQueueMetrics).trackResultByAllocatorAction
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replicate_queue.go:498
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).processOneChange
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replicate_queue.go:910
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).process
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replicate_queue.go:664
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica.func1
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:980
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:939
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplicasInPurgatory.func1.2
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1229
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:324
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplicasInPurgatory.func1
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1227
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplicasInPurgatory
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1236
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).addToPurgatoryLocked.func2
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1171
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | runtime.goexit
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +  | 	GOROOT/src/runtime/asm_amd64.s:1571
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +Wraps: (2) panic: ‹unsupported AllocatorAction: remove dead voter›
E220818 13:50:15.174793 155745 1@util/log/logcrash/crash_reporting.go:174 ⋮ [n1,replicate,s1,r11/1:‹/Table/{8-11}›] 7715 +Error types: (1) *withstack.withStack (2) *errutil.leafError

@irfansharif
Contributor

Coming from here:

default:
	panic(fmt.Sprintf("unsupported AllocatorAction: %v", action))

Looks like it was recently introduced, in 62b5e8b. Are we missing a case for this enum value in the switch statement?
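
For illustration, here is a minimal, self-contained sketch of how a switch that lacks a case for one enum value ends up in the panicking default branch. Every name in it (`AllocatorAction` as declared here, the `Action*` constants, `metrics`, `track`, `tryTrack`) is a hypothetical stand-in and not the actual code in replicate_queue.go; only the panic message format is taken from the snippet above.

```go
package main

import "fmt"

// AllocatorAction is a hypothetical stand-in for the real enum.
type AllocatorAction int

const (
	ActionAddVoter AllocatorAction = iota
	ActionRemoveVoter
	ActionRemoveDeadVoter // assumed to be added later; track has no case for it
)

// String makes %v print a readable name, as in the logged panic message.
func (a AllocatorAction) String() string {
	switch a {
	case ActionAddVoter:
		return "add voter"
	case ActionRemoveVoter:
		return "remove voter"
	case ActionRemoveDeadVoter:
		return "remove dead voter"
	default:
		return fmt.Sprintf("AllocatorAction(%d)", int(a))
	}
}

// metrics is a hypothetical set of per-action counters.
type metrics struct{ addVoter, removeVoter int }

// track bumps the counter for action and panics on any value without a case.
func (m *metrics) track(action AllocatorAction) {
	switch action {
	case ActionAddVoter:
		m.addVoter++
	case ActionRemoveVoter:
		m.removeVoter++
	default:
		panic(fmt.Sprintf("unsupported AllocatorAction: %v", action))
	}
}

// tryTrack calls track and returns the panic message, or "" if none occurred.
func tryTrack(m *metrics, action AllocatorAction) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	m.track(action)
	return ""
}

func main() {
	m := &metrics{}
	fmt.Printf("%q\n", tryTrack(m, ActionAddVoter))        // prints ""
	fmt.Printf("%q\n", tryTrack(m, ActionRemoveDeadVoter)) // prints "unsupported AllocatorAction: remove dead voter"
}
```

The sketch recovers from the panic only to print it; in the failure reported here nothing recovered, so the node went down and the test saw connection errors.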

@cockroach-teamcity

This comment was marked as duplicate.

@irfansharif
Contributor

In addition to ensuring full enum coverage, since this is only metrics-related, the fallback behavior could be to ignore unknown actions instead of panicking. That would make it less fragile to future enum additions, if any. Go doesn't check switch statements for exhaustiveness at compile time, so there's no way to catch a missing case up front, and a full server crash sounds severe for a metrics gap.
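
A rough sketch of what that could look like, under assumptions: the enum, the `numActions` sentinel, the `queueMetrics` struct, and the `untracked` counter are all hypothetical names made up for this example, not the real kvserver types. The default branch records the unknown action and returns instead of panicking, and a small helper walks every declared value so a unit test can flag a missing case.

```go
package main

import "fmt"

// AllocatorAction is a hypothetical stand-in for the real enum.
type AllocatorAction int

const (
	ActionAddVoter AllocatorAction = iota
	ActionRemoveVoter
	ActionRemoveDeadVoter
	numActions // sentinel: must stay last so loops can cover every value
)

// queueMetrics is a hypothetical set of per-action counters.
type queueMetrics struct {
	addVoter, removeVoter, removeDeadVoter int
	untracked                              int // actions with no dedicated counter
}

// track bumps the counter for action. For an unknown action it bumps
// untracked and returns false rather than panicking.
func (m *queueMetrics) track(action AllocatorAction) bool {
	switch action {
	case ActionAddVoter:
		m.addVoter++
	case ActionRemoveVoter:
		m.removeVoter++
	case ActionRemoveDeadVoter:
		m.removeDeadVoter++
	default:
		m.untracked++
		return false
	}
	return true
}

// untrackedActions lists every declared action that track does not handle.
// A unit test can assert that the result is empty.
func untrackedActions() []AllocatorAction {
	var missing []AllocatorAction
	for a := AllocatorAction(0); a < numActions; a++ {
		if !(&queueMetrics{}).track(a) {
			missing = append(missing, a)
		}
	}
	return missing
}

func main() {
	m := &queueMetrics{}
	fmt.Println(m.track(ActionRemoveDeadVoter)) // prints true
	fmt.Println(m.track(AllocatorAction(99)))   // prints false; no panic
	fmt.Println(len(untrackedActions()))        // prints 0
}
```

The trade-off is that a silently ignored action can hide a metrics gap in production, which is why the sketch pairs the lenient default with a coverage check that would run in tests.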

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ 80c274877a917580af62be6eb0cd48c8c7ae9c08:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 34.138.255.136:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ 003c0360de8b64319b5f0f127b99be91dbdca8a3:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: read tcp 172.17.0.3:35196->104.196.132.65:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.replicate/wide failed with artifacts on master @ 524fd14da3fefcd849f44a835cc5f88f5dbdadcc:

test artifacts and logs in: /artifacts/replicate/wide/run_1
	allocator.go:381,allocator.go:388,allocator.go:459,test_runner.go:896: dial tcp 34.75.72.235:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_ssd=0

@craig bot closed this as completed in #86844 Aug 26, 2022
@craig bot closed this as completed in 82e2a30 Aug 26, 2022