roachtest: tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed #93133

Closed
cockroach-teamcity opened this issue Dec 6, 2022 · 13 comments · Fixed by #93487
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 6, 2022

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ 146556e19f5e4fdc8c3e6a623b280cc33aee4d18:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 2: expected version 1000022.2-10, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_160619.244622188_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-22180

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 6, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 6, 2022
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Dec 6, 2022
@srosenberg
Member

Upgrade starts around 16:11. Migration job for 22.1-20 is in progress at 16:11:41,

I221206 16:11:41.025838 696446 upgrade/upgrademanager/manager.go:302 ⋮ [n1,intExec=‹set-version›,migration-mgr] 363  stepping through 1000022.1-20
I221206 16:11:41.035498 696446 upgrade/upgrademanager/manager.go:593 ⋮ [n1,intExec=‹set-version›,migration-mgr] 364  found existing migration job 820093526208806915 for version 1000022.1-20 in status running, waiting
I221206 16:11:41.035559 696446 upgrade/upgrademanager/manager.go:519 ⋮ [n1,intExec=‹set-version›,migration-mgr] 365  waiting for ‹Upgrade to 1000022.1-20: "update system.statement_diagnostics_requests to support sampling probabilities"›

The job is resumed at 16:13,

I221206 16:13:40.522861 1010804 jobs/adopt.go:245 ⋮ [n1] 369  job 820093526208806915: resuming execution
I221206 16:13:40.537339 1010807 jobs/registry.go:1373 ⋮ [n1] 370  MIGRATION job 820093526208806915: stepping through state running

It fails immediately,

I221206 16:13:40.555024 1010807 upgrade/upgrades/schema_changes.go:86 ⋮ [n1,job=‹MIGRATION id=820093526208806915›,upgrade=1000022.1-20] 375  performing table migration operation create-stmt-diag-reqs-v3-index
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376  job 820093526208806915: running execution encountered retriable error: non-cancelable: running migration for 1000022.1-20: error while validating descriptors during operation create-stmt-diag-reqs-v3-index: expected descriptor doesn't match with found descriptor: ‹StoreColumnNames[1]: "min_execution_latency" != "sampling_probability"›

with the following stack trace,

I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +(1) attached stack trace
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  -- stack trace:
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1444
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:412
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:332
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (2) non-cancelable
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (3) attached stack trace
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  -- stack trace:
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgradejob.resumer.Resume
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgradejob/upgrade_job.go:131
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1413
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1414
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:412
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:332
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (4) running migration for 1000022.1-20
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (5) attached stack trace
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  -- stack trace:
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrades.migrateTable
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrades/schema_changes.go:112
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | [...repeated from below...]
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (6) error while validating descriptors during operation create-stmt-diag-reqs-v3-index
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (7) attached stack trace
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  -- stack trace:
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrades.ensureProtoMessagesAreEqual
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrades/schema_changes.go:175
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrades.hasIndex
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrades/schema_changes.go:297
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrades.migrateTable
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrades/schema_changes.go:110
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrades.sampledStmtDiagReqsMigration
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrades/sampled_stmt_diagnostics_requests.go:86
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade.(*TenantUpgrade).Run
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/tenant_upgrade.go:127
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/upgrade/upgradejob.resumer.Resume
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgradejob/upgrade_job.go:126
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1413
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1414
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:412
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:332
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | runtime.goexit
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | 	GOROOT/src/runtime/asm_amd64.s:1594
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Wraps: (8) expected descriptor doesn't match with found descriptor: ‹StoreColumnNames[1]: "min_execution_latency" != "sampling_probability"›
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | ‹StoreColumnNames[2]: "expires_at" != "min_execution_latency"›
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +  | ‹StoreColumnNames[3]: "sampling_probability" != "expires_at"›
I221206 16:13:40.558339 1010807 jobs/registry.go:1446 ⋮ [n1] 376 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.leafError
E221206 16:13:40.558569 1010807 jobs/adopt.go:417 ⋮ [n1] 377  job 820093526208806915: adoption completed with error non-cancelable: running migration for 1000022.1-20: error while validating descriptors during operation create-stmt-diag-reqs-v3-index: expected descriptor doesn't match with found descriptor: ‹StoreColumnNames[1]: "min_execution_latency" != "sampling_probability"›
E221206 16:13:40.558569 1010807 jobs/adopt.go:417 ⋮ [n1] 377 +‹StoreColumnNames[2]: "expires_at" != "min_execution_latency"›
E221206 16:13:40.558569 1010807 jobs/adopt.go:417 ⋮ [n1] 377 +‹StoreColumnNames[3]: "sampling_probability" != "expires_at"›
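
The three `StoreColumnNames[i]` lines above read like a position-by-position comparison of two stored-column lists that hold the same names in a different order. A minimal Go sketch of such a comparison follows, assuming a hypothetical helper `diffNames` (not the actual `ensureProtoMessagesAreEqual` code) and a placeholder `col0` for index 0, which the log does not show:

```go
package main

import "fmt"

// diffNames compares two name lists index by index and returns one message
// per mismatching position. A missing element is treated as an empty string.
// This is a hypothetical helper, not CockroachDB's validation code.
func diffNames(field string, a, b []string) []string {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	var out []string
	for i := 0; i < n; i++ {
		var x, y string
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x != y {
			out = append(out, fmt.Sprintf("%s[%d]: %q != %q", field, i, x, y))
		}
	}
	return out
}

func main() {
	// "col0" is a placeholder: the log only reports indexes 1 through 3.
	a := []string{"col0", "min_execution_latency", "expires_at", "sampling_probability"}
	b := []string{"col0", "sampling_probability", "min_execution_latency", "expires_at"}
	for _, msg := range diffNames("StoreColumnNames", a, b) {
		fmt.Println(msg)
	}
}
```

Both lists contain the same column names, so a set-based comparison would pass; only an order-sensitive comparison reports a mismatch, which matches the column-ordering explanation given later in the thread.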

@srosenberg
Member

srosenberg commented Dec 7, 2022

The same migration step errors out on n2 and n3.

n2,

I221206 16:13:40.522861 1010804 jobs/adopt.go:245 ⋮ [n1] 369  job 820093526208806915: resuming execution

n3,

I221206 16:11:22.570235 699259 jobs/adopt.go:245 ⋮ [n3,intExec=‹set-version›,migration-mgr] 943  job 820093526208806915: resuming execution

However, n4 doesn't get a chance; why?

I221206 16:11:31.515121 604354 upgrade/upgrademanager/manager.go:593 ⋮ [n4,intExec=‹set-version›,migration-mgr] 258  found existing migration job 820093526208806915 for version 1000022.1-20 in status running, waiting
I221206 16:11:31.515180 604354 upgrade/upgrademanager/manager.go:519 ⋮ [n4,intExec=‹set-version›,migration-mgr] 259  waiting for ‹Upgrade to 1000022.1-20: "update system.statement_diagnostics_requests to support sampling probabilities"›

Note that the timestamps suggest the following job execution order: n3, n1, n2.

  • Why wasn't the job resumed on n4?
  • Why didn't the nodes panic after the failed migration?

No other upgrade progress is made past 22.1-18, and the 5-min timeout expires at 16:16,

16:16:13 test_impl.go:347: test failure #1: (test_impl.go:291).Fatal: 2: expected version 1000022.2-10, got 1000022.1-18
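
The failure message compares two cluster versions. A minimal Go sketch of parsing and ordering such strings follows, assuming the format is `major.minor` with an optional `-internal` suffix; the `version` type and `parseVersion` function are hypothetical and are not the roachtest implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a hypothetical stand-in for a cluster version such as
// "1000022.1-18" (major 1000022, minor 1, internal 18).
type version struct{ major, minor, internal int }

// parseVersion parses "major.minor" or "major.minor-internal".
func parseVersion(s string) (version, error) {
	var v version
	base, internal, hasInternal := strings.Cut(s, "-")
	major, minor, ok := strings.Cut(base, ".")
	if !ok {
		return v, fmt.Errorf("invalid version %q", s)
	}
	var err error
	if v.major, err = strconv.Atoi(major); err != nil {
		return v, err
	}
	if v.minor, err = strconv.Atoi(minor); err != nil {
		return v, err
	}
	if hasInternal {
		if v.internal, err = strconv.Atoi(internal); err != nil {
			return v, err
		}
	}
	return v, nil
}

// less reports whether v sorts before o, comparing field by field.
func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.internal < o.internal
}

func main() {
	want, _ := parseVersion("1000022.2-10")
	got, _ := parseVersion("1000022.1-18")
	// Reports whether the observed version is still behind the expected one.
	fmt.Println(got.less(want))
}
```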

@srosenberg
Member

It appears we have hit an inconsistent state during schema migration. @postamar Could we get some assistance from your team to (dis)qualify this as a bug?

@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ 8165e3974c10e88b6ae11c6255872ea16f3a67e3:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 2: expected version 1000022.2-10, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_160737.027425271_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@srosenberg
Member

Same issue as the previous run. This suggests the cause is a change that was merged into master on Dec 5.

@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ ec095bc2fdbe4e518b076db20e4920fab67222bf:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
(test_impl.go:291).Fatal: 2: expected version 1000022.2-10, got 1000022.1-16
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_163707.177936246_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Same failure on other branches

@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ ec095bc2fdbe4e518b076db20e4920fab67222bf:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 4: expected version 1000022.2-10, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_170041.255377930_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=zfs , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ 24854994805cede37e6845ee2a94e10272b5506b:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 2: expected version 1000022.2-14, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_163521.299982061_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ f2b00e8039af6ea8887ec124dad8daf19da6fbf1:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 1: expected version 1000022.2-14, got 1000022.1-16
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_165541.826903916_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ c050c9b4b57ecf2ceb5d449c31c617fe12c920e0:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 1: expected version 1000022.2-14, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_163750.181226352_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ 942a4d468e9c8ad0ef45a7be33f0a326dfb19fef:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 3: expected version 1000022.2-14, got 1000022.1-18
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_164224.956315175_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=1h40m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 13, 2022
…_idx

The migration used a different column ordering than the descriptor in the
bootstrap schema. The value in the bootstrap schema is the value used to
determine whether the migration succeeded. In general, you can hit this
bug if you upgrade from 22.1->22.2 and then you create the index with the
migration but crash before the index is fully created. In that case, the
code will think that it's the wrong index. This should be rare, but would
be problematic. Now we've made them match.

This change also fixes the roachtest that checks that the system schema
looks correct, so that it also checks what happens when you upgrade from a
previous snapshot. The problem with the test was that it read the strings
before they were assigned.

Fixes cockroachdb#93133

Release note (bug fix): Fixed a rare bug which could cause upgrades from 22.1
to 22.2 to fail if the job coordinator node crashes in the middle of a specific
upgrade migration.
@srosenberg
Member

Why wasn't the job resumed on n4?

The job was in fact going to be retried; this can be seen from the jobs table in the debug.zip. However, due to the exponential backoff, the subsequent retry attempt didn't happen before the test timed out. Furthermore, we don't log the wait duration until the next retry, so the logs don't tell the full story. Also, note that the migration (step) job is retried for up to 24 hours.

Why didn't the nodes panic after the failed migration?

That's because the job is retriable (and non-cancellable), meaning if a migration step fails for any reason, it's going to be retried with exponential backoff until the time limit (currently 24 hours) expires.

@cockroach-teamcity
Member Author

roachtest.tpcc/mixed-headroom/multiple-upgrades/n5cpu16 failed with artifacts on master @ a80652b2e4691ea76ea49e797b1b9e0998e1d61f:

test artifacts and logs in: /artifacts/tpcc/mixed-headroom/multiple-upgrades/n5cpu16/run_1
(test_impl.go:291).Fatal: 1: expected version 22.2, got 22.1-16
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_202424.799324104_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=10m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0


craig bot pushed a commit that referenced this issue Dec 13, 2022
89442: build: delete `vendor` submodule r=knz,dt a=rickystewart

The `vendor` submodule is not really necessary for anything any more.
For the Bazel build, we have mirrored all of our dependencies, so
vendoring provides no additional value. `go` tooling is also generally
happy to point to the module cache for e.g. go-to-definition. Dealing
with the submodule is a pain, so it behooves us to get rid of it.

For `make` builds, the `vendor` directory will be synthesized
automatically. It is now a `gitignore`'d directory. You can also
still `make vendor_rebuild` if you want to force synthesizing the
directory. Tooling can be updated to just not use `-mod=vendor` and the
`go` module cache should be used in its place transparently.

Epic: None
Release note: None

92694: server, ui:  add multitenant login/logout and tenant dropdown r=Santamaura a=Santamaura

ui, server: add multitenant login/logout and tenant dropdown

This patch enables login/logout for all tenants on the cluster
by fanning out the incoming requests to each tenant server.
Multitenant login introduces a new multitenant session cookie
with the format <session>,<tenant_name>,<session2>,<tenant_name2>,
etc. The admin ui displays a dropdown with a list of tenants
the user has successfully logged in to. Selecting a different
tenant sets the tenant cookie to the selected tenant name
and reloads the page. If the cluster is not multitenant, the
dropdown will not display.
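
The cookie layout described above can be sketched in Go as follows. This assumes the value is a flat comma-separated list of session and tenant-name pairs; the function `parseTenantSessions` is hypothetical and is not the server's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTenantSessions splits a value of the form
// "<session>,<tenant_name>,<session2>,<tenant_name2>" into a map from tenant
// name to session. Hypothetical helper; the pair layout is an assumption.
func parseTenantSessions(cookie string) (map[string]string, error) {
	parts := strings.Split(cookie, ",")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("expected session,tenant pairs, got %d fields", len(parts))
	}
	sessions := make(map[string]string, len(parts)/2)
	for i := 0; i < len(parts); i += 2 {
		sessions[parts[i+1]] = parts[i]
	}
	return sessions, nil
}

func main() {
	// The session and tenant names below are made-up example values.
	sessions, err := parseTenantSessions("abc123,system,def456,tenant-a")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sessions["tenant-a"])
}
```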

Release note (ui change): added a top-level dropdown
on the admin ui which lists tenants the user has logged
in to. If not multitenant, the dropdown is not displayed.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-14546

93487: upgrades: fix upgrade to add statement_diagnostics_requests.completed… r=ajwerner a=ajwerner

…_idx

The migration used a different column ordering than the descriptor in the bootstrap schema. The value in the bootstrap schema is the value used to determine whether the migration succeeded. In general, you can hit this bug if you upgrade from 22.1->22.2 and then you create the index with the migration but crash before the index is fully created. In that case, the code will think that it's the wrong index. This should be rare, but would be problematic. Now we've made them match.

This change also augments the roachtest which checks that the system schema looks correct to check on what happens when you upgrade from a previous snapshot. That matters here because the migration in question still exists on master, and is not idempotent. We should have found that, but didn't because we need multiple steps in the upgrade. We can get that pretty cheaply.

Fixes #93133

Release note (bug fix): Fixed a rare bug which could cause upgrades from 22.1 to 22.2 to fail if the job coordinator node crashes in the middle of a specific upgrade migration.

Co-authored-by: Ricky Stewart <rickybstewart@gmail.com>
Co-authored-by: Santamaura <alexsantamaura@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@srosenberg srosenberg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Dec 13, 2022
@craig craig bot closed this as completed in 2a118aa Dec 13, 2022
blathers-crl bot pushed a commit that referenced this issue Dec 13, 2022